Commit bdd7e07
feat: add backing store for disk buffering of events
1 parent 40e7a0c commit bdd7e07
11 files changed: +2381 −72 lines changed

audittools/README.md (251 additions, 0 deletions)

@@ -0,0 +1,251 @@
<!-- SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company
SPDX-License-Identifier: Apache-2.0
-->

# audittools
`audittools` provides a standard interface for generating and sending CADF (Cloud Auditing Data Federation) audit events to a RabbitMQ message broker.

## Certification Requirements (PCI DSS, SOC 2, and more)

As a cloud provider subject to strict audits (including PCI DSS and SOC 2), we must ensure the **completeness** and **integrity** of audit logs while maintaining service **availability**.

### Standard Production Configuration

**You MUST configure a persistent backing store (SQL or File-Based with PVC).**
* **Option 1 - SQL/Database Backing Store (Recommended)**:
  * Set `BackingStoreConfig` to use an existing PostgreSQL database
  * **Advantages**: No volume management, leverages existing database infrastructure
  * **Use Case**: Services that already have a database connection (most SAP services)

* **Option 2 - File-Based with PVC**:
  * Set `BackingStoreConfig` to a file-based backing store with a directory mounted from a PVC
  * **Use Case**: Services without database access but with volume support

* **Requirement**: This ensures that audit events are preserved even in double-failure scenarios (RabbitMQ outage + Pod crash/reschedule).
* **Compliance**: Satisfies requirements for guaranteed event delivery and audit trail completeness.
### Non-Compliant Configurations

The following configurations are available for development or specific edge cases but are **NOT** recommended for production services subject to audit:

1. **File-Based with Ephemeral Storage (emptyDir)**:
   * *Risk*: Data loss if the Pod is rescheduled during a RabbitMQ outage.
   * *Status*: **Development / Testing Only**.

2. **In-Memory Backing Store**:
   * *Behavior*: Events are buffered in memory only. Data loss occurs if the Pod crashes during a RabbitMQ outage.
   * *Use Case*: Services without persistent volumes that prefer limited buffering over service downtime.
   * *Status*: **Development / Non-Compliant Environments Only**.
## Usage

### Basic Setup

To use `audittools`, you typically initialize an `Auditor` with your RabbitMQ connection details.
```go
import (
	"log"

	"github.com/sapcc/go-bits/audittools"
)

func main() {
	// ...
	auditor, err := audittools.NewAuditor(audittools.AuditorOpts{
		EnvPrefix: "MYSERVICE_AUDIT", // configures env vars like MYSERVICE_AUDIT_RABBITMQ_URL
	})
	if err != nil {
		log.Fatal(err)
	}
	// ...
}
```
### Sending Events

```go
import "github.com/sapcc/go-bits/cadf"

event := cadf.Event{
	// ... fill in event details ...
}
auditor.Record(event)
```
## Event Buffering with Backing Stores

`audittools` includes pluggable backing stores to ensure audit events are not lost if the RabbitMQ broker becomes unavailable. Events are temporarily buffered and replayed once the connection is restored.

### Backing Store Types

The backing store is configured via JSON and supports multiple implementations:
1. **SQL/Database Backing Store** (`type: "sql"`):
   * Persists events to a PostgreSQL database table
   * Survives pod restarts and database restarts
   * **Recommended for services that already have a database connection**
   * No filesystem volume management required
   * Leverages existing database infrastructure

2. **File-Based Backing Store** (`type: "file"`):
   * Persists events to local filesystem files
   * Survives pod restarts when using persistent volumes
   * Recommended for production services without existing database connections

3. **In-Memory Backing Store** (`type: "memory"`):
   * Buffers events in process memory
   * Does not survive pod restarts
   * Suitable for development or services without persistent volumes
### Configuration

#### Programmatic Configuration

**SQL/Database Backing Store**:

```go
auditor, err := audittools.NewAuditor(audittools.AuditorOpts{
	EnvPrefix:          "MYSERVICE_AUDIT",
	BackingStoreConfig: `{"type":"sql","params":{"dsn":"postgres://user:pass@localhost/mydb?sslmode=require","max_events":10000}}`,
})
```
**File-Based Backing Store**:

```go
auditor, err := audittools.NewAuditor(audittools.AuditorOpts{
	EnvPrefix:          "MYSERVICE_AUDIT",
	BackingStoreConfig: `{"type":"file","params":{"directory":"/var/lib/myservice/audit-buffer","max_total_size":1073741824}}`,
})
```
**In-Memory Backing Store**:

```go
auditor, err := audittools.NewAuditor(audittools.AuditorOpts{
	EnvPrefix:          "MYSERVICE_AUDIT",
	BackingStoreConfig: `{"type":"memory","params":{"max_events":1000}}`,
})
```
#### Environment Variable Configuration

* `MYSERVICE_AUDIT_BACKING_STORE`: JSON configuration string

Examples:

* SQL/Database: `{"type":"sql","params":{"dsn":"postgres://user:pass@localhost/mydb","max_events":10000}}`
* File-based: `{"type":"file","params":{"directory":"/var/cache/audit","max_file_size":10485760,"max_total_size":1073741824}}`
* In-memory: `{"type":"memory","params":{"max_events":1000}}`

If no `BackingStoreConfig` is provided, a default in-memory backing store with a capacity of 1000 events is used.
#### SQL/Database Parameters

* `dsn` (required): PostgreSQL connection string (e.g., `postgres://user:pass@host:5432/dbname?sslmode=require`)
* `table_name` (optional): Table name for storing events (default: `audit_events`)
* `batch_size` (optional): Number of events to read per batch (default: 100)
* `max_events` (optional): Maximum total events to buffer (default: 10000)
* `driver_name` (optional): SQL driver name (default: `postgres`)
* `skip_migration` (optional): Skip automatic table creation (default: false)

**Database Setup**: The backing store will automatically create the required table unless `skip_migration` is true. For manual migration, see [`backing_store_sql_migration.sql`](backing_store_sql_migration.sql).
#### File-Based Parameters

* `directory` (required): Directory to store buffered event files
* `max_file_size` (optional): Maximum size per file in bytes (default: 10 MB)
* `max_total_size` (optional): Maximum total size of all files in bytes (no limit if not set)
#### In-Memory Parameters

* `max_events` (optional): Maximum number of events to buffer (default: 1000)
### Kubernetes Deployment

If running in Kubernetes, you have several options for configuring the backing store:

1. **SQL/Database Backing Store (Recommended for most SAP services)**:
   * Connect to an existing PostgreSQL database (e.g., the service's main database).
   * **Pros**: Data survives Pod deletion and rescheduling. No volume management. Leverages existing database infrastructure.
   * **Cons**: Requires database access and table creation privileges.
   * **Use Case**: **Recommended** for services that already have a database connection. Ideal for audit compliance without volume management overhead.
   * **Configuration**: `{"type":"sql","params":{"dsn":"${DATABASE_URL}","max_events":10000}}`

2. **File-Based with Persistent Storage (PVC)**:
   * Mount a Persistent Volume Claim (PVC) and configure a file-based backing store pointing to that mount.
   * **Pros**: Data survives Pod deletion, rescheduling, and rolling updates. No database required.
   * **Cons**: Adds complexity (volume management, access modes, storage provisioning).
   * **Use Case**: Services without database access but with volume support.
   * **Configuration**: `{"type":"file","params":{"directory":"/mnt/pvc/audit-buffer","max_total_size":1073741824}}`

3. **File-Based with Ephemeral Storage (emptyDir)**:
   * Mount an `emptyDir` volume and configure a file-based backing store.
   * **Pros**: Simple, fast, no persistent volume management. Data survives container restarts within the same Pod.
   * **Cons**: Data is lost if the Pod is deleted or rescheduled.
   * **Use Case**: Suitable for non-critical environments or where occasional data loss during complex failure scenarios is acceptable.
   * **Configuration**: `{"type":"file","params":{"directory":"/tmp/audit-buffer"}}`

4. **In-Memory Backing Store**:
   * No volume mount or database required.
   * **Pros**: Simplest configuration, no storage management overhead.
   * **Cons**: Data is lost on any Pod restart or crash. Limited buffer capacity.
   * **Use Case**: Development environments or services that prefer limited buffering over any storage complexity.
   * **Configuration**: `{"type":"memory","params":{"max_events":1000}}` (or omit the config entirely for the default)
### Behavior

The system transitions through the following states to ensure zero data loss:

1. **Normal Operation**: Events are sent directly to RabbitMQ.
2. **RabbitMQ Outage**: Events are written to the backing store (SQL, file, or memory). The application continues without blocking.
3. **Backing Store Full**: If the backing store reaches its capacity limit, writes fail and `auditor.Record()` **blocks**. This pauses the application to prevent data loss.
   * SQL: Controlled by the `max_events` parameter
   * File-based: Controlled by the `max_total_size` parameter
   * In-memory: Controlled by the `max_events` parameter
4. **Recovery**: A background routine continuously drains the backing store to RabbitMQ once it becomes available. New events are buffered during draining to prevent blocking.
   * **Note**: Strict chronological ordering is not guaranteed during recovery. New events are sent immediately if the connection is up, while old events from the backing store are drained asynchronously.
**Additional Details**:

* **Security (File-Based Only)**: The directory is created with `0700` permissions, and files with `0600`, ensuring only the service user can access the sensitive audit data.
* **Capacity**:
  * File-based: The `max_total_size` limit is approximate and may be exceeded by up to one event's size (typically a few KB) due to the check-then-write sequence. Set the limit with appropriate headroom for your filesystem.
  * In-memory: The `max_events` limit is strictly enforced.
* **Corrupted Event Handling (File-Based Only)**:
  * Corrupted events encountered during reads are written to dead-letter files (`audit-events-deadletter-*.jsonl`)
  * Dead-letter files contain metadata (timestamp, source file) and the raw corrupted data for investigation
  * The `corrupted_event` metric is incremented for monitoring
  * Source files are deleted after processing, even if all events were corrupted (after moving them to dead-letter)
  * Dead-letter files should be monitored and investigated to identify data corruption issues
### Delivery Guarantees

This library aims to provide reliability guarantees similar to OpenStack's `oslo.messaging` (used by Keystone Middleware).

1. **At-Least-Once Delivery**: The primary guarantee is "at-least-once" delivery. Events are persisted to the backing store if the broker is unavailable and retried until successful.
   * *Note*: If a batch of events partially fails to send, the **entire batch** is retried. This ensures no data is lost but may result in duplicate events being sent to the broker. Consumers should implement idempotency using the event `ID` to handle these duplicates, similar to how `oslo.messaging` consumers are expected to behave.
2. **Ordering**: Strict chronological ordering is **not guaranteed** during recovery. New events are sent immediately if the connection is up, while old events from the backing store are drained asynchronously. This aligns with the behavior of many distributed message queues where availability is prioritized over strict ordering during partitions.
### Metrics

The backing store exports the following Prometheus metrics:

**Common Metrics (All Backing Store Types)**:

* `audittools_backing_store_writes_total`: Total number of audit events written to the backing store.
* `audittools_backing_store_reads_total`: Total number of audit events read from the backing store.
* `audittools_backing_store_size_bytes`: Current size of the backing store.
  * File-based: Total size in bytes
  * In-memory: Number of events

**File-Based Backing Store Metrics**:

* `audittools_backing_store_files_count`: Current number of files in the backing store.
* `audittools_backing_store_errors_total`: Total number of errors, labeled by operation:
  * `write_stat`: Failed to stat file during rotation check
  * `write_full`: Backing store is full (exceeds `max_total_size`)
  * `write_open`: Failed to open backing store file for writing
  * `write_marshal`: Failed to marshal event to JSON
  * `write_io`: Failed to write event to disk
  * `write_sync`: Failed to sync (flush) event to disk
  * `write_close`: Failed to close backing store file
  * `read_open`: Failed to open backing store file for reading
  * `read_scan`: Failed to scan backing store file
  * `corrupted_event`: Encountered corrupted event during read (written to dead-letter)
  * `deadletter_write`: Successfully wrote corrupted event to dead-letter file
  * `deadletter_write_failed`: Failed to write corrupted event to dead-letter file
  * `commit_remove`: Failed to remove file after successful processing
