
Audit Logging Footguns

Mistakes that leave you blind during incidents, violate compliance, or crash production systems.


1. No off-host log shipping

Audit logs stay on the same host. An attacker gains root, deletes /var/log/audit/audit.log, and their actions are gone forever. You have no evidence for the investigation.

Fix: Ship logs off-host immediately via audisp-remote, Filebeat, or Fluentd to a central SIEM. The central store should be append-only or immutable (S3 with Object Lock, WORM storage).

War story: In countless incident post-mortems, attackers with root access delete local audit logs as their first action. If your only audit trail is on the compromised host, the investigation is over before it starts. S3 Object Lock in compliance mode makes logs truly immutable — not even the root AWS account can delete them during the retention period.
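A minimal audisp-remote setup looks roughly like this, assuming audit 3.x config paths (older releases use /etc/audisp/ instead of /etc/audit/) and a placeholder SIEM hostname:

```ini
# /etc/audit/audisp-remote.conf -- where to ship events
remote_server = siem.internal.example.com   # placeholder hostname
port = 60
transport = tcp

# /etc/audit/plugins.d/au-remote.conf -- enable the remote plugin
# (flip 'active' in the existing stanza; other fields ship with the package)
active = yes
```

Restart auditd after changing plugin config, and confirm events arrive at the collector before trusting the pipeline.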


2. Overly broad rules flooding the logs

You enable auditctl -a always,exit -F arch=b64 -S execve -k commands on a busy server. Every program execution generates an audit record. The audit log rotates every 10 minutes. Disk fills up. Actual security events are buried in noise.

Fix: Scope rules to what matters: privileged commands, config file changes, authentication events. Do not audit every syscall on production servers unless you have the storage and indexing to handle it.
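A scoped ruleset along those lines might look like this (the keys and watched paths are illustrative, not a complete baseline):

```shell
# /etc/audit/rules.d/50-scoped.rules -- narrow rules instead of auditing every execve

# Privileged commands: execs by root on behalf of a real logged-in user
# (auid>=1000 excludes system accounts; auid!=4294967295 excludes unset)
-a always,exit -F arch=b64 -S execve -F euid=0 -F auid>=1000 -F auid!=4294967295 -k priv_commands

# Config file changes
-w /etc/sudoers -p wa -k sudoers_changes
-w /etc/ssh/sshd_config -p wa -k sshd_config

# Authentication-related files
-w /etc/pam.d/ -p wa -k auth_config
```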

Default trap: A single execve audit rule on a busy server can generate 10,000+ events per minute. Each event includes the full command line, environment, and working directory. At ~500 bytes per event, that is 300MB/hour of audit logs before compression.
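The volume arithmetic above is easy to sanity-check, using the stated assumptions of 10,000 events per minute at roughly 500 bytes each:

```python
# Back-of-envelope audit log volume for a broad execve rule.
# Inputs are the assumptions from the text, not measurements.
events_per_minute = 10_000
bytes_per_event = 500

mb_per_hour = events_per_minute * 60 * bytes_per_event / 1_000_000
print(f"{mb_per_hour:.0f} MB/hour")  # 300 MB/hour
```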


3. Setting disk_full_action = IGNORE

Audit log fills the disk. With disk_full_action = IGNORE, auditd silently drops all subsequent audit events. You think you have coverage but you have a gap of hours or days with zero records. Compliance auditors find it.

Fix: Set disk_full_action = HALT for compliance-critical systems (the machine stops rather than running unaudited). For non-critical systems, use SYSLOG to at least alert. Monitor audit disk usage.
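A sketch of the relevant auditd.conf settings for a compliance-critical host; the thresholds here are examples to tune against your disk size:

```ini
# /etc/audit/auditd.conf -- fail closed, and alert before the disk fills
space_left = 512                   # MB free before the first warning fires
space_left_action = SYSLOG         # warn via syslog while there is still room
admin_space_left = 128
admin_space_left_action = SINGLE   # drop to single-user mode as a last warning
disk_full_action = HALT            # stop rather than run unaudited
disk_error_action = HALT
```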


4. Not making rules immutable in production

Rules are loaded but not locked. An attacker with root runs auditctl -D to delete all rules. From that point forward, nothing is logged. They operate freely without a trace.

Fix: Add -e 2 as the last line in your audit rules file. This makes rules immutable — they cannot be changed without a reboot. Test rules thoroughly before enabling immutability.

Under the hood: -e 2 sets the kernel audit configuration to "locked." Even root cannot modify, add, or delete rules — only a reboot resets the lock. This is critical for compliance (PCI DSS 10.5.2 requires protection of audit trails). But test your rules exhaustively first, because fixing a bad rule requires a reboot.
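With augenrules, the lock has to end up as the final rule after concatenation, so put it in a file that sorts last (the filename here is an example):

```shell
# /etc/audit/rules.d/99-finalize.rules
# augenrules concatenates rules.d/ files in lexical order, so this
# file must sort after every other rules file.
-e 2
```

After a reboot-and-load cycle, `auditctl -s` should report `enabled 2`, confirming the configuration is locked.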


5. Missing Kubernetes API server audit policy

The API server has no --audit-policy-file flag. No audit events are logged. Someone deletes a production namespace, exfils secrets, or modifies RBAC — and you have zero records of who did it or when.

Fix: Deploy an audit policy file with the API server. At minimum, log Metadata level for all requests and RequestResponse for secrets, RBAC, and pod exec/attach.

Gotcha: Without --audit-policy-file, the Kubernetes API server produces zero audit events by default. This is not a misconfiguration warning — it is silent by design. Someone can kubectl delete namespace production and there is no record of who did it or when.
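The relevant kube-apiserver flags look like this (paths and rotation values are examples; on kubeadm clusters they go in the static pod manifest, and the policy file must be mounted into the apiserver pod):

```shell
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30      # days of rotated files to retain
--audit-log-maxbackup=10   # number of rotated files to keep
--audit-log-maxsize=100    # MB before rotation
```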


6. Logging secrets at RequestResponse level

Your Kubernetes audit policy logs everything at RequestResponse level, including secrets. The audit log now contains base64-encoded database passwords, API keys, and TLS certificates. Anyone with access to the audit log has access to all secrets.

Fix: Log secrets at Metadata level only — you get who accessed which secret and when, without the secret content. Use RequestResponse only for pod exec/attach and RBAC changes.
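A minimal policy along those lines, as a sketch rather than a complete production policy:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
rules:
  # Secrets: record who touched which secret and when, never the payload
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Full request/response bodies only where the content IS the event
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach"]
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Everything else at Metadata
  - level: Metadata
```

Rules are evaluated in order and the first match wins, so the specific secret and exec rules must come before the catch-all.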


7. Never testing audit rules before deployment

You push new audit rules to /etc/audit/rules.d/ and run augenrules --load. A syntax error in one rule causes all rules to fail to load. The system runs with zero audit coverage until someone notices.

Fix: Test rules with auditctl on a staging system first. After loading, run auditctl -l and auditctl -s to verify rules are active and no errors occurred. Automate this check.
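A staging-side verification sketch, assuming rules are staged under /etc/audit/rules.d/ (paths are examples):

```shell
# On a staging host, load a candidate file directly and watch for errors
auditctl -R /tmp/candidate.rules   # reports syntax errors immediately

# After a full load, verify what is actually active
augenrules --load
auditctl -l   # list loaded rules; diff against what you pushed
auditctl -s   # status: 'enabled' should be 1 (or 2 once locked), 'lost' should be 0
```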


8. Audit log retention too short

Your audit logs rotate with 4 weeks of retention. A compliance audit requests records from 6 months ago. You do not have them. The auditor flags a finding, and your compliance certification is at risk.

Fix: Check your compliance framework's retention requirements (PCI DSS: 1 year, HIPAA: 6 years, SOC 2: 1 year). Configure retention accordingly. Ship to long-term storage (S3 lifecycle policies) for retention beyond local disk capacity.

Remember: PCI DSS requires 1 year of audit logs (3 months immediately available). HIPAA requires 6 years. SOC 2 requires 1 year. FedRAMP requires 90 days online + 1 year archived. Know your framework's requirements before setting retention policies — an auditor finding a gap can cost more than the storage.
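For S3-based long-term retention, a lifecycle configuration along these lines covers a PCI-style one-year requirement with margin (the prefix and day counts are examples):

```json
{
  "Rules": [
    {
      "ID": "audit-log-retention",
      "Filter": { "Prefix": "audit/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 400 }
    }
  ]
}
```

Note that lifecycle expiration cannot delete objects still under an Object Lock retention period, which is exactly the behavior you want for compliance logs.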


9. Not correlating audit events across systems

You have auditd on Linux hosts and API server audit logs on Kubernetes, but they are in different systems with different timestamp formats. When investigating an incident, you cannot trace a user's actions from SSH login through kubectl commands to API server operations.

Fix: Centralize all audit sources into one SIEM. Normalize timestamps to UTC. Use consistent user identifiers across systems. Build cross-source queries for incident investigation.
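Normalizing the two timestamp formats is mechanical: auditd stamps records with epoch seconds inside `msg=audit(epoch:serial)`, while Kubernetes audit events use RFC 3339 strings. A sketch, assuming standard formats:

```python
import re
from datetime import datetime, timezone

def auditd_ts(line: str) -> datetime:
    """Extract the epoch timestamp from an auditd msg=audit(...) field as UTC."""
    m = re.search(r"msg=audit\((\d+\.\d+):\d+\)", line)
    return datetime.fromtimestamp(float(m.group(1)), tz=timezone.utc)

def k8s_ts(stamp: str) -> datetime:
    """Parse a Kubernetes audit RFC 3339 timestamp into an aware UTC datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))

auditd_line = "type=SYSCALL msg=audit(1700000000.123:4567): arch=c000003e syscall=59"
print(auditd_ts(auditd_line).isoformat())            # 2023-11-14T22:13:20.123000+00:00
print(k8s_ts("2023-11-14T22:13:20.123456Z").isoformat())
```

With both sources in aware UTC datetimes, cross-source queries can sort and window events directly.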


10. Ignoring the audit backlog

auditctl -s shows lost 15230. The kernel audit subsystem dropped events because the backlog queue was full. You have gaps in your audit trail that you cannot recover.

Fix: Raise the backlog limit with -b in your audit rules file, not auditd.conf (many distros ship -b 8192; consider -b 32768 for busy systems). Monitor the lost counter. If you are still losing events, reduce rule scope or add audit infrastructure capacity.

Debug clue: auditctl -s shows backlog (current queue depth) and lost (cumulative drops). If lost is non-zero, you have gaps. If backlog is consistently near backlog_limit, you need to increase the limit or reduce rule volume. Add auditctl -s | grep lost to your monitoring.
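The check is easy to automate with a small parser over `auditctl -s` output; the sample status text below is illustrative:

```python
# Sketch of a monitoring check over `auditctl -s` output (key value pairs,
# one per line). In production, replace `sample` with the command's output.
sample = """\
enabled 1
failure 1
pid 1234
rate_limit 0
backlog_limit 8192
lost 15230
backlog 7900
"""

status = dict(line.split(None, 1) for line in sample.splitlines())
lost = int(status["lost"])
backlog = int(status["backlog"])
limit = int(status["backlog_limit"])

if lost > 0:
    print(f"ALERT: {lost} audit events dropped")
if backlog > 0.9 * limit:
    print(f"WARN: backlog {backlog}/{limit} near limit")
```

Wire the two conditions into whatever alerting you already run; a non-zero `lost` counter means evidence is already gone.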