
Compliance & Audit Automation - Street-Level Ops

What experienced compliance engineers know about surviving audits and keeping systems hardened without losing your mind.

Quick Diagnosis Commands

# Quick STIG compliance check — how many findings?
oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_stig \
  --results /tmp/stig-quick.xml \
  /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml 2>/dev/null
# Note: XCCDF results use a default XML namespace, so match on local-name()
echo "Failures: $(xmllint --xpath 'count(//*[local-name()="rule-result"][*[local-name()="result"]="fail"])' /tmp/stig-quick.xml)"
echo "Passes:   $(xmllint --xpath 'count(//*[local-name()="rule-result"][*[local-name()="result"]="pass"])' /tmp/stig-quick.xml)"

# Check if SSH is hardened (quick spot check; sshd -T must run as root)
sshd -T 2>/dev/null | grep -iE 'permitrootlogin|passwordauthentication|x11forwarding'

# Check if auditd is running and configured
systemctl is-active auditd && auditctl -l | head -20

# List world-writable files (common CAT II finding)
find / -xdev -type f -perm -002 -not -path "/proc/*" -not -path "/sys/*" 2>/dev/null

# Check SUID/SGID binaries against known-good list
find / -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null | sort

# InSpec quick scan against CIS baseline
inspec exec supermarket://dev-sec/linux-baseline --reporter cli 2>/dev/null | tail -20

# Check SELinux status (CAT I on RHEL)
getenforce && sestatus | head -5

# Verify password complexity requirements
grep -E 'minlen|minclass|dcredit|ucredit|lcredit|ocredit' /etc/security/pwquality.conf

One-liner: Compliance is not a state you achieve — it is a process you automate. If your compliance check is manual, it is already drifting.
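In that spirit, here is a minimal sketch of wiring the quick check into an automated drift alarm, assuming a cron job runs it after each scan. The state-file path and both helper functions are illustrative, not a standard tool:

```shell
#!/bin/sh
# Sketch: turn the quick STIG check into a drift alarm.
# STATE path and function names are illustrative.

STATE=/var/tmp/stig-fail-count   # last known failure count

count_failures() {
    # local-name() keeps the XPath working despite the XCCDF namespace
    xmllint --xpath \
        'count(//*[local-name()="rule-result"][*[local-name()="result"]="fail"])' \
        "$1"
}

check_drift() {
    # args: new_count old_count; prints DRIFT or OK, exit 1 on drift
    if [ "$1" -gt "$2" ]; then
        echo "DRIFT: failures rose from $2 to $1"   # wire this line to alerting
        return 1
    fi
    echo "OK: $1 failures (was $2)"
}
```

Cron this after each scan: feed `count_failures` on the fresh results and the saved count into `check_drift`, then write the new count back to the state file.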

Gotcha: OpenSCAP Scan Hangs on Large Fleets

You run OpenSCAP across 200 servers via Ansible. Twenty servers hang on the filesystem scan (slow NFS mounts, huge /tmp). The entire Ansible run blocks for hours.

Fix: Set timeouts on the oscap command:

# Timeout the scan at 10 minutes
timeout 600 oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_stig \
  --results results.xml \
  /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml

# In Ansible, use async + poll
- name: Run compliance scan
  command: >
    oscap xccdf eval
    --profile xccdf_org.ssgproject.content_profile_stig
    --results /tmp/results.xml
    /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
  async: 600
  poll: 30
  ignore_errors: yes
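When you add the timeout, also decide what a killed scan means for your reporting. A sketch, assuming GNU coreutils timeout; classify_scan_rc is an illustrative helper, not part of oscap. oscap's documented exit codes are 0 (all rules pass), 1 (error), and 2 (at least one rule failed); timeout exits 124 when it kills the command:

```shell
#!/bin/sh
# Sketch: classify the scan exit code so a timed-out scan is never
# mistaken for a compliant or merely-failing one.

classify_scan_rc() {
    case "$1" in
        0)   echo "compliant" ;;
        2)   echo "findings" ;;
        124) echo "timeout" ;;   # rescan later, or exclude the slow mount
        *)   echo "error" ;;
    esac
}

# Usage on a real host:
#   timeout 600 oscap xccdf eval \
#     --profile xccdf_org.ssgproject.content_profile_stig \
#     --results /tmp/results.xml \
#     /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
#   classify_scan_rc $?
```

Hosts that report "timeout" should go into a retry list instead of being counted as failures.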

Gotcha: STIG Remediation Breaks the Application

You apply the full STIG remediation playbook. Control V-230526 sets umask 077 globally. Your web application can no longer read shared temp files. Your monitoring agent can't write to its socket. Three services are now broken.

Fix: Never blind-apply full STIG remediation. Process:

1. Scan first (read-only). Catalog all findings.
2. Review each finding against your application requirements.
3. Categorize: "safe to remediate" vs "needs application testing" vs "requires waiver."
4. Remediate in waves: safe items first, then test-required items in staging.
5. Document waivers for controls that conflict with application needs.
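Steps 3 and 4 reduce to a simple list split once the findings are files. A sketch, assuming one rule ID per line and a hand-reviewed safe list; the file names are illustrative:

```shell
#!/bin/sh
# Sketch of steps 3-4: split failed rule IDs into remediation waves.
# failed: IDs that failed the scan; safe: IDs a human reviewed as
# safe to auto-remediate. One ID per line in each file.

split_waves() {
    failed=$1
    safe=$2
    outdir=$3
    # wave 1: failed AND vetted safe -> remediate automatically
    grep -Fxf  "$safe" "$failed" > "$outdir/wave1-safe.txt"   || true
    # wave 2: failed but NOT vetted -> test in staging first
    grep -Fxvf "$safe" "$failed" > "$outdir/wave2-review.txt" || true
}
```

Anything still failing after wave 2 testing becomes a waiver candidate (step 5).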

War story: A team applied the full CIS Level 2 benchmark to their database servers without testing. One control disabled core dumps. The database relied on core dumps for crash recovery diagnostics. When a crash occurred two weeks later, there was no dump to analyze, turning a 2-hour diagnosis into a 2-day guessing game.

Gotcha: Compliance Drift Between Scans

Monday scan: 98% compliant. Friday: someone manually installed a package, changed an SSH config, and added a user. Nobody scanned. The auditor arrives the following Monday and finds 15 new findings.

Fix: Continuous compliance monitoring. Don't scan weekly — scan on every change:

Three-layer drift detection:

Layer 1: File integrity monitoring (AIDE/OSSEC)
  - Detects config file changes within minutes
  - Alert on: /etc/ssh/*, /etc/audit/*, /etc/pam.d/*

Layer 2: Scheduled full scan (cron, 4x daily)
  - Catches anything file integrity missed
  - Reports delta from last scan, not just current state

Layer 3: Event-driven scan (triggered by changes)
  - Ansible callback: after any playbook run, trigger scan on changed hosts
  - Package install hook: after yum/apt, trigger relevant control checks
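Layer 2's "delta from last scan" can be a small diff over failed-rule lists. A sketch, assuming each scan dumps its failed rule IDs one per line; the paths and function name are illustrative:

```shell
#!/bin/sh
# Sketch of Layer 2 delta reporting: compare this scan's failed-rule
# list against the previous scan's.

report_delta() {
    prev=$1 curr=$2
    sort -o "$prev" "$prev"
    sort -o "$curr" "$curr"
    comm -13 "$prev" "$curr" | sed 's/^/NEW:   /'    # failing now, not before
    comm -23 "$prev" "$curr" | sed 's/^/FIXED: /'    # failing before, not now
}
```

A report full of NEW lines between scheduled scans is exactly the drift the Friday-surprise scenario describes.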

Gotcha: InSpec Profile Version Mismatch

Your CI pipeline runs InSpec profile v3.2. Production is scanned with v3.0. Two controls were added in v3.2. CI passes. Production shows two new failures that CI never caught. The auditor finds the discrepancy.

Fix: Pin profile versions in both CI and production. Store profiles in git with version tags. Promote profile versions through environments just like application code:

# Pin profile version in Inspec.lock or your automation
inspec exec https://github.com/org/compliance-profiles/archive/v3.2.tar.gz

Gotcha: Evidence That Proves the Wrong Thing

You collect a scan report from staging and present it as production evidence. The auditor checks the hostname in the XML. It says "staging-web-01." Your evidence is now worthless and the auditor's trust in your process is damaged.

Fix: Evidence collection must include: hostname, IP address, timestamp, profile version, and environment tag.

Remember: Good compliance evidence has five properties: timestamped, machine-identified, version-tagged, immutable, checksummed. Mnemonic: TMVIC — Time, Machine, Version, Immutable, Checksum. If any are missing, an auditor can challenge the evidence. Automate it so humans can't accidentally mix environments. Checksum evidence bundles and store them immutably (S3 with versioning, not a file share someone can edit).
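A sketch of a bundler that bakes TMVIC into the artifact, assuming sha256sum is available and a versioned S3 bucket is the immutable store; the bucket name, paths, and function name are placeholders:

```shell
#!/bin/sh
# Sketch: Time and Machine go in the file name, Version is tagged,
# the bundle is Checksummed; Immutability comes from the store you
# upload to (placeholder line commented out).

bundle_evidence() {
    results=$1      # scan results file
    profile_ver=$2  # e.g. v3.2
    outdir=$3
    host=$(hostname)
    ts=$(date -u +%Y%m%dT%H%M%SZ)
    bundle="$outdir/evidence-${host}-${profile_ver}-${ts}.tar.gz"
    tar -czf "$bundle" "$results" 2>/dev/null
    sha256sum "$bundle" > "$bundle.sha256"
    # aws s3 cp "$bundle" s3://compliance-evidence/   # versioned bucket (placeholder)
    echo "$bundle"
}
```

Because the hostname is in the file name, a staging bundle can never be passed off as production evidence by accident.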

Pattern: The Compliance Sprint

Dedicated remediation effort before a scheduled audit:

8 weeks before audit:
  Week 1: Full scan of all in-scope systems. Baseline current state.
  Week 2: Categorize findings — auto-fix vs manual vs waiver.
  Week 3-4: Automated remediation of safe items. Re-scan.
  Week 5: Manual remediation of complex items. Test in staging.
  Week 6: Re-scan. Write waiver requests for items that can't be fixed.
  Week 7: Final scan. Evidence collection. Package evidence bundles.
  Week 8: Internal review. Fix any last gaps. Prepare for auditor walkthrough.

Key rule: no new infrastructure changes in the final 2 weeks.
Change freeze protects the evidence you just collected.

Pattern: The Waiver Document

Some STIG controls genuinely conflict with your application. You need a formal waiver.

STIG Waiver Request Template:

  Finding ID: V-230526
  Severity: CAT II
  Title: System default umask must be 077
  Status: Waiver Requested

  Current Configuration: umask 022
  Required by STIG: umask 077

  Business Justification:
  The application (grokdevops) requires umask 022 to allow
  the monitoring agent (running as prometheus user) to read
  application log files written by the appuser account.
  Setting umask 077 breaks monitoring, which degrades our
  ability to detect security incidents.

  Compensating Controls:
  1. Application logs are restricted to appuser:monitoring group
  2. File ACLs limit access to only required service accounts
  3. Audit logging tracks all file access to the log directory

  Risk Acceptance: [Signature] [Date]
  Review Date: [6 months from now]

Pattern: Ansible + OpenSCAP Continuous Loop

The continuous compliance loop:

  ┌──────────────────────────┐
  │ 1. Scan (OpenSCAP)       │
  │    Produce results.xml   │
  └──────────┬───────────────┘
  ┌──────────▼───────────────┐
  │ 2. Parse (Python/jq)     │
  │    Extract failed rules  │
  └──────────┬───────────────┘
  ┌──────────▼───────────────┐
  │ 3. Remediate (Ansible)   │
  │    Only failed controls  │
  └──────────┬───────────────┘
  ┌──────────▼───────────────┐
  │ 4. Re-scan (verify)      │
  │    Confirm remediation   │
  └──────────┬───────────────┘
  ┌──────────▼───────────────┐
  │ 5. Archive evidence      │
  │    Timestamped, signed   │
  └──────────┬───────────────┘
             └──────▶ Repeat on schedule (daily/weekly)
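Step 2 does not need Python; xmllint alone can pull the failed rule IDs out of the results file. A sketch, with local-name() guarding against the XCCDF default namespace; the downstream Ansible invocation is illustrative:

```shell
#!/bin/sh
# Sketch of step 2: extract the idref of every failed rule-result.

failed_rule_ids() {
    xmllint --xpath \
        '//*[local-name()="rule-result"][*[local-name()="result"]="fail"]/@idref' \
        "$1" | tr ' ' '\n' | sed -n 's/^idref="\(.*\)"$/\1/p'
}

# Feed the IDs to step 3, e.g. (playbook name is a placeholder):
#   ansible-playbook remediate.yml \
#     -e "failed_rules=$(failed_rule_ids /tmp/results.xml | paste -sd, -)"
```

Targeting only the failed controls keeps remediation runs short and makes the re-scan in step 4 a meaningful verification.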

Emergency: Auditor Finds Critical Finding in Production

Auditor discovers a CAT I finding (e.g., auditd not running on 5 servers) during a live audit.

1. Don't panic. Don't make excuses. Acknowledge the finding.
2. Fix it NOW — in front of the auditor if possible.
   systemctl start auditd && systemctl enable auditd
3. Investigate: why was it off? Was it ever on? Check:
   - systemctl status auditd (last state change)
   - journalctl -u auditd (recent errors)
   - Ansible run logs (was it supposed to be enabled?)
4. Determine scope: are other servers affected?
   ansible all -m shell -a "systemctl is-active auditd" | grep -v CHANGED
5. Remediate fleet-wide immediately.
6. Document the finding, root cause, and remediation timeline.
7. Show the auditor your remediation evidence + re-scan results.
8. Add the control to your continuous monitoring so it can never drift again.

Emergency: Compliance Pipeline Blocking All Deploys

Your compliance gate is rejecting every build. A profile update added a new control that nothing passes. Feature work is completely blocked.

1. Identify the blocking control:
   jq '.profiles[].controls[] | select(.results[].status == "failed")
   | {id, title}' results.json

2. Assess: is this a real security issue or a profile version bump?
   - Real issue: fix it (usually a config change, not code)
   - Profile bump: pin the previous profile version temporarily

3. Short-term: add the control to a waiver file (with a ticket to fix)
   # waivers.yml
   new-control-123:
     run: false
     justification: "Blocked on hardening work; see tracking ticket"
     expiration_date: 2099-01-01   # set a real deadline
   inspec exec profile --waiver-file waivers.yml

4. Long-term: fix the underlying issue and remove the waiver.
   Track waiver removal as a P1 task with a 2-week deadline.

5. Never disable the compliance gate entirely.
   That's how you end up with a "we'll turn it back on later" situation
   that lasts 6 months.
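To sanity-check the step-1 jq filter before you need it in anger, run it against a minimal hand-written sample of InSpec-shaped JSON (sample data, not real scanner output):

```shell
#!/bin/sh
# Sample run of the step-1 jq filter against hand-written sample data.

cat > /tmp/sample-results.json <<'EOF'
{"profiles":[{"controls":[
  {"id":"new-control-123","title":"New hardening check","results":[{"status":"failed"}]},
  {"id":"old-control-1","title":"Existing check","results":[{"status":"passed"}]}
]}]}
EOF

jq -c '.profiles[].controls[]
       | select(.results[].status == "failed")
       | {id, title}' /tmp/sample-results.json
# -> {"id":"new-control-123","title":"New hardening check"}
```

Only the failed control comes back, which is exactly what you need to decide between "fix it" and "waive it".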

Default trap: a control waived with run: false never executes again. It still shows up as waived in the report, but nothing re-checks whether it actually fails, so it quietly drops off everyone's radar. Always include expiration_date in your waiver YAML so waivers auto-expire, and pipe waiver results to your compliance dashboard so waived controls stay visible.