Compliance & Audit Automation Footguns¶
Mistakes that get your systems flagged, your audits failed, and your compliance posture exposed as theater.
1. Running remediation without scanning first¶
You find a STIG remediation Ansible role on GitHub. You apply it to your entire fleet without running a scan first. The role changes 47 settings. Three of them break your application. You don't know which three because you didn't capture the before state.
Fix: Always scan first. Capture the baseline. Compare before and after. Remediate in stages: safe items, then risky items in staging, then production. Never apply a compliance role without testing it against your application.
War story: A team applied a DISA STIG Ansible role to their web servers without pre-scanning. The role set net.ipv4.ip_forward=0, breaking their Docker overlay networking. All containerized services went down because containers couldn't route traffic through the host.
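The "capture the baseline, compare before and after" step can be sketched in a few lines. This is a hypothetical helper, not part of any scanner: it assumes you've parsed each snapshot (e.g. the output of `sysctl -a`, taken before and after the remediation run) into a dict of setting name to value.

```python
# Hypothetical sketch: diff settings captured before and after a remediation
# run, so you know exactly which keys the role changed. Assumes each snapshot
# is a dict of setting -> value; names below are illustrative.

def diff_settings(before: dict, after: dict) -> dict:
    """Return {setting: (old, new)} for every setting the run changed."""
    changed = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed

before = {"net.ipv4.ip_forward": "1", "kernel.randomize_va_space": "2"}
after = {"net.ipv4.ip_forward": "0", "kernel.randomize_va_space": "2",
         "fs.suid_dumpable": "0"}

for setting, (old, new) in sorted(diff_settings(before, after).items()):
    print(f"{setting}: {old!r} -> {new!r}")
```

With the war story above, this diff would have pointed straight at `net.ipv4.ip_forward` flipping from 1 to 0 before anything reached production.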
2. Compliance evidence from the wrong environment¶
You submit a scan report to the auditor. They open the XML and see hostname: staging-app-02. Your production evidence is actually from staging. The auditor now questions everything you've submitted.
Fix: Automate evidence collection with environment tagging. Each evidence artifact must contain hostname, IP, environment, timestamp, and profile version. Script it so humans can't accidentally grab the wrong file.
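A minimal sketch of scripted evidence tagging, assuming you wrap each raw report in a metadata envelope at collection time. Field names here are illustrative, not any auditor's required schema:

```python
# Hypothetical sketch: stamp every evidence artifact with the metadata an
# auditor needs to trust it, and refuse to submit from the wrong environment.
import json
import socket
from datetime import datetime, timezone

def tag_evidence(report_path: str, environment: str, profile_version: str) -> dict:
    """Build a metadata envelope for one scan report."""
    return {
        "report": report_path,
        "hostname": socket.gethostname(),
        "environment": environment,          # e.g. "production", "staging"
        "profile_version": profile_version,  # e.g. "stig-rhel9-v1r2"
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

def check_environment(artifact: dict, expected: str) -> bool:
    """Gate submission: evidence from the wrong environment never ships."""
    return artifact["environment"] == expected

artifact = tag_evidence("scan-results.xml", "production", "stig-rhel9-v1r2")
print(json.dumps(artifact, indent=2))
```

The point of `check_environment` is the staging-evidence scenario above: the submission script fails loudly instead of a human quietly grabbing the wrong XML.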
3. Auto-remediation with no rollback plan¶
Your continuous compliance pipeline detects drift (someone changed the SSH config) and auto-remediates. The change was intentional: a temporary fix for a production issue. The auto-remediation reverts it, and the production issue returns. Nobody knows why because the remediation ran silently.
Fix: Auto-remediation must never act silently. Log every auto-remediation action, alert on it, and provide a one-command rollback. For sensitive systems, auto-detect drift but require human approval before remediating.
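The fix above can be sketched as a small decision function. This is a hypothetical shape, not a real tool's API: every decision lands in an audit log, the log keeps the drifted value so rollback is one command, and sensitive systems stop at "awaiting approval".

```python
# Hypothetical sketch: drift handling that never acts silently. All names
# (hosts, settings, fields) are illustrative.
from datetime import datetime, timezone

audit_log = []

def handle_drift(host: str, setting: str, found: str, expected: str,
                 sensitive: bool, approved: bool = False) -> str:
    """Decide whether to auto-remediate, and record the decision either way."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "host": host, "setting": setting,
        "found": found, "expected": expected,   # keeping 'found' enables rollback
    }
    if sensitive and not approved:
        entry["action"] = "awaiting-approval"   # detect, alert, don't touch
    else:
        entry["action"] = "remediated"
    audit_log.append(entry)
    return entry["action"]

def rollback(entry: dict) -> str:
    """One-command rollback: restore the value the drift introduced."""
    return f"set {entry['setting']}={entry['found']} on {entry['host']}"

# A sensitive host gets flagged, not silently reverted:
print(handle_drift("prod-db-01", "sshd:PermitRootLogin", "yes", "no",
                   sensitive=True))
```

In the SSH-config war story, this flow would have produced an alert and an approval request instead of a silent revert that resurrected the production issue.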
4. Treating the scan report as the truth¶
Your OpenSCAP scan shows 100% pass. You declare victory. But the scan profile doesn't cover your application layer, your container images, your cloud IAM policies, or your network segmentation. You're 100% compliant with 30% of the requirements.
Fix: Understand what your scan covers and what it doesn't. CIS Linux covers OS hardening — not application security. You need multiple scan profiles: OS baseline, container image baseline, cloud configuration, and application-specific controls. Map each requirement to a scan.
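"Map each requirement to a scan" is worth making executable. A hypothetical sketch, with made-up requirement and profile names, that surfaces the layers no scan covers:

```python
# Hypothetical sketch: requirement-to-scan coverage map. Anything mapped to
# None is a gap your "100% pass" report says nothing about.

coverage = {
    "os-hardening":         "cis-linux-baseline",
    "container-images":     "cis-docker-benchmark",
    "cloud-iam":            None,   # no scan covers this yet
    "network-segmentation": None,
    "app-controls":         "inspec-app-profile",
}

def uncovered(mapping: dict) -> list:
    """Requirements with no scan profile mapped to them."""
    return sorted(req for req, profile in mapping.items() if profile is None)

print("Not covered by any scan:", uncovered(coverage))
```

A check like this, run in CI, turns "100% compliant with 30% of the requirements" from a surprise into a tracked backlog.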
5. Waivers without expiration dates¶
You file a waiver for a STIG finding because "the application needs it." The waiver has no review date. Three years later, the application has been rewritten twice. The waiver is still active. Nobody knows if it's still needed. The auditor asks when it was last reviewed. You don't have an answer.
Fix: Every waiver must have an expiration date (6 months max). At expiration, re-evaluate: is the waiver still needed? Has the application changed? Can the control now be met? Automate waiver expiration tracking and alert 30 days before review date.
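Automated expiration tracking is a date comparison, not a product. A minimal sketch, assuming waivers are records with an `expires` date (in practice you'd load them from a versioned file; the IDs and fields here are illustrative):

```python
# Hypothetical sketch: waiver expiration tracking with a 30-day review warning.
from datetime import date, timedelta

def waiver_status(waiver: dict, today: date) -> str:
    """'expired', 'review-due' (within 30 days of expiry), or 'active'."""
    expires = waiver["expires"]
    if today >= expires:
        return "expired"
    if today >= expires - timedelta(days=30):
        return "review-due"
    return "active"

waivers = [
    {"id": "V-230221", "reason": "app needs ip_forward", "expires": date(2024, 3, 1)},
    {"id": "V-230380", "reason": "legacy TLS client", "expires": date(2024, 8, 1)},
]

today = date(2024, 7, 10)
for w in waivers:
    print(w["id"], waiver_status(w, today))
```

Anything reporting "expired" answers the auditor's "when was this last reviewed?" question before they ask it.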
6. Scanning only at build time¶
Your CI pipeline runs InSpec on the container image. It passes. You deploy. Six weeks later, the production environment has drifted: someone mounted a host path, changed a kernel parameter, installed a debug tool. Your build-time scan knows nothing about this.
Fix: Scan at three points: build time (image compliance), deploy time (runtime configuration), and continuously in production (drift detection). Build-time scanning is necessary but not sufficient.
7. Hardcoded exceptions in the scan profile¶
Your scan has 12 controls with skip or not_applicable hardcoded in the profile. Nobody remembers why. When the auditor asks about each one, you can't justify them. The auditor treats them as findings.
Fix: Every exception must be in a separate waiver file (not in the profile itself) with documented justification. The waiver file is versioned in git. The commit message explains why each exception was added. Auditors love git history.
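One way to keep the profile honest is to apply waivers as a post-processing step over full scan results, rather than skipping controls inside the profile. A hypothetical sketch (the rule IDs and field names are illustrative):

```python
# Hypothetical sketch: the scanner reports everything; waivers from a separate,
# git-versioned file are applied afterwards, each carrying its justification.

waivers = {
    # control_id: documented justification (lives in the waiver file, not the profile)
    "xccdf_rule_sshd_banner": "Banner text owned by legal; ticket SEC-142",
}

def apply_waivers(findings: list, waivers: dict) -> list:
    """Annotate waived findings instead of hiding them."""
    out = []
    for f in findings:
        f = dict(f)
        if f["id"] in waivers:
            f["status"] = "waived"
            f["justification"] = waivers[f["id"]]
        out.append(f)
    return out

findings = [
    {"id": "xccdf_rule_sshd_banner", "status": "fail"},
    {"id": "xccdf_rule_root_login", "status": "fail"},
]
for f in apply_waivers(findings, waivers):
    print(f["id"], f["status"])
```

Because the waiver dict is a separate versioned file, `git log` on it gives the auditor the who, when, and why for every exception.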
8. Compliance gate that nobody can override¶
Your compliance pipeline blocks all deploys on any failure. A CAT III finding (missing banner text) blocks a critical security patch from deploying. The security patch fixes a CVE that's actively being exploited. You can't deploy the fix because of a login banner.
Fix: Compliance gates need a documented break-glass procedure. Severity-based override: CAT III findings can be overridden with team lead approval. CAT I findings require VP sign-off. Log every override with who approved it and why. The override is not a bypass — it's a tracked exception.
Gotcha: DISA STIG severity categories: CAT I = high (exploitable, immediate risk), CAT II = medium (potential for data loss), CAT III = low (administrative/cosmetic). Blocking deploys on CAT III (like missing login banners) while a CAT I CVE patch waits in the queue is a compliance process defeating its own purpose.
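The severity-based override logic above can be sketched as a small gate function. The approval tiers follow the text (team lead for CAT III, VP for CAT I); the middle tier and all identifiers are illustrative assumptions:

```python
# Hypothetical sketch: severity-aware compliance gate with a logged
# break-glass override. The override is recorded, never silent.

REQUIRED_APPROVER = {"CAT I": "vp", "CAT II": "manager", "CAT III": "team-lead"}
RANK = {"vp": 3, "manager": 2, "team-lead": 1}

override_log = []

def gate(findings: list, override_by: str = None, reason: str = "") -> bool:
    """Return True if the deploy may proceed; log any override used."""
    if not findings:
        return True
    # Highest approval tier demanded by any open finding.
    needed = max(RANK[REQUIRED_APPROVER[f["severity"]]] for f in findings)
    if override_by is not None and RANK.get(override_by, 0) >= needed:
        override_log.append({"approved_by": override_by, "reason": reason,
                             "findings": [f["id"] for f in findings]})
        return True
    return False

banner = [{"id": "missing-login-banner", "severity": "CAT III"}]
print(gate(banner))                          # blocked: no override supplied
print(gate(banner, override_by="team-lead",
           reason="CAT I CVE patch waiting behind this gate"))
```

The banner scenario from the text resolves in minutes: a team lead approves the CAT III override, the log records who and why, and the CVE patch ships.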
9. One compliance profile for all server roles¶
You apply the same STIG profile to web servers, database servers, jump boxes, and monitoring systems. The web server STIG requires Apache controls. Your database server doesn't run Apache. The scan reports 20 "not applicable" findings per server, burying the real findings in noise.
Fix: Build role-specific profiles. Start with a common OS baseline, then layer role-specific controls on top. Web servers get the Apache controls. Database servers get the PostgreSQL controls. Monitoring servers get minimal hardening focused on data integrity.
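The layering described above is just set union over a common baseline. A sketch with made-up control names:

```python
# Hypothetical sketch: per-role profiles built by layering role-specific
# controls on a shared OS baseline. All control names are illustrative.

COMMON_BASELINE = {"ssh-hardening", "audit-logging", "disk-encryption"}

ROLE_CONTROLS = {
    "web":        {"apache-hardening", "tls-config"},
    "database":   {"postgresql-hardening", "data-at-rest"},
    "monitoring": {"data-integrity"},
}

def profile_for(role: str) -> set:
    """Common baseline plus role-specific controls; unknown roles get baseline only."""
    return COMMON_BASELINE | ROLE_CONTROLS.get(role, set())

print(sorted(profile_for("database")))
```

A database server scanned with `profile_for("database")` never sees Apache controls, so its report contains findings rather than noise.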
10. Presenting compliance percentage without context¶
You tell leadership "we're 94% STIG compliant." That sounds great. But the 6% you're missing is all CAT I findings: no disk encryption, no audit logging, root SSH enabled. Percentage without severity weighting is misleading.
Fix: Report compliance by severity category, not just percentage. "100% CAT I, 97% CAT II, 88% CAT III" tells a real story. One unfixed CAT I finding is worse than fifty unfixed CAT III findings. Weight your reporting accordingly.
Remember: The compliance percentage game: a system with 200 controls, 2 CAT I failures, and 10 CAT III failures is "94% compliant" — sounds great. But those 2 CAT I failures might be "no disk encryption" and "root SSH enabled." Severity-weighted reporting prevents this false confidence.
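The percentage game above is easy to reproduce in code. A sketch that builds the 200-control example from the Remember note (2 CAT I failures, 10 CAT III failures) and reports both views; the result shapes are illustrative:

```python
# Hypothetical sketch: pass rates per severity category versus the single
# misleading overall percentage.

def report_by_severity(results: list) -> dict:
    """Return {'CAT I': (passed, total), ...} for a list of findings."""
    by_cat = {}
    for r in results:
        passed, total = by_cat.get(r["severity"], (0, 0))
        by_cat[r["severity"]] = (passed + (r["status"] == "pass"), total + 1)
    return by_cat

# 200 controls, 2 CAT I failures, 10 CAT III failures -> "94% compliant"
results = ([{"severity": "CAT I", "status": "fail"}] * 2 +
           [{"severity": "CAT I", "status": "pass"}] * 18 +
           [{"severity": "CAT II", "status": "pass"}] * 80 +
           [{"severity": "CAT III", "status": "fail"}] * 10 +
           [{"severity": "CAT III", "status": "pass"}] * 90)

overall = sum(r["status"] == "pass" for r in results) / len(results)
print(f"overall: {overall:.0%}")          # the headline number leadership sees
for cat, (p, t) in sorted(report_by_severity(results).items()):
    print(f"{cat}: {p}/{t} pass")         # CAT I: 18/20 is the real story
```

The overall number comes out to 94%, while the per-category view shows two CAT I failures that the headline figure buries.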