---
title: "Compliance as Code: Automating the Auditor"
tags:
  - lesson
  - compliance-frameworks
  - policy-as-code
  - opa/rego
  - kyverno
  - inspec
  - audit-logging
  - aws-config
  - ci/cd
  - cis-benchmarks
---

# Compliance as Code — Automating the Auditor

Topics: compliance frameworks, policy as code, OPA/Rego, Kyverno, InSpec, audit logging, AWS Config, CI/CD, CIS Benchmarks
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
## The Mission
The auditor is coming in 30 days.
Your company processes payments (PCI-DSS scope), stores health records for a partner (HIPAA-adjacent), and just signed an enterprise customer who requires SOC 2 Type II. The VP of Engineering just forwarded you an email from the compliance team: "Please provide evidence of current security controls across all production systems."
You open the shared drive. There's a spreadsheet called
compliance-tracker-FINAL-v3-ACTUAL-FINAL.xlsx. The last update was six months ago.
Column F says "SSH hardened" for 47 servers. Column G says "Reviewed by: Dave." Dave left
the company in October.
You have 30 days. You can spend them filling out spreadsheets and taking screenshots, or you can build a system that produces compliance evidence continuously, automatically, and honestly. By the end of this lesson, you'll have the tools and patterns to do the second thing.
We'll cover:

- Why manual compliance fails (and always will)
- The major compliance frameworks and what they actually require
- Policy as code with OPA/Rego and Kyverno
- Infrastructure compliance with InSpec and AWS Config
- Audit logging from Linux hosts through Kubernetes to the cloud
- How to wire all of this into your CI/CD pipeline
- Automating evidence collection so the next audit is a non-event
## Part 1: Why Spreadsheets Lie
Before we build anything, let's understand what we're replacing.
War Story: A fintech startup passed their first SOC 2 audit using a manually maintained compliance spreadsheet. The spreadsheet showed all 200+ controls as "met." When a new engineer ran the first automated scan six months later, 34 controls were failing — SSH root login was enabled on 12 servers, auditd wasn't running on 8 hosts, and three S3 buckets were publicly readable. The spreadsheet had been copy-pasted from the previous quarter without re-checking a single control. Point-in-time manual audits are fiction — they tell you what was true once, not what's true now.
Manual compliance fails for three structural reasons:
| Problem | What happens | Real cost |
|---|---|---|
| Point-in-time snapshots | You scan on Monday, someone changes SSH config on Tuesday | Drift goes undetected until the next audit |
| Human evidence collection | Screenshots, copy-pasted configs, "I checked it" | Evidence is stale, incomplete, or from the wrong environment |
| No enforcement | Policy says "no privileged containers" but nothing stops them | Violations accumulate between audits |
The compliance maturity ladder:

| Level | Name | What it looks like |
|---|---|---|
| 0 | Manual | Spreadsheets, screenshots, "trust me" |
| 1 | Scripted | One-off scan scripts, run before the audit |
| 2 | Scheduled | Cron jobs run scans weekly, reports emailed |
| 3 | Pipeline | Compliance checks in CI/CD, blocks bad builds |
| 4 | Continuous | Real-time monitoring + auto-remediation |
| 5 | Codified | Compliance profiles versioned in git, auditable diffs |
Most organizations are at Level 0-1. This lesson gets you to Level 3-4.
Mental Model: Think of compliance like tests in your codebase. You wouldn't ship code that was "manually verified to work last quarter." You run tests on every commit. Compliance should work the same way — automated, continuous, and blocking.
## Part 2: The Framework Landscape (What Auditors Actually Want)
Different frameworks care about different things, but they share a common core. Learn the core once and you can map it to any framework.
| Framework | Who needs it | Focus | Key requirement |
|---|---|---|---|
| SOC 2 | SaaS companies | Trust service criteria | Prove controls work over 6-12 months |
| PCI-DSS | Anyone touching payment cards | Cardholder data protection | Quarterly scans, daily log review |
| HIPAA | Healthcare data handlers | Protected health information | Encrypt PHI at rest and in transit |
| CIS Benchmarks | Everyone | OS/container/cloud hardening | Configuration baselines per platform |
| NIST 800-53 | Government contractors | Security controls catalog | Formal control selection and assessment |
Trivia: SOC 2 evolved from SAS 70, an auditing standard created for accounting firms in the 1990s. It was never designed for software companies. Its adaptation to tech was driven by cloud computing — enterprise customers needed assurance that SaaS providers were handling data responsibly. This origin explains many of its awkward requirements around "logical access" and "change management" that feel like they were written for a mainframe era — because they were.
The five controls that appear in every framework:
- Access management — who can access what, and is it the minimum needed?
- Encryption — data at rest and in transit
- Audit logging — who did what, when, from where
- Change management — how do changes get approved and tracked?
- Incident response — what happens when something goes wrong?
Remember: The mnemonic AEACI (pronounced "ACE-ee") — Access, Encryption, Audit, Change, Incident. Master these five control families and you've covered 80% of any compliance framework. The remaining 20% is framework-specific detail.
### SOC 2 Type I vs Type II
This trips people up constantly. Type I evaluates control design at a single point in time — are the right controls defined? Type II evaluates both design and operating effectiveness over a period (typically 6-12 months) — are controls actually working?
Type II is what enterprise customers require. It's the difference between "we have a lock on the door" (Type I) and "here are 12 months of access logs proving the lock works and only authorized people get in" (Type II). Automated evidence collection is what makes Type II achievable without a dedicated compliance team.
### Flashcard Check #1
Cover the answers and test yourself.
| Question | Answer |
|---|---|
| What's the difference between SOC 2 Type I and Type II? | Type I = control design at a point in time. Type II = design + effectiveness over 6-12 months. |
| Name three of the five universal control families. | Access management, encryption, audit logging, change management, incident response. |
| What compliance maturity level involves CI/CD compliance gates? | Level 3: Pipeline. |
| Why do point-in-time compliance checks always lie? | Systems drift between checks. A Monday scan says nothing about Friday's reality. |
## Part 3: Policy as Code — Teaching Machines to Say "No"
Compliance documents say things like "all containers must run as non-root." That's a policy. If it lives only in a PDF, it's a suggestion. If it lives in code that blocks non-compliant deployments, it's enforced.
Two tools dominate this space in Kubernetes: OPA Gatekeeper (Rego language) and Kyverno (YAML-native).
Name Origin: Rego is pronounced "ray-go." It's a declarative query language purpose-built for policy by the OPA project (Styra, donated to CNCF). OPA itself stands for Open Policy Agent. Kyverno comes from the Greek word "kyberno" (κυβερνώ), meaning "to govern" — the same root as "Kubernetes" (κυβερνήτης, "helmsman") and "cybernetics." Both tools govern the cluster, just in different languages.
### OPA/Rego: Block Privileged Containers
Here's a real OPA Gatekeeper policy. First, you define a ConstraintTemplate (the
reusable logic):
```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowprivileged

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Privileged container not allowed: %v", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          container.securityContext.privileged == true
          msg := sprintf("Privileged init container not allowed: %v", [container.name])
        }
```
Then a Constraint (where and how to apply it):
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowPrivileged
metadata:
  name: no-privileged-containers
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
```
Let's break down the Rego:
| Rego element | What it does |
|---|---|
| `package k8sdisallowprivileged` | Namespace for this policy |
| `violation[{"msg": msg}]` | A partial rule — collects all violations into a set |
| `input.review.object.spec.containers[_]` | Iterates over every container in the pod spec |
| `container.securityContext.privileged == true` | Checks the privileged flag |
| `sprintf("...", [container.name])` | Builds a human-readable violation message |
Multiple violation blocks are OR'd — if any block produces a message, the admission
is denied. Conditions within a block are AND'd — all must be true for that block to
fire.
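The semantics are easy to see in a minimal sketch (the package name and input fields here are hypothetical, not from the policy above; you can evaluate it with `opa eval`):

```rego
package demo

# Two blocks with the same head are OR'd: either body firing adds a violation.
violation[{"msg": "runs as root"}] {
    input.runAsUser == 0             # single condition
}

violation[{"msg": "privileged on host network"}] {
    input.privileged == true         # both lines in this body must
    input.hostNetwork == true        # hold (AND) for it to fire
}
```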
Gotcha: Always exclude `kube-system` and `gatekeeper-system` from your constraints. If a constraint blocks rescheduling of kube-dns or metrics-server, your cluster's DNS and monitoring break. A policy that takes down the cluster is worse than no policy.
### Kyverno: The YAML-Native Alternative
Same policy in Kyverno — no new language to learn:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - securityContext:
                  privileged: "!true"
```
| Feature | OPA Gatekeeper | Kyverno |
|---|---|---|
| Policy language | Rego (custom DSL) | YAML (native K8s) |
| Learning curve | Steep (Rego is unique) | Low (just YAML) |
| Mutation support | Limited | First-class |
| Resource generation | No | Yes (auto-create NetworkPolicies, etc.) |
| Best for | Complex logic, multi-platform orgs | K8s-native teams wanting fast adoption |
Mental Model: OPA is a general-purpose policy engine that happens to work with Kubernetes. Kyverno is a Kubernetes policy engine, period. If you only need K8s admission control, Kyverno gets you there faster. If you also need to enforce policy on Terraform plans, API authorization, and CI/CD pipelines with the same language, OPA is the better investment.
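To make that multi-platform point concrete, here is a sketch of the same language applied to a Terraform plan (the package name and message are illustrative; run it with conftest or `opa eval` against the JSON from `terraform show -json plan.out`, whose `resource_changes` array is part of the plan format):

```rego
package terraform.s3

# Deny any planned S3 bucket ACL that would make objects publicly readable
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_s3_bucket_acl"
    rc.change.after.acl == "public-read"
    msg := sprintf("Public bucket ACL not allowed: %v", [rc.address])
}
```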
### Testing Policies Before They Bite
Both tools support an audit/dry-run mode. The rollout pattern:
1. Deploy the policy in audit mode (Gatekeeper: `enforcementAction: dryrun`).
2. Wait 1-2 weeks. Collect violations without blocking anything.
3. Review: how many existing resources violate the policy?
4. Fix the violations.
5. Switch to enforce mode (Gatekeeper: `enforcementAction: deny`).
6. Never skip step 1 — enforcing immediately breaks existing workloads.
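Step 1 in practice is just the Constraint from earlier with a different enforcement action (a sketch):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowPrivileged
metadata:
  name: no-privileged-containers
spec:
  enforcementAction: dryrun   # record violations in .status, block nothing
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```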
For OPA, test Rego with unit tests before deploying:
```rego
package k8sdisallowprivileged_test

import data.k8sdisallowprivileged

# A partial set rule is defined (and truthy) even when empty,
# so assert on count() rather than on the set itself.
test_deny_privileged {
  count(k8sdisallowprivileged.violation) > 0 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "evil", "securityContext": {"privileged": true}}
    ]}}}
  }
}

test_allow_unprivileged {
  count(k8sdisallowprivileged.violation) == 0 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "safe", "securityContext": {"privileged": false}}
    ]}}}
  }
}
```

```shell
# Run tests
opa test ./policies/ -v

# Syntax and type checking
opa check --strict ./policies/

# Test Gatekeeper ConstraintTemplates locally
gator test -f ./constraints/
```
## Part 4: Infrastructure Compliance — Scanning What Exists
Policy engines prevent bad things from being created. Compliance scanning finds bad things that already exist. You need both.
### InSpec: CIS Linux Benchmark Profile
InSpec expresses compliance controls as human-readable Ruby code. Here's a profile that checks key CIS Linux benchmark controls:
```ruby
# controls/cis_linux.rb

control 'cis-1.1.1' do
  impact 1.0
  title 'Ensure mounting of cramfs is disabled'
  desc 'Removing support for unnecessary filesystems reduces attack surface'
  describe kernel_module('cramfs') do
    it { should_not be_loaded }
    it { should be_disabled }
  end
end

control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure SSH root login is disabled'
  desc 'Disabling root login forces administrators to authenticate with personal accounts'
  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end

control 'cis-5.2.6' do
  impact 0.7
  title 'Ensure SSH idle timeout is configured'
  describe sshd_config do
    its('ClientAliveInterval') { should cmp <= 300 }
    its('ClientAliveCountMax') { should cmp <= 3 }
  end
end

control 'cis-4.1.1' do
  impact 1.0
  title 'Ensure auditd is installed and running'
  describe service('auditd') do
    it { should be_installed }
    it { should be_enabled }
    it { should be_running }
  end
end
```
```shell
# Run against a local system
inspec exec ./cis-linux-profile/

# Run against a remote host via SSH
inspec exec ./cis-linux-profile/ -t ssh://admin@10.0.1.50 -i ~/.ssh/audit_key

# Run against a Docker container
inspec exec ./cis-linux-profile/ -t docker://my-container-id

# Use a community profile from the InSpec Supermarket
inspec exec supermarket://dev-sec/linux-baseline

# Multiple output formats (CLI + JSON for automation + HTML for humans)
inspec exec ./cis-linux-profile/ --reporter cli json:results.json html:report.html
```
Trivia: Chef launched InSpec in 2015 as the first widely adopted compliance-as-code framework. Before InSpec, compliance checks were either manual checklists or opaque scanning tools that produced PDFs. InSpec made each compliance control a readable, executable test — the same paradigm shift that unit testing brought to application code.
### CIS Benchmarks: Level 1 vs Level 2
CIS Benchmarks define two levels of hardening:
| Level | Applies to | Impact | Example controls |
|---|---|---|---|
| Level 1 | Every server | Minimal perf impact | Disable unused filesystems, SSH hardening, password policy |
| Level 2 | High-security environments | May impact usability | SELinux enforcing, audit all privileged commands, USB disabled |
Gotcha: Don't apply CIS Level 2 everywhere. On a development VM, SELinux enforcing and USB-disable controls create friction without meaningful security benefit. Use Level 1 as the universal baseline and Level 2 for production and systems handling sensitive data. Auditors respect tiered profiles more than blanket application.
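One way to express that tiering in InSpec is a profile input that gates Level 2 controls. This is a sketch: the input name `hardening_level` and the control ID are illustrative, not from an official profile:

```ruby
# controls/level2.rb — Level 2 controls run only where explicitly enabled
hardening_level = input('hardening_level', value: 1)  # default: Level 1 baseline

control 'cis-level2-selinux' do
  impact 1.0
  title 'Ensure SELinux is enforcing (Level 2)'
  only_if('applies to Level 2 hosts only') { hardening_level >= 2 }
  describe command('getenforce') do
    its('stdout.strip') { should eq 'Enforcing' }
  end
end
```

On production hosts you would then run something like `inspec exec . --input hardening_level=2`; everywhere else the Level 2 controls are skipped, and the skip itself appears in the report as evidence of the tiering decision.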
### AWS Config Rules: Cloud-Native Compliance
AWS Config continuously evaluates your resources against rules. Some key compliance rules:
```json
{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  },
  "Scope": {
    "ComplianceResourceTypes": ["AWS::S3::Bucket"]
  }
}
```
AWS provides 300+ managed Config rules. The compliance-critical ones:
| AWS Config Rule | What it checks | Framework mapping |
|---|---|---|
| `s3-bucket-public-read-prohibited` | No publicly readable S3 buckets | PCI-DSS, SOC 2, CIS AWS |
| `encrypted-volumes` | EBS volumes are encrypted | HIPAA, PCI-DSS |
| `cloudtrail-enabled` | CloudTrail is active | All frameworks |
| `iam-root-access-key-check` | Root account has no access keys | CIS AWS Level 1 |
| `rds-storage-encrypted` | RDS instances are encrypted | HIPAA, PCI-DSS |
| `multi-region-cloudtrail-enabled` | CloudTrail covers all regions | SOC 2, PCI-DSS |
```shell
# Check compliance status via CLI
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited \
  --query 'ComplianceByConfigRules[].Compliance.ComplianceType'

# Get non-compliant resources
aws configservice get-compliance-details-by-config-rule \
  --config-rule-name s3-bucket-public-read-prohibited \
  --compliance-types NON_COMPLIANT \
  --query 'EvaluationResults[].EvaluationResultIdentifier.EvaluationResultQualifier'
```
Under the Hood: AWS Config works by recording resource configuration changes as "configuration items" in a timeline. When you create a Config rule, AWS evaluates each resource against that rule whenever its configuration changes (or on a schedule). This is continuous compliance at the cloud provider level — no scanning agent needed.
Gotcha: AWS CloudTrail is enabled by default for management events, but retains only 90 days in the Event History console. For compliance (most frameworks require 1 year minimum), you must create a Trail that delivers to an S3 bucket. Many teams discover this gap only when they need logs from four months ago during an incident investigation.
### Flashcard Check #2
| Question | Answer |
|---|---|
| What two Kubernetes CRD types does OPA Gatekeeper introduce? | ConstraintTemplate (defines the Rego logic) and Constraint (applies it with parameters and scope). |
| Why should you always exclude kube-system from Gatekeeper constraints? | Blocking system pods (kube-dns, metrics-server) from rescheduling breaks cluster DNS and monitoring. |
| What's the difference between CIS Benchmark Level 1 and Level 2? | Level 1 = basic hardening, minimal performance impact, apply everywhere. Level 2 = defense in depth, may impact usability, for high-security environments. |
| What does `enforcementAction: dryrun` do in Gatekeeper? | Records violations in `.status.violations` but doesn't block admission requests. Safe for gradual rollout. |
| How long does AWS CloudTrail retain events by default without a Trail? | 90 days in Event History. For longer retention, create a Trail to S3. |
## Part 5: The Audit Log Pipeline
Compliance frameworks require audit logs that answer four questions: who did what, when, and from where. Let's trace the complete pipeline from source to immutable storage.
The audit log pipeline:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Application  │     │ Aggregator   │     │ Immutable    │     │ SIEM /       │
│ Sources      │───▶│ (ship off    │───▶│ Storage      │───▶│ Analysis     │
│              │     │ host ASAP)   │     │ (tamper-     │     │              │
│ • auditd     │     │              │     │  proof)      │     │ • Dashboards │
│ • K8s API    │     │ • Filebeat   │     │              │     │ • Alerts     │
│ • CloudTrail │     │ • Fluentd    │     │ • S3 Object  │     │ • Forensic   │
│ • App logs   │     │ • audisp     │     │   Lock       │     │   search     │
└──────────────┘     └──────────────┘     │ • WORM       │     └──────────────┘
                                          └──────────────┘
```
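As a sketch of the aggregator stage, a minimal Filebeat configuration shipping auditd and Kubernetes audit logs off-host. The log paths and the aggregator endpoint are assumptions for illustration:

```yaml
# filebeat.yml (fragment)
filebeat.inputs:
  - type: filestream
    id: auditd
    paths:
      - /var/log/audit/audit.log
  - type: filestream
    id: k8s-audit
    paths:
      - /var/log/kubernetes/audit.log

output.logstash:
  hosts: ["log-aggregator.internal:5044"]  # hypothetical aggregator endpoint
```

The design point is "ship off host ASAP": once events leave the machine, a compromised host can no longer rewrite its own history.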
### Layer 1: Linux Audit Framework (auditd)
auditd captures kernel-level events — every syscall, file access, and authentication event. Even root can't evade it without first disabling the audit subsystem, which itself generates an audit event.
```text
# Key audit rules for compliance (/etc/audit/rules.d/compliance.rules)

# Track all authentication events
-w /var/log/faillog -p wa -k logins
-w /var/log/lastlog -p wa -k logins

# Monitor sensitive files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k sudo_changes
-w /etc/ssh/sshd_config -p wa -k sshd_config

# Track privileged commands
-a always,exit -F path=/usr/bin/sudo -F perm=x -k privileged_sudo
-a always,exit -F path=/usr/bin/passwd -F perm=x -k privileged_passwd

# Monitor privilege escalation
-a always,exit -F arch=b64 -S setuid -S setgid -k privilege_escalation

# CRITICAL: Make rules immutable (last line — requires reboot to change)
-e 2
```
Breaking down the syntax:
| Flag | Meaning |
|---|---|
| `-w /path` | Watch this file or directory |
| `-p wa` | Trigger on write (w) and attribute change (a) |
| `-k tag` | Attach a searchable key tag |
| `-a always,exit` | Always generate a record when the syscall exits |
| `-F arch=b64` | Apply to 64-bit syscalls |
| `-S execve` | Monitor this specific syscall |
| `-e 2` | Lock rules — immutable until reboot |
Under the Hood: The `-e 2` flag sets the kernel audit configuration to "locked." Even root cannot modify, add, or delete audit rules after this. Only a reboot resets the lock. This is required by PCI-DSS 10.5.2 (protection of audit trails). But test your rules exhaustively before enabling immutability — fixing a bad rule requires bouncing the machine.

Trivia: The Linux audit framework was added in kernel 2.6 (2003) to meet Common Criteria (CAPP) certification requirements. It's one of the few subsystems where the kernel generates structured records rather than just writing to dmesg. The Bellcore paper from 1990 on cryptographic timestamping — which used hash chains to prevent backdating documents — directly inspired both immutable audit logging and, 18 years later, Bitcoin's blockchain.
### Layer 2: Kubernetes Audit Logs
The Kubernetes API server can log every request, but doesn't by default. Without the `--audit-policy-file` flag, someone can `kubectl delete namespace production` and there is zero record of who did it.
```yaml
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log all secret access (who's reading your credentials?)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Log pod exec/attach at full detail (someone opened a shell)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach"]
  # Log all mutations (creates, updates, deletes)
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
  # Everything else: just metadata
  - level: Metadata
```
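The policy only takes effect once the API server is started with the audit flags. Roughly this, in the kube-apiserver static pod manifest (paths are illustrative):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment of the command args)
- --audit-policy-file=/etc/kubernetes/audit-policy.yaml
- --audit-log-path=/var/log/kubernetes/audit.log
- --audit-log-maxage=30      # days to retain rotated log files
- --audit-log-maxbackup=10   # number of rotated files to keep
```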
The four audit levels, from least to most verbose:
| Level | What's logged | Use for |
|---|---|---|
| `None` | Nothing | Health check endpoints, noisy read-only paths |
| `Metadata` | User, timestamp, resource, verb | Most resources — low overhead |
| `Request` | Metadata + request body | Mutation tracking — what changed |
| `RequestResponse` | Metadata + request + response body | Sensitive ops like pods/exec |
Gotcha: Never log secrets at `RequestResponse` level. The audit log will contain base64-encoded database passwords, API keys, and TLS certificates. Anyone with access to the audit log gets access to all secrets. Use `Metadata` for secrets — you get who accessed which secret and when, without the content.
### Layer 3: AWS CloudTrail
CloudTrail is the audit log for your AWS account. It records every API call: who made it, from what IP, what they did, and whether it succeeded.
```shell
# Who deleted that S3 bucket?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteBucket \
  --start-time "2026-03-01" \
  --end-time "2026-03-23" \
  --query 'Events[].{Time:EventTime,User:Username,Source:EventSource}'

# Create a trail for long-term retention (required for compliance)
aws cloudtrail create-trail \
  --name compliance-trail \
  --s3-bucket-name my-audit-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation
```
That last flag — `--enable-log-file-validation` — is critical. It creates digest files with SHA-256 hashes of every log file. If anyone tampers with a log file, the hash won't match. Auditors love this.
### Layer 4: Immutable Storage
Logs that can be deleted aren't audit logs — they're suggestions. True immutability requires storage that even administrators cannot modify during the retention period.
```shell
# S3 Object Lock in Compliance mode — nobody can delete, not even root AWS account
aws s3api put-object-lock-configuration \
  --bucket audit-logs-production \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Years": 1
      }
    }
  }'
```
Under the Hood: S3 Object Lock in Compliance mode is genuinely immutable — AWS themselves cannot override it. Objects are retained for the specified period regardless of who requests deletion. This is the gold standard for PCI-DSS (1 year), HIPAA (6 years), and SOC 2 (varies by policy). Governance mode allows privileged users to override — fine for dev, never for compliance.
## Part 6: Compliance in CI/CD — The Gate That Matters
The best time to catch a compliance violation is before it reaches production. Here's a GitHub Actions workflow that runs InSpec against a container image and blocks deployment on critical failures:
```yaml
# .github/workflows/compliance-gate.yml
name: Compliance Gate

on:
  push:
    branches: [main]
  pull_request:

jobs:
  compliance-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build container image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Start test container
        run: docker run -d --name compliance-target myapp:${{ github.sha }}

      - name: Run InSpec CIS baseline
        run: |
          # Mount the Docker socket so InSpec's docker:// transport
          # can reach the target container through the daemon
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            -v $(pwd)/compliance:/share \
            chef/inspec exec /share/cis-profile \
            -t docker://compliance-target \
            --reporter json:/share/results.json cli
        continue-on-error: true

      - name: Evaluate compliance results
        run: |
          CRITICAL=$(jq '[.profiles[].controls[] |
            select(.impact >= 0.7 and .results[].status == "failed")] |
            length' compliance/results.json)
          TOTAL_FAIL=$(jq '[.profiles[].controls[] |
            select(.results[].status == "failed")] |
            length' compliance/results.json)
          echo "Critical failures: $CRITICAL"
          echo "Total failures: $TOTAL_FAIL"
          if [ "$CRITICAL" -gt 0 ]; then
            echo "::error::BLOCKED: $CRITICAL critical compliance failures"
            exit 1
          fi

      - name: Archive compliance evidence
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: compliance-evidence-${{ github.sha }}
          path: compliance/results.json
          retention-days: 365

      - name: Cleanup
        if: always()
        run: docker rm -f compliance-target
```
Key design decisions in this pipeline:

- Scan the container image, not the CI runner — you're testing what ships
- Block on critical (impact >= 0.7) failures only — a missing login banner (CAT III) shouldn't block a security patch
- Archive evidence with the git SHA — auditors can trace any evidence artifact to the exact code that produced it
- 365-day retention — meets PCI-DSS's 1-year requirement
- `if: always()` on evidence archival — even failed runs produce evidence (proving the gate works)
Gotcha: A compliance gate that nobody can override is a compliance gate that blocks security patches. Build a documented break-glass procedure: CAT III overrides need team lead approval, CAT I overrides need VP sign-off. Log every override. The override is not a bypass — it's a tracked exception with an audit trail.
## Part 7: Evidence Collection Automation
The audit is in 30 days. Here's the script that replaces six weeks of screenshot-taking:
```bash
#!/usr/bin/env bash
# evidence-collect.sh — Automated compliance evidence collection
set -euo pipefail

TIMESTAMP=$(date -u +%Y-%m-%dT%H%M%SZ)
HOSTNAME=$(hostname -f)
EVIDENCE_DIR="/var/evidence/${TIMESTAMP}_${HOSTNAME}"
mkdir -p "$EVIDENCE_DIR"

echo "=== Collecting evidence: $HOSTNAME at $TIMESTAMP ==="

# System identity
echo "$HOSTNAME" > "$EVIDENCE_DIR/hostname.txt"
uname -a > "$EVIDENCE_DIR/system-info.txt"
cat /etc/os-release >> "$EVIDENCE_DIR/system-info.txt"

# Security configuration
cp /etc/ssh/sshd_config "$EVIDENCE_DIR/"
cp /etc/audit/auditd.conf "$EVIDENCE_DIR/" 2>/dev/null || echo "auditd not configured" > "$EVIDENCE_DIR/auditd.conf"
cp /etc/security/pwquality.conf "$EVIDENCE_DIR/" 2>/dev/null || true

# Network exposure
ss -tlnp > "$EVIDENCE_DIR/listening-ports.txt"

# Privileged access ("|| true" keeps set -e from aborting when a group is absent)
getent group wheel sudo 2>/dev/null > "$EVIDENCE_DIR/privileged-users.txt" || true
cat /etc/sudoers.d/* 2>/dev/null > "$EVIDENCE_DIR/sudoers-drop-in.txt" || true

# Package inventory (for vulnerability correlation)
if command -v rpm &>/dev/null; then
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > "$EVIDENCE_DIR/packages.txt"
elif command -v dpkg &>/dev/null; then
    dpkg -l | awk '/^ii/ {print $2, $3}' | sort > "$EVIDENCE_DIR/packages.txt"
fi

# OpenSCAP CIS scan (if available)
if command -v oscap &>/dev/null; then
    SCAP_CONTENT=$(find /usr/share/xml/scap/ssg/content/ -name 'ssg-*-ds.xml' | head -1)
    if [ -n "$SCAP_CONTENT" ]; then
        timeout 600 oscap xccdf eval \
            --profile xccdf_org.ssgproject.content_profile_cis \
            --results "$EVIDENCE_DIR/cis-results.xml" \
            --report "$EVIDENCE_DIR/cis-report.html" \
            "$SCAP_CONTENT" 2>/dev/null || true
    fi
fi

# Integrity: checksum the whole bundle
sha256sum "$EVIDENCE_DIR"/* > "$EVIDENCE_DIR/checksums.sha256"

# Upload to immutable storage
if command -v aws &>/dev/null; then
    aws s3 cp "$EVIDENCE_DIR" \
        "s3://compliance-evidence/production/${TIMESTAMP}/${HOSTNAME}/" \
        --recursive --quiet
fi

echo "=== Evidence collected: $EVIDENCE_DIR ==="
```
Remember: Good compliance evidence has five properties: Timestamped, Machine-identified, Version-tagged, Immutable, Checksummed. Mnemonic: TMVIC. If any property is missing, an auditor can challenge the evidence. The script above hits all five: UTC timestamp in the directory name, hostname embedded, scan profile version in the SCAP results, uploaded to S3 Object Lock, and SHA-256 checksums.
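The checksum property is easy to demonstrate in isolation. A throwaway sketch of the verification an auditor (or you, post-incident) would run against a bundle, using temp paths only:

```shell
# Stand-alone demo: checksum a piece of evidence, then verify it.
tmp=$(mktemp -d)
printf 'PermitRootLogin no\n' > "$tmp/sshd_config"
(cd "$tmp" && sha256sum sshd_config > checksums.sha256)

# Verification succeeds only if the evidence is byte-for-byte unchanged
(cd "$tmp" && sha256sum -c checksums.sha256)   # prints: sshd_config: OK
rm -rf "$tmp"
```

Flip a single byte in `sshd_config` and the same `sha256sum -c` reports `FAILED` — which is exactly the property that makes the bundle defensible in front of an auditor.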
Run this via cron daily, or trigger it from Ansible after any configuration change. When the auditor asks "show me evidence that auditd was running on server X on March 15th," you query S3 instead of scrambling.
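A daily schedule is a single cron entry (the install path and run time are placeholders):

```text
# /etc/cron.d/compliance-evidence — collect evidence daily at 02:00 UTC
0 2 * * * root /usr/local/bin/evidence-collect.sh
```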
### Flashcard Check #3
| Question | Answer |
|---|---|
| What are the four stages of an audit log pipeline? | Collection (auditd, K8s API, CloudTrail) → Shipping (Filebeat/Fluentd) → Immutable storage (S3 Object Lock) → Analysis (SIEM, dashboards). |
| What does `-e 2` do in Linux audit rules? | Makes rules immutable — cannot be changed without a reboot. Even root can't modify them. |
| Why shouldn't you log Kubernetes secrets at RequestResponse level? | The audit log would contain the actual secret values (base64-encoded passwords, keys). Metadata level logs access without content. |
| What five properties must compliance evidence have? | Timestamped, Machine-identified, Version-tagged, Immutable, Checksummed (TMVIC). |
| What's the difference between S3 Object Lock Compliance mode and Governance mode? | Compliance mode: nobody can delete, not even the root AWS account. Governance mode: privileged users can override. Use Compliance for audit logs. |
## Part 8: Putting It All Together — The 30-Day Plan
You have 30 days before the auditor arrives. Here's the build-up, prioritized by evidence value:
**Week 1: Visibility (know what you have)**

- Days 1-2: Run OpenSCAP/InSpec against all prod hosts. Baseline current state.
- Day 3: Enable AWS Config rules for the CIS AWS benchmark.
- Days 4-5: Deploy a Kubernetes audit policy if missing. Enable a CloudTrail trail to S3 with log file validation.

**Week 2: Fix the big stuff**

- Days 6-8: Remediate all CAT I / impact >= 1.0 findings:
  - SSH root login disabled
  - auditd running and configured
  - Encryption at rest enabled
- Days 9-10: Deploy a policy engine (Gatekeeper or Kyverno) in audit mode. Start collecting policy violation data.

**Week 3: Automate**

- Days 11-13: Build the evidence collection script (or adopt the one above). Schedule daily runs via cron or Ansible.
- Days 14-15: Wire InSpec into the CI/CD pipeline as a compliance gate. Archive results to S3 with Object Lock.

**Week 4: Harden the process**

- Days 16-18: Remediate CAT II / medium findings. Document waivers for controls that conflict with the application.
- Days 19-20: Switch the policy engine from audit mode to enforce.
- Day 21: Final full scan. Package evidence bundles. Run internal review — play auditor.

**Buffer: Days 22-30**

- Fix any gaps found in internal review.
- Change freeze: no infrastructure changes in the final week.
- Prepare walkthrough notes for the auditor.
War Story: The "evidence collection phase consumes 60% of audit effort" statistic comes from a 2023 survey of IT compliance teams. Screenshots, log exports, config dumps, policy documents — teams spend weeks compiling these manually. The script above and the CI/CD pipeline together automate the most painful 60%. The auditor gets machine-generated, timestamped, checksummed evidence instead of screenshots. They love it. You love it. Everyone loves it.
## Exercises
### Exercise 1: Read an InSpec Control (2 minutes)
Look at this InSpec control and answer: what does it check, and what does `impact 1.0` mean?
```ruby
control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure SSH root login is disabled'
  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end
```
**Answer**

It checks that `PermitRootLogin` in sshd_config is set to `no`. An `impact` of 1.0 means this is a critical control — the highest severity. If this fails, it's a CAT I equivalent finding: direct security risk, must fix immediately.

### Exercise 2: Write a Rego Violation Rule (5 minutes)
Write a Rego violation rule that blocks any pod using the `:latest` image tag.
The container image is at `input.review.object.spec.containers[_].image`.
Hint

Check if the image string ends with `:latest` using `endswith()`, and also catch images with no tag at all (which default to `:latest`) using `not contains()`.

Solution
```rego
package k8sdisallowlatest

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  endswith(container.image, ":latest")
  msg := sprintf("Container %v uses :latest tag", [container.name])
}

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  not contains(container.image, ":")
  msg := sprintf("Container %v has no tag (defaults to :latest)", [container.name])
}
```
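The same tag logic is easy to sanity-check outside the cluster. A Python sketch (the `uses_latest` helper is hypothetical, and slightly stricter than the Rego above: it also handles digest pins and registry ports, which the naive `contains(":")` check would misclassify):

```python
def uses_latest(image: str) -> bool:
    """True if an image reference resolves to :latest, either via an
    explicit tag or by omitting the tag entirely.
    Simplification: digest pins (@sha256:...) count as pinned, and only
    the last path segment is inspected so registry ports (registry:5000/app)
    don't produce a false negative."""
    if "@" in image:                  # pinned by digest, never :latest
        return False
    name = image.rsplit("/", 1)[-1]   # drop registry/namespace parts
    return ":" not in name or name.endswith(":latest")
```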
Exercise 3: Design an Audit Rule Set (10 minutes)¶
You're securing a Linux server that handles payment card data (PCI-DSS scope). Write
auditctl rules that cover these PCI requirements:
- Log all access to cardholder data files in /opt/payment/data/
- Log all privileged command execution (sudo)
- Log all authentication events (changes to passwd, shadow)
- Log all changes to audit configuration
Solution

```bash
# Cardholder data access (PCI-DSS 10.2.1)
auditctl -w /opt/payment/data/ -p rwxa -k cardholder_data

# Privileged commands (PCI-DSS 10.2.2)
auditctl -a always,exit -F path=/usr/bin/sudo -F perm=x -k privileged_sudo

# Authentication events (PCI-DSS 10.2.5)
auditctl -w /etc/passwd -p wa -k identity
auditctl -w /etc/shadow -p wa -k identity

# Audit configuration changes (PCI-DSS 10.2.6)
auditctl -w /etc/audit/ -p wa -k audit_config
auditctl -w /etc/audit/auditd.conf -p wa -k audit_config
```
Cheat Sheet¶
| Tool | Purpose | Quick command |
|---|---|---|
| OpenSCAP | SCAP-based scanning (CIS, STIG) | `oscap xccdf eval --profile cis --results out.xml content.xml` |
| InSpec | Compliance as Ruby code | `inspec exec profile -t ssh://host --reporter json:out.json` |
| OPA Gatekeeper | K8s admission policy (Rego) | `gator test -f ./constraints/` |
| Kyverno | K8s admission policy (YAML) | `kyverno apply ./policies/ --resource pod.yaml` |
| AWS Config | Cloud resource compliance | `aws configservice describe-compliance-by-config-rule` |
| CloudTrail | AWS API audit log | `aws cloudtrail lookup-events --lookup-attributes ...` |
| auditctl | Linux audit rule management | `auditctl -w /etc/passwd -p wa -k identity` |
| ausearch | Search Linux audit logs | `ausearch -k identity -i` |
| aureport | Audit log reports | `aureport --auth --failed` |
| Conftest | OPA/Rego for config files | `conftest test deployment.yaml` |

| Compliance term | Meaning |
|---|---|
| CAT I / Impact 1.0 | Critical — fix immediately, no exceptions |
| CAT II / Impact 0.7 | Medium — fix with plan, waivers possible |
| CAT III / Impact 0.3 | Low — fix when practical |
| SCAP | Security Content Automation Protocol (NIST standard) |
| XCCDF | Checklist format within SCAP |
| WORM | Write Once Read Many (immutable storage) |
| S3 Object Lock | AWS immutable storage (Compliance or Governance mode) |

| Retention requirement | Duration |
|---|---|
| PCI-DSS | 1 year (3 months immediately accessible) |
| HIPAA | 6 years |
| SOC 2 | 1 year (per org policy) |
| FedRAMP | 90 days online, 1 year archived |
Takeaways¶
- Manual compliance is compliance theater. If your evidence is screenshots and spreadsheets, it was stale before you saved it. Automate evidence collection and run it continuously.
- Policy as code turns suggestions into enforcement. A PDF that says "no privileged containers" is a suggestion. An OPA/Kyverno admission webhook is a wall. Ship both.
- Audit logs must leave the host immediately. An attacker's first move is deleting local logs. Ship to immutable storage (S3 Object Lock in Compliance mode) in near-real-time.
- Scan at three points: build, deploy, and continuously in production. Build-time scanning validates the image. Runtime scanning catches drift. Neither alone is sufficient.
- The compliance gate needs a break-glass procedure. A gate that blocks all deploys on any finding — including a missing login banner — will eventually block a critical security patch. Use severity-based overrides with an audit trail.
- Start in audit mode, always. Deploy policies, scan existing resources, fix violations, then enforce. Enforcing on day one breaks existing workloads and turns teams against the policy.
Related Lessons¶
- Secrets Management Without Tears — encryption and secret handling across the stack
- GitOps: The Repo Is the Truth — change management through git, auditable by design
- Log Pipelines: From printf to Dashboard — the infrastructure behind centralized logging
- Supply Chain Security: Trusting Your Dependencies — image signing, SBOMs, and build provenance
- Permission Denied — RBAC, file permissions, SELinux, and access control debugging