---
title: "Compliance as Code: Automating the Auditor"
tags:
  - lesson
  - compliance-frameworks
  - policy-as-code
  - opa/rego
  - kyverno
  - inspec
  - audit-logging
  - aws-config
  - ci/cd
  - cis-benchmarks
---

# Compliance as Code — Automating the Auditor

Topics: compliance frameworks, policy as code, OPA/Rego, Kyverno, InSpec, audit logging, AWS Config, CI/CD, CIS Benchmarks
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
## The Mission
The auditor is coming in 30 days.
Your company processes payments (PCI-DSS scope), stores health records for a partner (HIPAA-adjacent), and just signed an enterprise customer who requires SOC 2 Type II. The VP of Engineering just forwarded you an email from the compliance team: "Please provide evidence of current security controls across all production systems."
You open the shared drive. There's a spreadsheet called
compliance-tracker-FINAL-v3-ACTUAL-FINAL.xlsx. The last update was six months ago.
Column F says "SSH hardened" for 47 servers. Column G says "Reviewed by: Dave." Dave left
the company in October.
You have 30 days. You can spend them filling out spreadsheets and taking screenshots, or you can build a system that produces compliance evidence continuously, automatically, and honestly. By the end of this lesson, you'll have the tools and patterns to do the second thing.
We'll cover:

- Why manual compliance fails (and always will)
- The major compliance frameworks and what they actually require
- Policy as code with OPA/Rego and Kyverno
- Infrastructure compliance with InSpec and AWS Config
- Audit logging from Linux hosts through Kubernetes to the cloud
- How to wire all of this into your CI/CD pipeline
- Automating evidence collection so the next audit is a non-event
## Part 1: Why Spreadsheets Lie
Before we build anything, let's understand what we're replacing.
War Story: A fintech startup passed their first SOC 2 audit using a manually maintained compliance spreadsheet. The spreadsheet showed all 200+ controls as "met." When a new engineer ran the first automated scan six months later, 34 controls were failing — SSH root login was enabled on 12 servers, auditd wasn't running on 8 hosts, and three S3 buckets were publicly readable. The spreadsheet had been copy-pasted from the previous quarter without re-checking a single control. Point-in-time manual audits are fiction — they tell you what was true once, not what's true now.
Manual compliance fails for three structural reasons:
| Problem | What happens | Real cost |
|---|---|---|
| Point-in-time snapshots | You scan on Monday, someone changes SSH config on Tuesday | Drift goes undetected until the next audit |
| Human evidence collection | Screenshots, copy-pasted configs, "I checked it" | Evidence is stale, incomplete, or from the wrong environment |
| No enforcement | Policy says "no privileged containers" but nothing stops them | Violations accumulate between audits |
The compliance maturity ladder:

| Level | Name | What it looks like |
|---|---|---|
| 0 | Manual | Spreadsheets, screenshots, "trust me" |
| 1 | Scripted | One-off scan scripts, run before the audit |
| 2 | Scheduled | Cron jobs run scans weekly, reports emailed |
| 3 | Pipeline | Compliance checks in CI/CD, blocks bad builds |
| 4 | Continuous | Real-time monitoring + auto-remediation |
| 5 | Codified | Compliance profiles versioned in git, auditable diffs |
Most organizations are at Level 0-1. This lesson gets you to Level 3-4.
Mental Model: Think of compliance like tests in your codebase. You wouldn't ship code that was "manually verified to work last quarter." You run tests on every commit. Compliance should work the same way — automated, continuous, and blocking.
## Part 2: The Framework Landscape (What Auditors Actually Want)
Different frameworks care about different things, but they share a common core. Learn the core once and you can map it to any framework.
| Framework | Who needs it | Focus | Key requirement |
|---|---|---|---|
| SOC 2 | SaaS companies | Trust service criteria | Prove controls work over 6-12 months |
| PCI-DSS | Anyone touching payment cards | Cardholder data protection | Quarterly scans, daily log review |
| HIPAA | Healthcare data handlers | Protected health information | Encrypt PHI at rest and in transit |
| CIS Benchmarks | Everyone | OS/container/cloud hardening | Configuration baselines per platform |
| NIST 800-53 | Government contractors | Security controls catalog | Formal control selection and assessment |
Trivia: SOC 2 evolved from SAS 70, an auditing standard created for accounting firms in the 1990s. It was never designed for software companies. Its adaptation to tech was driven by cloud computing — enterprise customers needed assurance that SaaS providers were handling data responsibly. This origin explains many of its awkward requirements around "logical access" and "change management" that feel like they were written for a mainframe era — because they were.
The five controls that appear in every framework:
- Access management — who can access what, and is it the minimum needed?
- Encryption — data at rest and in transit
- Audit logging — who did what, when, from where
- Change management — how do changes get approved and tracked?
- Incident response — what happens when something goes wrong?
Remember: The mnemonic AEACI (pronounced "ACE-ee") — Access, Encryption, Audit, Change, Incident. Master these five control families and you've covered 80% of any compliance framework. The remaining 20% is framework-specific detail.
### SOC 2 Type I vs Type II
This trips people up constantly. Type I evaluates control design at a single point in time — are the right controls defined? Type II evaluates both design and operating effectiveness over a period (typically 6-12 months) — are controls actually working?
Type II is what enterprise customers require. It's the difference between "we have a lock on the door" (Type I) and "here are 12 months of access logs proving the lock works and only authorized people get in" (Type II). Automated evidence collection is what makes Type II achievable without a dedicated compliance team.
### Flashcard Check #1
Cover the answers and test yourself.
| Question | Answer |
|---|---|
| What's the difference between SOC 2 Type I and Type II? | Type I = control design at a point in time. Type II = design + effectiveness over 6-12 months. |
| Name three of the five universal control families. | Access management, encryption, audit logging, change management, incident response. |
| What compliance maturity level involves CI/CD compliance gates? | Level 3: Pipeline. |
| Why do point-in-time compliance checks always lie? | Systems drift between checks. A Monday scan says nothing about Friday's reality. |
## Part 3: Policy as Code — Teaching Machines to Say "No"
Compliance documents say things like "all containers must run as non-root." That's a policy. If it lives only in a PDF, it's a suggestion. If it lives in code that blocks non-compliant deployments, it's enforced.
Two tools dominate this space in Kubernetes: OPA Gatekeeper (Rego language) and Kyverno (YAML-native).
Name Origin: Rego is pronounced "ray-go." It's a declarative query language purpose-built for policy by the OPA project (Styra, donated to CNCF). OPA itself stands for Open Policy Agent. Kyverno comes from the Greek word "kyberno" (κυβερνώ), meaning "to govern" — the same root as "Kubernetes" (κυβερνήτης, "helmsman") and "cybernetics." Both tools govern the cluster, just in different languages.
### OPA/Rego: Block Privileged Containers
Here's a real OPA Gatekeeper policy. First, you define a ConstraintTemplate (the
reusable logic):
```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowprivileged

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Privileged container not allowed: %v", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          container.securityContext.privileged == true
          msg := sprintf("Privileged init container not allowed: %v", [container.name])
        }
```
Then a Constraint (where and how to apply it):
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowPrivileged
metadata:
  name: no-privileged-containers
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
```
Let's break down the Rego:
| Rego element | What it does |
|---|---|
| `package k8sdisallowprivileged` | Namespace for this policy |
| `violation[{"msg": msg}]` | A partial rule — collects all violations into a set |
| `input.review.object.spec.containers[_]` | Iterates over every container in the pod spec |
| `container.securityContext.privileged == true` | Checks the privileged flag |
| `sprintf("...", [container.name])` | Builds a human-readable violation message |
Multiple violation blocks are OR'd — if any block produces a message, the admission
is denied. Conditions within a block are AND'd — all must be true for that block to
fire.
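The semantics are easy to see in a minimal sketch (the package name and input fields here are hypothetical, not from the policy above; you can evaluate it with `opa eval`):

```rego
package demo

# Two blocks with the same head are OR'd: either body firing adds a violation.
violation[{"msg": "runs as root"}] {
    input.runAsUser == 0             # single condition
}

violation[{"msg": "privileged on host network"}] {
    input.privileged == true         # both lines in this body must
    input.hostNetwork == true        # hold (AND) for it to fire
}
```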
Gotcha: Always exclude `kube-system` and `gatekeeper-system` from your constraints. If a constraint blocks rescheduling of kube-dns or metrics-server, your cluster's DNS and monitoring break. A policy that takes down the cluster is worse than no policy.
### Kyverno: The YAML-Native Alternative
Same policy in Kyverno — no new language to learn:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - securityContext:
                  privileged: "!true"
```
| Feature | OPA Gatekeeper | Kyverno |
|---|---|---|
| Policy language | Rego (custom DSL) | YAML (native K8s) |
| Learning curve | Steep (Rego is unique) | Low (just YAML) |
| Mutation support | Limited | First-class |
| Resource generation | No | Yes (auto-create NetworkPolicies, etc.) |
| Best for | Complex logic, multi-platform orgs | K8s-native teams wanting fast adoption |
Mental Model: OPA is a general-purpose policy engine that happens to work with Kubernetes. Kyverno is a Kubernetes policy engine, period. If you only need K8s admission control, Kyverno gets you there faster. If you also need to enforce policy on Terraform plans, API authorization, and CI/CD pipelines with the same language, OPA is the better investment.
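To make that multi-platform point concrete, here is a sketch of the same language applied to a Terraform plan (the package name and message are illustrative; run it with conftest or `opa eval` against the JSON from `terraform show -json plan.out`, whose `resource_changes` array is part of the plan format):

```rego
package terraform.s3

# Deny any planned S3 bucket ACL that would make objects publicly readable
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_s3_bucket_acl"
    rc.change.after.acl == "public-read"
    msg := sprintf("Public bucket ACL not allowed: %v", [rc.address])
}
```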
### Testing Policies Before They Bite
Both tools support an audit/dry-run mode. The rollout pattern:
1. Deploy the policy in audit mode (Gatekeeper: `enforcementAction: dryrun`).
2. Wait 1-2 weeks. Collect violations without blocking anything.
3. Review: how many existing resources violate the policy?
4. Fix the violations.
5. Switch to enforce mode (Gatekeeper: `enforcementAction: deny`).
6. Never skip step 1 — enforcing immediately breaks existing workloads.
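Step 1 in practice is just the Constraint from earlier with a different enforcement action (a sketch):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowPrivileged
metadata:
  name: no-privileged-containers
spec:
  enforcementAction: dryrun   # record violations in .status, block nothing
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```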
For OPA, test Rego with unit tests before deploying:
```rego
package k8sdisallowprivileged_test

import data.k8sdisallowprivileged

# A partial set rule is defined (and truthy) even when empty,
# so assert on count() rather than on the set itself.
test_deny_privileged {
  count(k8sdisallowprivileged.violation) > 0 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "evil", "securityContext": {"privileged": true}}
    ]}}}
  }
}

test_allow_unprivileged {
  count(k8sdisallowprivileged.violation) == 0 with input as {
    "review": {"object": {"spec": {"containers": [
      {"name": "safe", "securityContext": {"privileged": false}}
    ]}}}
  }
}
```

```shell
# Run tests
opa test ./policies/ -v

# Syntax and type checking
opa check --strict ./policies/

# Test Gatekeeper ConstraintTemplates locally
gator test -f ./constraints/
```
## Part 4: Infrastructure Compliance — Scanning What Exists
Policy engines prevent bad things from being created. Compliance scanning finds bad things that already exist. You need both.
### InSpec: CIS Linux Benchmark Profile
InSpec expresses compliance controls as human-readable Ruby code. Here's a profile that checks key CIS Linux benchmark controls:
```ruby
# controls/cis_linux.rb

control 'cis-1.1.1' do
  impact 1.0
  title 'Ensure mounting of cramfs is disabled'
  desc 'Removing support for unnecessary filesystems reduces attack surface'
  describe kernel_module('cramfs') do
    it { should_not be_loaded }
    it { should be_disabled }
  end
end

control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure SSH root login is disabled'
  desc 'Disabling root login forces administrators to authenticate with personal accounts'
  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end

control 'cis-5.2.6' do
  impact 0.7
  title 'Ensure SSH idle timeout is configured'
  describe sshd_config do
    its('ClientAliveInterval') { should cmp <= 300 }
    its('ClientAliveCountMax') { should cmp <= 3 }
  end
end

control 'cis-4.1.1' do
  impact 1.0
  title 'Ensure auditd is installed and running'
  describe service('auditd') do
    it { should be_installed }
    it { should be_enabled }
    it { should be_running }
  end
end
```
```shell
# Run against a local system
inspec exec ./cis-linux-profile/

# Run against a remote host via SSH
inspec exec ./cis-linux-profile/ -t ssh://admin@10.0.1.50 -i ~/.ssh/audit_key

# Run against a Docker container
inspec exec ./cis-linux-profile/ -t docker://my-container-id

# Use a community profile from the InSpec Supermarket
inspec exec supermarket://dev-sec/linux-baseline

# Multiple output formats (CLI + JSON for automation + HTML for humans)
inspec exec ./cis-linux-profile/ --reporter cli json:results.json html:report.html
```
Trivia: Chef launched InSpec in 2015 as the first widely adopted compliance-as-code framework. Before InSpec, compliance checks were either manual checklists or opaque scanning tools that produced PDFs. InSpec made each compliance control a readable, executable test — the same paradigm shift that unit testing brought to application code.
### CIS Benchmarks: Level 1 vs Level 2
CIS Benchmarks define two levels of hardening:
| Level | Applies to | Impact | Example controls |
|---|---|---|---|
| Level 1 | Every server | Minimal perf impact | Disable unused filesystems, SSH hardening, password policy |
| Level 2 | High-security environments | May impact usability | SELinux enforcing, audit all privileged commands, USB disabled |
Gotcha: Don't apply CIS Level 2 everywhere. On a development VM, SELinux enforcing and USB-disable controls create friction without meaningful security benefit. Use Level 1 as the universal baseline and Level 2 for production and systems handling sensitive data. Auditors respect tiered profiles more than blanket application.
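One way to express that tiering in InSpec is a profile input that gates Level 2 controls. This is a sketch: the input name `hardening_level` and the control ID are illustrative, not from an official profile:

```ruby
# controls/level2.rb — Level 2 controls run only where explicitly enabled
hardening_level = input('hardening_level', value: 1)  # default: Level 1 baseline

control 'cis-level2-selinux' do
  impact 1.0
  title 'Ensure SELinux is enforcing (Level 2)'
  only_if('applies to Level 2 hosts only') { hardening_level >= 2 }
  describe command('getenforce') do
    its('stdout.strip') { should eq 'Enforcing' }
  end
end
```

On production hosts you would then run something like `inspec exec . --input hardening_level=2`; everywhere else the Level 2 controls are skipped, and the skip itself appears in the report as evidence of the tiering decision.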
### AWS Config Rules: Cloud-Native Compliance
AWS Config continuously evaluates your resources against rules. Some key compliance rules:
```json
{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  },
  "Scope": {
    "ComplianceResourceTypes": ["AWS::S3::Bucket"]
  }
}
```
AWS provides 300+ managed Config rules. The compliance-critical ones:
| AWS Config Rule | What it checks | Framework mapping |
|---|---|---|
| `s3-bucket-public-read-prohibited` | No publicly readable S3 buckets | PCI-DSS, SOC 2, CIS AWS |
| `encrypted-volumes` | EBS volumes are encrypted | HIPAA, PCI-DSS |
| `cloudtrail-enabled` | CloudTrail is active | All frameworks |
| `iam-root-access-key-check` | Root account has no access keys | CIS AWS Level 1 |
| `rds-storage-encrypted` | RDS instances are encrypted | HIPAA, PCI-DSS |
| `multi-region-cloudtrail-enabled` | CloudTrail covers all regions | SOC 2, PCI-DSS |
```shell
# Check compliance status via CLI
aws configservice describe-compliance-by-config-rule \
  --config-rule-names s3-bucket-public-read-prohibited \
  --query 'ComplianceByConfigRules[].Compliance.ComplianceType'

# Get non-compliant resources
aws configservice get-compliance-details-by-config-rule \
  --config-rule-name s3-bucket-public-read-prohibited \
  --compliance-types NON_COMPLIANT \
  --query 'EvaluationResults[].EvaluationResultIdentifier.EvaluationResultQualifier'
```
Under the Hood: AWS Config works by recording resource configuration changes as "configuration items" in a timeline. When you create a Config rule, AWS evaluates each resource against that rule whenever its configuration changes (or on a schedule). This is continuous compliance at the cloud provider level — no scanning agent needed.
Gotcha: AWS CloudTrail is enabled by default for management events, but retains only 90 days in the Event History console. For compliance (most frameworks require 1 year minimum), you must create a Trail that delivers to an S3 bucket. Many teams discover this gap only when they need logs from four months ago during an incident investigation.
### Flashcard Check #2
| Question | Answer |
|---|---|
| What two Kubernetes CRD types does OPA Gatekeeper introduce? | ConstraintTemplate (defines the Rego logic) and Constraint (applies it with parameters and scope). |
| Why should you always exclude kube-system from Gatekeeper constraints? | Blocking system pods (kube-dns, metrics-server) from rescheduling breaks cluster DNS and monitoring. |
| What's the difference between CIS Benchmark Level 1 and Level 2? | Level 1 = basic hardening, minimal performance impact, apply everywhere. Level 2 = defense in depth, may impact usability, for high-security environments. |
| What does `enforcementAction: dryrun` do in Gatekeeper? | Records violations in `.status.violations` but doesn't block admission requests. Safe for gradual rollout. |
| How long does AWS CloudTrail retain events by default without a Trail? | 90 days in Event History. For longer retention, create a Trail to S3. |
## Part 5: The Audit Log Pipeline
Compliance frameworks require audit logs that answer four questions: who did what, when, and from where. Let's trace the complete pipeline from source to immutable storage.
The audit log pipeline:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Application  │     │ Aggregator   │     │ Immutable    │     │ SIEM /       │
│ Sources      │───▶│ (ship off    │───▶│ Storage      │───▶│ Analysis     │
│              │     │ host ASAP)   │     │ (tamper-     │     │              │
│ • auditd     │     │              │     │  proof)      │     │ • Dashboards │
│ • K8s API    │     │ • Filebeat   │     │              │     │ • Alerts     │
│ • CloudTrail │     │ • Fluentd    │     │ • S3 Object  │     │ • Forensic   │
│ • App logs   │     │ • audisp     │     │   Lock       │     │   search     │
└──────────────┘     └──────────────┘     │ • WORM       │     └──────────────┘
                                          └──────────────┘
```
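As a sketch of the aggregator stage, a minimal Filebeat configuration shipping auditd and Kubernetes audit logs off-host. The log paths and the aggregator endpoint are assumptions for illustration:

```yaml
# filebeat.yml (fragment)
filebeat.inputs:
  - type: filestream
    id: auditd
    paths:
      - /var/log/audit/audit.log
  - type: filestream
    id: k8s-audit
    paths:
      - /var/log/kubernetes/audit.log

output.logstash:
  hosts: ["log-aggregator.internal:5044"]  # hypothetical aggregator endpoint
```

The design point is "ship off host ASAP": once events leave the machine, a compromised host can no longer rewrite its own history.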
### Layer 1: Linux Audit Framework (auditd)
auditd captures kernel-level events — every syscall, file access, and authentication event. Even root can't evade it without first disabling the audit subsystem, which itself generates an audit event.
```text
# Key audit rules for compliance (/etc/audit/rules.d/compliance.rules)

# Track all authentication events
-w /var/log/faillog -p wa -k logins
-w /var/log/lastlog -p wa -k logins

# Monitor sensitive files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/sudoers -p wa -k sudo_changes
-w /etc/ssh/sshd_config -p wa -k sshd_config

# Track privileged commands
-a always,exit -F path=/usr/bin/sudo -F perm=x -k privileged_sudo
-a always,exit -F path=/usr/bin/passwd -F perm=x -k privileged_passwd

# Monitor privilege escalation
-a always,exit -F arch=b64 -S setuid -S setgid -k privilege_escalation

# CRITICAL: Make rules immutable (last line — requires reboot to change)
-e 2
```
Breaking down the syntax:
| Flag | Meaning |
|---|---|
| `-w /path` | Watch this file or directory |
| `-p wa` | Trigger on write (w) and attribute change (a) |
| `-k tag` | Attach a searchable key tag |
| `-a always,exit` | Always generate a record when the syscall exits |
| `-F arch=b64` | Apply to 64-bit syscalls |
| `-S execve` | Monitor this specific syscall |
| `-e 2` | Lock rules — immutable until reboot |
Under the Hood: The `-e 2` flag sets the kernel audit configuration to "locked." Even root cannot modify, add, or delete audit rules after this. Only a reboot resets the lock. This is required by PCI-DSS 10.5.2 (protection of audit trails). But test your rules exhaustively before enabling immutability — fixing a bad rule requires bouncing the machine.

Trivia: The Linux audit framework was added in kernel 2.6 (2003) to meet Common Criteria (CAPP) certification requirements. It's one of the few subsystems where the kernel generates structured records rather than just writing to dmesg. The Bellcore paper from 1990 on cryptographic timestamping — which used hash chains to prevent backdating documents — directly inspired both immutable audit logging and, 18 years later, Bitcoin's blockchain.
### Layer 2: Kubernetes Audit Logs
The Kubernetes API server can log every request, but doesn't by default. Without the `--audit-policy-file` flag, someone can `kubectl delete namespace production` and there is zero record of who did it.
```yaml
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log all secret access (who's reading your credentials?)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Log pod exec/attach at full detail (someone opened a shell)
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach"]
  # Log all mutations (creates, updates, deletes)
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
  # Everything else: just metadata
  - level: Metadata
```
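The policy only takes effect once the API server is started with the audit flags. Roughly this, in the kube-apiserver static pod manifest (paths are illustrative):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment of the command args)
- --audit-policy-file=/etc/kubernetes/audit-policy.yaml
- --audit-log-path=/var/log/kubernetes/audit.log
- --audit-log-maxage=30      # days to retain rotated log files
- --audit-log-maxbackup=10   # number of rotated files to keep
```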
The four audit levels, from least to most verbose:
| Level | What's logged | Use for |
|---|---|---|
| `None` | Nothing | Health check endpoints, noisy read-only paths |
| `Metadata` | User, timestamp, resource, verb | Most resources — low overhead |
| `Request` | Metadata + request body | Mutation tracking — what changed |
| `RequestResponse` | Metadata + request + response body | Sensitive ops like pods/exec |
Gotcha: Never log secrets at `RequestResponse` level. The audit log will contain base64-encoded database passwords, API keys, and TLS certificates. Anyone with access to the audit log gets access to all secrets. Use `Metadata` for secrets — you get who accessed which secret and when, without the content.
### Layer 3: AWS CloudTrail
CloudTrail is the audit log for your AWS account. It records every API call: who made it, from what IP, what they did, and whether it succeeded.
```shell
# Who deleted that S3 bucket?
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteBucket \
  --start-time "2026-03-01" \
  --end-time "2026-03-23" \
  --query 'Events[].{Time:EventTime,User:Username,Source:EventSource}'

# Create a trail for long-term retention (required for compliance)
aws cloudtrail create-trail \
  --name compliance-trail \
  --s3-bucket-name my-audit-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation
```
That last flag — `--enable-log-file-validation` — is critical. It creates digest files with SHA-256 hashes of every log file. If anyone tampers with a log file, the hash won't match. Auditors love this.
### Layer 4: Immutable Storage
Logs that can be deleted aren't audit logs — they're suggestions. True immutability requires storage that even administrators cannot modify during the retention period.
```shell
# S3 Object Lock in Compliance mode — nobody can delete, not even root AWS account
aws s3api put-object-lock-configuration \
  --bucket audit-logs-production \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Years": 1
      }
    }
  }'
```
Under the Hood: S3 Object Lock in Compliance mode is genuinely immutable — AWS themselves cannot override it. Objects are retained for the specified period regardless of who requests deletion. This is the gold standard for PCI-DSS (1 year), HIPAA (6 years), and SOC 2 (varies by policy). Governance mode allows privileged users to override — fine for dev, never for compliance.
## Part 6: Compliance in CI/CD — The Gate That Matters
The best time to catch a compliance violation is before it reaches production. Here's a GitHub Actions workflow that runs InSpec against a container image and blocks deployment on critical failures:
```yaml
# .github/workflows/compliance-gate.yml
name: Compliance Gate

on:
  push:
    branches: [main]
  pull_request:

jobs:
  compliance-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build container image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Start test container
        run: docker run -d --name compliance-target myapp:${{ github.sha }}

      - name: Run InSpec CIS baseline
        run: |
          # Mount the Docker socket so InSpec's docker:// transport
          # can reach the target container through the daemon
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            -v $(pwd)/compliance:/share \
            chef/inspec exec /share/cis-profile \
            -t docker://compliance-target \
            --reporter json:/share/results.json cli
        continue-on-error: true

      - name: Evaluate compliance results
        run: |
          CRITICAL=$(jq '[.profiles[].controls[] |
            select(.impact >= 0.7 and .results[].status == "failed")] |
            length' compliance/results.json)
          TOTAL_FAIL=$(jq '[.profiles[].controls[] |
            select(.results[].status == "failed")] |
            length' compliance/results.json)
          echo "Critical failures: $CRITICAL"
          echo "Total failures: $TOTAL_FAIL"
          if [ "$CRITICAL" -gt 0 ]; then
            echo "::error::BLOCKED: $CRITICAL critical compliance failures"
            exit 1
          fi

      - name: Archive compliance evidence
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: compliance-evidence-${{ github.sha }}
          path: compliance/results.json
          retention-days: 365

      - name: Cleanup
        if: always()
        run: docker rm -f compliance-target
```
Key design decisions in this pipeline:

- Scan the container image, not the CI runner — you're testing what ships
- Block on critical (impact >= 0.7) failures only — a missing login banner (CAT III) shouldn't block a security patch
- Archive evidence with the git SHA — auditors can trace any evidence artifact to the exact code that produced it
- 365-day retention — meets PCI-DSS's 1-year requirement
- `if: always()` on evidence archival — even failed runs produce evidence (proving the gate works)
Gotcha: A compliance gate that nobody can override is a compliance gate that blocks security patches. Build a documented break-glass procedure: CAT III overrides need team lead approval, CAT I overrides need VP sign-off. Log every override. The override is not a bypass — it's a tracked exception with an audit trail.
## Part 7: Evidence Collection Automation
The audit is in 30 days. Here's the script that replaces six weeks of screenshot-taking:
```bash
#!/usr/bin/env bash
# evidence-collect.sh — Automated compliance evidence collection
set -euo pipefail

TIMESTAMP=$(date -u +%Y-%m-%dT%H%M%SZ)
HOSTNAME=$(hostname -f)
EVIDENCE_DIR="/var/evidence/${TIMESTAMP}_${HOSTNAME}"
mkdir -p "$EVIDENCE_DIR"

echo "=== Collecting evidence: $HOSTNAME at $TIMESTAMP ==="

# System identity
echo "$HOSTNAME" > "$EVIDENCE_DIR/hostname.txt"
uname -a > "$EVIDENCE_DIR/system-info.txt"
cat /etc/os-release >> "$EVIDENCE_DIR/system-info.txt"

# Security configuration
cp /etc/ssh/sshd_config "$EVIDENCE_DIR/"
cp /etc/audit/auditd.conf "$EVIDENCE_DIR/" 2>/dev/null || echo "auditd not configured" > "$EVIDENCE_DIR/auditd.conf"
cp /etc/security/pwquality.conf "$EVIDENCE_DIR/" 2>/dev/null || true

# Network exposure
ss -tlnp > "$EVIDENCE_DIR/listening-ports.txt"

# Privileged access ("|| true" keeps set -e from aborting when a group is absent)
getent group wheel sudo 2>/dev/null > "$EVIDENCE_DIR/privileged-users.txt" || true
cat /etc/sudoers.d/* 2>/dev/null > "$EVIDENCE_DIR/sudoers-drop-in.txt" || true

# Package inventory (for vulnerability correlation)
if command -v rpm &>/dev/null; then
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > "$EVIDENCE_DIR/packages.txt"
elif command -v dpkg &>/dev/null; then
    dpkg -l | awk '/^ii/ {print $2, $3}' | sort > "$EVIDENCE_DIR/packages.txt"
fi

# OpenSCAP CIS scan (if available)
if command -v oscap &>/dev/null; then
    SCAP_CONTENT=$(find /usr/share/xml/scap/ssg/content/ -name 'ssg-*-ds.xml' | head -1)
    if [ -n "$SCAP_CONTENT" ]; then
        timeout 600 oscap xccdf eval \
            --profile xccdf_org.ssgproject.content_profile_cis \
            --results "$EVIDENCE_DIR/cis-results.xml" \
            --report "$EVIDENCE_DIR/cis-report.html" \
            "$SCAP_CONTENT" 2>/dev/null || true
    fi
fi

# Integrity: checksum the whole bundle
sha256sum "$EVIDENCE_DIR"/* > "$EVIDENCE_DIR/checksums.sha256"

# Upload to immutable storage
if command -v aws &>/dev/null; then
    aws s3 cp "$EVIDENCE_DIR" \
        "s3://compliance-evidence/production/${TIMESTAMP}/${HOSTNAME}/" \
        --recursive --quiet
fi

echo "=== Evidence collected: $EVIDENCE_DIR ==="
```
Remember: Good compliance evidence has five properties: Timestamped, Machine-identified, Version-tagged, Immutable, Checksummed. Mnemonic: TMVIC. If any property is missing, an auditor can challenge the evidence. The script above hits all five: UTC timestamp in the directory name, hostname embedded, scan profile version in the SCAP results, uploaded to S3 Object Lock, and SHA-256 checksums.
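The checksum property is easy to demonstrate in isolation. A throwaway sketch of the verification an auditor (or you, post-incident) would run against a bundle, using temp paths only:

```shell
# Stand-alone demo: checksum a piece of evidence, then verify it.
tmp=$(mktemp -d)
printf 'PermitRootLogin no\n' > "$tmp/sshd_config"
(cd "$tmp" && sha256sum sshd_config > checksums.sha256)

# Verification succeeds only if the evidence is byte-for-byte unchanged
(cd "$tmp" && sha256sum -c checksums.sha256)   # prints: sshd_config: OK
rm -rf "$tmp"
```

Flip a single byte in `sshd_config` and the same `sha256sum -c` reports `FAILED` — which is exactly the property that makes the bundle defensible in front of an auditor.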
Run this via cron daily, or trigger it from Ansible after any configuration change. When the auditor asks "show me evidence that auditd was running on server X on March 15th," you query S3 instead of scrambling.
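A daily schedule is a single cron entry (the install path and run time are placeholders):

```text
# /etc/cron.d/compliance-evidence — collect evidence daily at 02:00 UTC
0 2 * * * root /usr/local/bin/evidence-collect.sh
```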
### Flashcard Check #3
| Question | Answer |
|---|---|
| What are the four stages of an audit log pipeline? | Collection (auditd, K8s API, CloudTrail) → Shipping (Filebeat/Fluentd) → Immutable storage (S3 Object Lock) → Analysis (SIEM, dashboards). |
| What does `-e 2` do in Linux audit rules? | Makes rules immutable — cannot be changed without a reboot. Even root can't modify them. |
| Why shouldn't you log Kubernetes secrets at RequestResponse level? | The audit log would contain the actual secret values (base64-encoded passwords, keys). Metadata level logs access without content. |
| What five properties must compliance evidence have? | Timestamped, Machine-identified, Version-tagged, Immutable, Checksummed (TMVIC). |
| What's the difference between S3 Object Lock Compliance mode and Governance mode? | Compliance mode: nobody can delete, not even the root AWS account. Governance mode: privileged users can override. Use Compliance for audit logs. |
## Part 8: Putting It All Together — The 30-Day Plan
You have 30 days before the auditor arrives. Here's the build-up, prioritized by evidence value:
**Week 1: Visibility (know what you have)**

- Days 1-2: Run OpenSCAP/InSpec against all prod hosts. Baseline current state.
- Day 3: Enable AWS Config rules for the CIS AWS benchmark.
- Days 4-5: Deploy a Kubernetes audit policy if missing. Enable a CloudTrail trail to S3 with log file validation.

**Week 2: Fix the big stuff**

- Days 6-8: Remediate all CAT I / impact >= 1.0 findings:
  - SSH root login disabled
  - auditd running and configured
  - Encryption at rest enabled
- Days 9-10: Deploy a policy engine (Gatekeeper or Kyverno) in audit mode. Start collecting policy violation data.

**Week 3: Automate**

- Days 11-13: Build the evidence collection script (or adopt the one above). Schedule daily runs via cron or Ansible.
- Days 14-15: Wire InSpec into the CI/CD pipeline as a compliance gate. Archive results to S3 with Object Lock.

**Week 4: Harden the process**

- Days 16-18: Remediate CAT II / medium findings. Document waivers for controls that conflict with the application.
- Days 19-20: Switch the policy engine from audit mode to enforce.
- Day 21: Final full scan. Package evidence bundles. Run internal review — play auditor.

**Buffer: Days 22-30**

- Fix any gaps found in internal review.
- Change freeze: no infrastructure changes in the final week.
- Prepare walkthrough notes for the auditor.
War Story: The "evidence collection phase consumes 60% of audit effort" statistic comes from a 2023 survey of IT compliance teams. Screenshots, log exports, config dumps, policy documents — teams spend weeks compiling these manually. The script above and the CI/CD pipeline together automate the most painful 60%. The auditor gets machine-generated, timestamped, checksummed evidence instead of screenshots. They love it. You love it. Everyone loves it.
## Exercises
### Exercise 1: Read an InSpec Control (2 minutes)
Look at this InSpec control and answer: what does it check, and what does `impact 1.0` mean?
```ruby
control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure SSH root login is disabled'
  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end
```
**Answer**

It checks that `PermitRootLogin` in sshd_config is set to `no`. An `impact` of 1.0 means this is a critical control — the highest severity. If this fails, it's a CAT I equivalent finding: direct security risk, must fix immediately.

### Exercise 2: Write a Rego Violation Rule (5 minutes)
Write a Rego violation rule that blocks any pod using the `:latest` image tag.
The container image is at `input.review.object.spec.containers[_].image`.
Hint

Check if the image string ends with `:latest` using `endswith()`, and also catch images with no tag at all (which default to `:latest`) using `not contains()`.

Solution
```rego
package k8sdisallowlatest

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  endswith(container.image, ":latest")
  msg := sprintf("Container %v uses :latest tag", [container.name])
}

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  not contains(container.image, ":")
  msg := sprintf("Container %v has no tag (defaults to :latest)", [container.name])
}
```
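The same tag logic is easy to sanity-check outside the cluster. A Python sketch (the `uses_latest` helper is hypothetical, and slightly stricter than the Rego above: it also handles digest pins and registry ports, which the naive `contains(":")` check would misclassify):

```python
def uses_latest(image: str) -> bool:
    """True if an image reference resolves to :latest, either via an
    explicit tag or by omitting the tag entirely.
    Simplification: digest pins (@sha256:...) count as pinned, and only
    the last path segment is inspected so registry ports (registry:5000/app)
    don't produce a false negative."""
    if "@" in image:                  # pinned by digest, never :latest
        return False
    name = image.rsplit("/", 1)[-1]   # drop registry/namespace parts
    return ":" not in name or name.endswith(":latest")
```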
Exercise 3: Design an Audit Rule Set (10 minutes)¶
You're securing a Linux server that handles payment card data (PCI-DSS scope). Write
auditctl rules that cover these PCI requirements:
- Log all access to cardholder data files in /opt/payment/data/
- Log all privileged command execution (sudo)
- Log all authentication events (changes to passwd, shadow)
- Log all changes to audit configuration
Solution

```bash
# Cardholder data access (PCI-DSS 10.2.1)
auditctl -w /opt/payment/data/ -p rwxa -k cardholder_data

# Privileged commands (PCI-DSS 10.2.2)
auditctl -a always,exit -F path=/usr/bin/sudo -F perm=x -k privileged_sudo

# Authentication events (PCI-DSS 10.2.5)
auditctl -w /etc/passwd -p wa -k identity
auditctl -w /etc/shadow -p wa -k identity

# Audit configuration changes (PCI-DSS 10.2.6)
auditctl -w /etc/audit/ -p wa -k audit_config
auditctl -w /etc/audit/auditd.conf -p wa -k audit_config
```
Cheat Sheet¶
| Tool | Purpose | Quick command |
|---|---|---|
| OpenSCAP | SCAP-based scanning (CIS, STIG) | `oscap xccdf eval --profile cis --results out.xml content.xml` |
| InSpec | Compliance as Ruby code | `inspec exec profile -t ssh://host --reporter json:out.json` |
| OPA Gatekeeper | K8s admission policy (Rego) | `gator test -f ./constraints/` |
| Kyverno | K8s admission policy (YAML) | `kyverno apply ./policies/ --resource pod.yaml` |
| AWS Config | Cloud resource compliance | `aws configservice describe-compliance-by-config-rule` |
| CloudTrail | AWS API audit log | `aws cloudtrail lookup-events --lookup-attributes ...` |
| auditctl | Linux audit rule management | `auditctl -w /etc/passwd -p wa -k identity` |
| ausearch | Search Linux audit logs | `ausearch -k identity -i` |
| aureport | Audit log reports | `aureport --auth --failed` |
| Conftest | OPA/Rego for config files | `conftest test deployment.yaml` |

| Compliance term | Meaning |
|---|---|
| CAT I / Impact 1.0 | Critical — fix immediately, no exceptions |
| CAT II / Impact 0.7 | Medium — fix with plan, waivers possible |
| CAT III / Impact 0.3 | Low — fix when practical |
| SCAP | Security Content Automation Protocol (NIST standard) |
| XCCDF | Checklist format within SCAP |
| WORM | Write Once Read Many (immutable storage) |
| S3 Object Lock | AWS immutable storage (Compliance or Governance mode) |

| Retention requirement | Duration |
|---|---|
| PCI-DSS | 1 year (3 months immediately accessible) |
| HIPAA | 6 years |
| SOC 2 | 1 year (per org policy) |
| FedRAMP | 90 days online, 1 year archived |
Takeaways¶
- Manual compliance is compliance theater. If your evidence is screenshots and spreadsheets, it was stale before you saved it. Automate evidence collection and run it continuously.
- Policy as code turns suggestions into enforcement. A PDF that says "no privileged containers" is a suggestion. An OPA/Kyverno admission webhook is a wall. Ship both.
- Audit logs must leave the host immediately. An attacker's first move is deleting local logs. Ship to immutable storage (S3 Object Lock in Compliance mode) in near-real-time.
- Scan at three points: build, deploy, and continuously in production. Build-time scanning validates the image. Runtime scanning catches drift. Neither alone is sufficient.
- The compliance gate needs a break-glass procedure. A gate that blocks all deploys on any finding — including a missing login banner — will eventually block a critical security patch. Use severity-based overrides with an audit trail.
- Start in audit mode, always. Deploy policies, scan existing resources, fix violations, then enforce. Enforcing on day one breaks existing workloads and turns teams against the policy.
Related Lessons¶
- Secrets Management Without Tears — encryption and secret handling across the stack
- GitOps: The Repo Is the Truth — change management through git, auditable by design
- Log Pipelines: From printf to Dashboard — the infrastructure behind centralized logging
- Supply Chain Security: Trusting Your Dependencies — image signing, SBOMs, and build provenance
- Permission Denied — RBAC, file permissions, SELinux, and access control debugging