Infrastructure Testing: Street-Level Ops

Quick Diagnosis Commands

# Run Terratest tests (from the test/ directory in your module)
cd modules/vpc/test && go test -v -timeout 30m -run TestVpcModule

# Run multiple tests in parallel
go test -v -timeout 60m -parallel 4 ./test/...

# Run Conftest against Terraform plan
terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/terraform/

# Run InSpec against localhost
inspec exec ./profiles/my-baseline/

# Run InSpec against remote host
inspec exec ./profiles/my-baseline/ -t ssh://ubuntu@10.0.1.5 --key-files ~/.ssh/id_rsa

# Validate Kubernetes manifests with kubeconform
kubeconform -strict -summary k8s/

# Check Kyverno policies
kyverno apply policy/ --resource k8s/deployment.yaml

# Validate all YAML in a directory
find k8s/ -name "*.yaml" -exec yamllint -d '{extends: relaxed}' {} +

Gotcha: Terraform Destroy Fails, Resources Are Leaked

Rule: Always pair Terraform apply with defer terraform.Destroy() in Terratest, and register the defer before the apply runs. If the apply fails or panics before the defer statement is reached, nothing cleans up and the resources leak.

// BAD: Destroy is deferred after apply, so a failed apply skips cleanup
func TestMyModule(t *testing.T) {
    opts := &terraform.Options{TerraformDir: "../"}
    terraform.InitAndApply(t, opts)   // a failure here can leave partial resources behind
    defer terraform.Destroy(t, opts)  // never registered if InitAndApply fails
    // assertions...
}

// BETTER: register defer before anything can fail
func TestMyModule(t *testing.T) {
    opts := &terraform.Options{TerraformDir: "../"}
    terraform.Init(t, opts)

    // Register cleanup immediately after init
    defer terraform.Destroy(t, opts)

    // Now apply
    terraform.Apply(t, opts)
    // assertions...
}
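The ordering rule can be checked without Terratest or any cloud account. The following is a minimal, stdlib-only sketch: `apply` and `runTest` are hypothetical stand-ins for a failing `terraform.Apply` and a test body, and the cleanup closure stands in for `terraform.Destroy`. Terratest helpers fail through `t.Fatal` rather than a raw panic, but deferred calls run in both cases, so the same ordering applies.

```go
package main

import "fmt"

// apply is a hypothetical stand-in for a terraform apply that fails halfway.
func apply() {
	panic("apply failed halfway")
}

// runTest simulates a test body. registerFirst controls whether cleanup is
// deferred before or after the failing step. It reports whether cleanup ran.
func runTest(registerFirst bool) (cleanedUp bool) {
	// Swallow the simulated failure so the caller can inspect the result.
	defer func() { _ = recover() }()

	cleanup := func() { cleanedUp = true }

	if registerFirst {
		defer cleanup() // registered before anything can fail
	}

	apply()

	if !registerFirst {
		defer cleanup() // too late: never reached when apply fails
	}
	return cleanedUp
}

func main() {
	fmt.Println("cleanup deferred before apply ran:", runTest(true))
	fmt.Println("cleanup deferred after apply ran: ", runTest(false))
}
```

Only the first call reports that cleanup ran, which is the behavior the BETTER variant above relies on.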

Also: track test resources with tags so you can find and delete leaked infra:

variable "test_run_id" {
  description = "Unique ID for this test run — used to identify leaked resources"
  type        = string
  default     = "manual"
}

resource "aws_vpc" "test" {
  cidr_block = var.vpc_cidr
  tags = {
    Name        = "test-vpc-${var.test_run_id}"
    ManagedBy   = "terratest"
    TestRunId   = var.test_run_id
    CreatedAt   = timestamp()
  }
}

uniqueID := random.UniqueId()
opts := &terraform.Options{
    Vars: map[string]interface{}{
        "test_run_id": uniqueID,
    },
}

Gotcha: Conftest Policy Has No Effect

Rule: Conftest reports success whenever no deny rule matches, including when the input is not in the shape your policy expects. Verify the input format before trusting a pass.

# This looks like it works, but it reports no failures even when plan.json is in the wrong format
conftest test plan.json --policy policy/

# Debug: check what conftest sees as input
conftest parse plan.json

Debug clue: When Conftest policies silently pass, run opa eval --data policy/main.rego --input plan.json 'data.main.deny' directly. If the result is an empty array, your Rego rule conditions are not matching. The most common cause: the policy reads attributes directly off the resource, while Terraform plan JSON nests them under resource_changes[_].change.after. Also check the package name: Conftest evaluates only the main package unless you pass --namespace or --all-namespaces.

# Evaluate the deny rule directly against the plan (this runs the rule, it does more than a parse check)
opa eval --data policy/main.rego --input plan.json 'data.main.deny'

# If deny is empty array, policies aren't matching — check your input structure
terraform show -json plan.tfplan | jq '.resource_changes[0]'
# Compare to what your Rego expects: input.resource_changes[_]
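When the jq output is hard to scan, the same structural check can be done in a few lines of Go. This is a sketch, not part of Conftest or Terraform: the `plan` struct and `summarize` function are hypothetical names, and the struct mirrors only the `resource_changes` fields discussed above. It prints one line per resource change so you can see at a glance whether the attributes your Rego reads actually live under `change.after`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// samplePlan is a tiny stand-in for real `terraform show -json` output.
const samplePlan = `{"resource_changes":[{"address":"aws_vpc.test","type":"aws_vpc",
"change":{"actions":["create"],"after":{"cidr_block":"10.0.0.0/16"}}}]}`

// plan mirrors only the parts of the plan JSON used in this sketch.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			Actions []string               `json:"actions"`
			After   map[string]interface{} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// summarize returns one line per resource change: address, type, actions,
// and how many attributes sit under change.after.
func summarize(raw []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("not valid plan JSON: %w", err)
	}
	lines := make([]string, 0, len(p.ResourceChanges))
	for _, rc := range p.ResourceChanges {
		lines = append(lines, fmt.Sprintf("%s type=%s actions=%v after_keys=%d",
			rc.Address, rc.Type, rc.Change.Actions, len(rc.Change.After)))
	}
	return lines, nil
}

func main() {
	// Uses the built-in sample by default; pass a path to inspect a real plan.json.
	raw := []byte(samplePlan)
	if len(os.Args) > 1 {
		data, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		raw = data
	}
	lines, err := summarize(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if len(lines) == 0 {
		fmt.Println("no resource_changes found: policies have nothing to match")
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

An empty summary for a plan you know contains changes usually means the file is not plan JSON at all, for example the binary plan.tfplan or plain `terraform plan` text output.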

Pattern: CI Pipeline (Plan → Policy → Apply)

# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
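      - uses: hashicorp/setup-terraform@v3
        with:
          # The default wrapper can add extra text to captured stdout and corrupt plan.json
          terraform_wrapper: false

      - name: Terraform Format Check
        run: terraform fmt -check -recursive infra/

      - name: Terraform Init
        # Positional directory arguments were removed in Terraform 0.15; use -chdir instead.
        # Backend settings are passed with -backend-config, not through an environment variable.
        run: terraform -chdir=infra init -backend-config="bucket=my-tfstate"

      - name: Terraform Validate
        run: terraform -chdir=infra validate

      - name: Terraform Plan
        run: |
          # plan.tfplan is written inside infra/; plan.json lands in the workspace root
          terraform -chdir=infra plan -out=plan.tfplan
          terraform -chdir=infra show -json plan.tfplan > plan.json

      - name: Run Conftest Policies
        run: |
          conftest test plan.json --policy infra/policy/ --output json | tee conftest-results.json
          # Fail if any deny rules triggered (tee hides conftest's own exit code)
          jq -e '.[] | select(.failures | length > 0) | .failures[]' conftest-results.json && exit 1 || true

      - name: Terraform Apply (main only)
        if: github.ref == 'refs/heads/main'
        run: terraform -chdir=infra apply plan.tfplan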

  terratest:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'

      - name: Run Terratest
        run: go test -v -timeout 60m ./infra/modules/.../test/...
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: us-east-1
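The jq gate in the Conftest step is terse, so here is the same check spelled out as a small Go sketch. It assumes the JSON reporter emits an array of per-file results whose `failures` and `warnings` entries are objects with a `msg` field, which is the shape the jq filter above relies on. The `fileResult` type and `countFindings` function are hypothetical names for illustration, not part of Conftest.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// finding is one policy message in the assumed Conftest JSON output.
type finding struct {
	Msg string `json:"msg"`
}

// fileResult is the assumed per-file entry in `conftest test --output json`.
type fileResult struct {
	Filename string    `json:"filename"`
	Failures []finding `json:"failures"`
	Warnings []finding `json:"warnings"`
}

// countFindings totals deny and warn results across all checked files.
func countFindings(raw []byte) (failures, warnings int, err error) {
	var results []fileResult
	if err := json.Unmarshal(raw, &results); err != nil {
		return 0, 0, fmt.Errorf("parse conftest output: %w", err)
	}
	for _, r := range results {
		failures += len(r.Failures)
		warnings += len(r.Warnings)
	}
	return failures, warnings, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: gate conftest-results.json")
		return
	}
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	failures, warnings, err := countFindings(raw)
	if err != nil {
		// Treat unreadable output as a failure rather than a silent pass.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%d failure(s), %d warning(s)\n", failures, warnings)
	if failures > 0 {
		os.Exit(1) // block the pipeline on any deny result
	}
}
```

One design choice worth copying even if you keep the jq version: unparseable output exits non-zero, so a broken reporter step cannot be mistaken for a clean policy run.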

Pattern: InSpec Profile for Kubernetes Nodes

# profiles/k8s-node/controls/cis_k8s.rb

control 'k8s-node-01' do
  impact 1.0
  title 'Kubelet should not allow anonymous auth'

  describe yaml('/var/lib/kubelet/config.yaml') do
    its(['authentication', 'anonymous', 'enabled']) { should eq false }
  end
end

control 'k8s-node-02' do
  impact 1.0
  title 'Kubelet authorization mode should not be AlwaysAllow'

  describe yaml('/var/lib/kubelet/config.yaml') do
    its(['authorization', 'mode']) { should_not eq 'AlwaysAllow' }
  end
end

control 'k8s-node-03' do
  impact 0.7
  title 'Protect kernel defaults'

  {
    'net.bridge.bridge-nf-call-iptables'  => '1',
    'net.ipv4.ip_forward'                 => '1',
    'net.bridge.bridge-nf-call-ip6tables' => '1',
  }.each do |param, value|
    describe kernel_parameter(param) do
      its('value') { should eq value.to_i }
    end
  end
end

# Run against all nodes using SSH
mkdir -p results
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo "=== Checking $node ==="
  inspec exec profiles/k8s-node/ \
    -t "ssh://ubuntu@${node}" \
    --key-files ~/.ssh/cluster-key \
    --reporter cli "json:results/${node}.json"
done

# Aggregate results: for each control, count the nodes that reported at least one failed result
jq -s 'map(.profiles[].controls[]) | group_by(.id) | map({id: .[0].id, failures: (map(select(any(.results[]; .status == "failed"))) | length)})' results/*.json

Scenario: Writing a Policy to Catch Cost Blowouts

# policy/terraform/expensive_resources.rego
package main

# Deny instance types larger than allowed
denied_instance_types := {
  "p4d.24xlarge", "p3.16xlarge", "x1e.32xlarge",
  "u-6tb1.metal", "u-9tb1.metal", "u-12tb1.metal"
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  denied_instance_types[resource.change.after.instance_type]
  msg := sprintf(
    "Instance '%s' uses prohibited instance type '%s' — get approval first",
    [resource.address, resource.change.after.instance_type]
  )
}

# Warn about resources without a cost-center tag
warn[msg] {
  resource := input.resource_changes[_]
  resource.change.actions[_] == "create"
  not resource.change.after.tags["cost-center"]
  msg := sprintf(
    "Resource '%s' will be created without a cost-center tag",
    [resource.address]
  )
}

# Deny memory-optimized (db.r5.*) RDS instances created without an 'approved-by' tag
deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.actions[_] == "create"
  startswith(resource.change.after.instance_class, "db.r5.")
  not resource.change.after.tags["approved-by"]
  msg := sprintf(
    "Large RDS instance '%s' requires 'approved-by' tag",
    [resource.address]
  )
}

# Test the policy
conftest test plan.json --policy policy/terraform/ --output json | \
  jq '.[] | {filename: .filename, failures: .failures, warnings: .warnings}'
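Running a real `terraform plan` just to exercise a policy is slow, so it can help to test against a hand-built fixture instead. The sketch below generates a minimal plan-shaped JSON document containing only the fields the rules above read. The type names, the `instanceFixture` function, and the `aws_instance.fixture` address are hypothetical and exist only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// change, resourceChange and planFixture model just enough of the plan JSON
// for the policies above: type, address, actions, and change.after.
type change struct {
	Actions []string               `json:"actions"`
	After   map[string]interface{} `json:"after"`
}

type resourceChange struct {
	Address string `json:"address"`
	Type    string `json:"type"`
	Change  change `json:"change"`
}

type planFixture struct {
	ResourceChanges []resourceChange `json:"resource_changes"`
}

// instanceFixture builds a one-resource plan that creates an untagged
// aws_instance of the given instance type.
func instanceFixture(instanceType string) ([]byte, error) {
	p := planFixture{ResourceChanges: []resourceChange{{
		Address: "aws_instance.fixture",
		Type:    "aws_instance",
		Change: change{
			Actions: []string{"create"},
			After:   map[string]interface{}{"instance_type": instanceType},
		},
	}}}
	return json.Marshal(p)
}

func main() {
	raw, err := instanceFixture("p4d.24xlarge")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(raw))
}
```

Redirect the output to a file such as fixture.json and pass it to conftest test with the same --policy directory. With a prohibited instance type and no tags, this fixture is expected to trigger both the instance-type deny rule and the cost-center warning.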

Emergency: Leaked Test Resources in AWS

# Find all resources tagged by terratest
aws resourcegroupstaggingapi get-resources \
  --tag-filters Key=ManagedBy,Values=terratest \
  --query 'ResourceTagMappingList[*].{ARN: ResourceARN, Tags: Tags}' \
  --output json

# List test VPCs with their CreatedAt tag so you can spot old ones (for example, older than 2 hours)
aws ec2 describe-vpcs \
  --filters "Name=tag:ManagedBy,Values=terratest" \
  --query 'Vpcs[*].{VpcId: VpcId, Name: Tags[?Key==`Name`].Value | [0], Created: Tags[?Key==`CreatedAt`].Value | [0]}' \
  --output table

War story: A team ran Terratest in CI without the test_run_id tagging pattern. Over three months, hundreds of orphaned VPCs, subnets, and NAT gateways accumulated in their test account. The monthly bill grew from $200 to $4,800 before anyone noticed. The cleanup took two days because VPCs with dependencies (ENIs, endpoints, subnets) cannot be deleted in a single command. Tag your test resources from day one.

# Delete the VPCs left behind by a specific test run
# WARNING: destructive, verify the describe-vpcs output first
# delete-vpc fails while subnets, gateways, or ENIs still exist; remove those first
LEAKED_RUN_ID="abcd1234"
aws ec2 describe-vpcs \
  --filters "Name=tag:TestRunId,Values=${LEAKED_RUN_ID}" \
  --query 'Vpcs[*].VpcId' --output text | \
  tr '\t' '\n' | \
  xargs -I {} aws ec2 delete-vpc --vpc-id {}

# Use aws-nuke for comprehensive cleanup (careful: very destructive)
# https://github.com/rebuy-de/aws-nuke
# aws-nuke -c nuke-config.yaml --no-dry-run
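The describe-vpcs query in this section lists the CreatedAt tag but does not filter by age. A small Go sketch of that missing step follows, assuming CreatedAt holds the RFC 3339 string that Terraform's timestamp() produces. The `taggedVPC` type, `staleVPCs` function, and the sample IDs are hypothetical; in practice you would fill the slice from the CLI's JSON output.

```go
package main

import (
	"fmt"
	"time"
)

// maxTestAge is the age past which a test VPC is treated as leaked.
const maxTestAge = 2 * time.Hour

// taggedVPC holds the two fields needed from the describe-vpcs query.
type taggedVPC struct {
	ID        string
	CreatedAt string // RFC 3339, as written by Terraform's timestamp()
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// staleVPCs splits VPCs into those older than maxAge at time now and those
// whose CreatedAt tag is missing or unparseable. The second group is
// returned separately so it is reviewed by hand instead of being ignored.
func staleVPCs(vpcs []taggedVPC, now time.Time, maxAge time.Duration) (stale, unknown []string) {
	for _, v := range vpcs {
		created, err := time.Parse(time.RFC3339, v.CreatedAt)
		if err != nil {
			unknown = append(unknown, v.ID)
			continue
		}
		if now.Sub(created) > maxAge {
			stale = append(stale, v.ID)
		}
	}
	return stale, unknown
}

func main() {
	// Fixed sample data so the sketch is deterministic.
	vpcs := []taggedVPC{
		{ID: "vpc-old", CreatedAt: "2024-05-01T09:00:00Z"},
		{ID: "vpc-new", CreatedAt: "2024-05-01T11:30:00Z"},
		{ID: "vpc-untagged", CreatedAt: ""},
	}
	now := mustParse("2024-05-01T12:00:00Z")

	stale, unknown := staleVPCs(vpcs, now, maxTestAge)
	fmt.Println("stale:", stale)
	fmt.Println("needs manual review:", unknown)
}
```

Keeping the untagged VPCs in their own list is deliberate: they may be leaks from before tagging was introduced, or they may not be test resources at all, so they should not be deleted automatically.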

Useful One-Liners

# Run a specific Terratest test by name
go test -v -run TestSpecificModuleByName -timeout 30m ./test/

# List all Conftest policies and their packages (the first line of a file is often a comment, so match the package line)
find policy/ -name "*.rego" -exec grep -H '^package ' {} +

# Test Conftest policy against a local YAML file without Terraform
conftest test k8s/deployment.yaml --policy policy/k8s/

# Validate a profile, then list the control IDs it defines
inspec check profiles/my-baseline/ && inspec json profiles/my-baseline/ | jq -r '.controls[].id'

# Validate all Rego files parse correctly
for f in $(find policy/ -name "*.rego"); do opa check "$f" && echo "OK: $f" || echo "FAIL: $f"; done

# kubeconform with schema for Argo Rollouts CRD
kubeconform \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  -summary k8s/

# Run InSpec and output only failed controls
inspec exec profiles/baseline/ --reporter json 2>/dev/null | \
  jq '.profiles[].controls[] | select(any(.results[]; .status == "failed")) | {id: .id, title: .title}'

# Get Terraform plan JSON and check for any destroy actions
terraform show -json plan.tfplan | jq '.resource_changes[] | select(.change.actions | contains(["delete"])) | {address: .address, actions: .change.actions}'