Infrastructure Testing — Street-Level Ops¶
Quick Diagnosis Commands¶
# Run Terratest tests (from the test/ directory in your module)
cd modules/vpc/test && go test -v -timeout 30m -run TestVpcModule
# Run multiple tests in parallel
go test -v -timeout 60m -parallel 4 ./test/...
# Run Conftest against Terraform plan
terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/terraform/
# Run InSpec against localhost
inspec exec ./profiles/my-baseline/
# Run InSpec against remote host
inspec exec ./profiles/my-baseline/ -t ssh://ubuntu@10.0.1.5 --key-files ~/.ssh/id_rsa
# Validate Kubernetes manifests with kubeconform
kubeconform -strict -summary k8s/
# Check Kyverno policies
kyverno apply policy/ --resource k8s/deployment.yaml
# Validate all YAML in a directory
find k8s/ -name "*.yaml" -exec yamllint -d '{extends: relaxed}' {} +
Gotcha: Terraform Destroy Fails — Resources Are Leaked¶
Rule: In Terratest, register defer terraform.Destroy() before the apply, not after it. A deferred call only runs if it was registered before the test failed or panicked, so cleanup deferred after a failing apply never executes and the partially created resources leak.
// BAD: Destroy is deferred after the apply, so a failed apply skips cleanup
func TestMyModule(t *testing.T) {
	opts := &terraform.Options{TerraformDir: "../"}
	terraform.InitAndApply(t, opts)  // fails partway: test stops here, partial resources remain
	defer terraform.Destroy(t, opts) // never registered when the apply fails
}
// BETTER: register the defer before Apply can create anything
func TestMyModule(t *testing.T) {
	opts := &terraform.Options{TerraformDir: "../"}
	terraform.Init(t, opts)
	// Init creates no infrastructure, so registering cleanup here is early enough
	defer terraform.Destroy(t, opts)
	// Now apply
	terraform.Apply(t, opts)
	// assertions...
}
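The ordering rule can be checked without Terraform. The following is a minimal sketch using only the standard library: runTest, apply, and destroy are hypothetical stand-ins, and runtime.Goexit plays the role of t.FailNow (which the testing package documents as stopping the test via runtime.Goexit).

```go
package main

import (
	"fmt"
	"runtime"
)

// runTest simulates a test body running in its own goroutine, as "go test" does.
// applyFails stands in for a failing apply that calls t.FailNow.
// It reports whether the (hypothetical) destroy step ran.
func runTest(deferBeforeApply, applyFails bool) (destroyed bool) {
	done := make(chan struct{})
	go func() {
		defer close(done) // registered first, so it runs last
		destroy := func() { destroyed = true }
		apply := func() {
			if applyFails {
				runtime.Goexit() // deferred calls registered so far still run
			}
		}
		if deferBeforeApply {
			defer destroy()
			apply()
			return
		}
		apply()
		defer destroy() // never registered when apply exits early
	}()
	<-done
	return destroyed
}

func main() {
	fmt.Println("defer before apply, apply fails:", runTest(true, true))
	fmt.Println("defer after apply, apply fails: ", runTest(false, true))
}
```

Running it prints true for the first case and false for the second: only a defer registered before the failure gets a chance to clean up.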
Also: track test resources with tags so you can find and delete leaked infra:
variable "test_run_id" {
description = "Unique ID for this test run — used to identify leaked resources"
type = string
default = "manual"
}
resource "aws_vpc" "test" {
cidr_block = var.vpc_cidr
tags = {
Name = "test-vpc-${var.test_run_id}"
ManagedBy = "terratest"
TestRunId = var.test_run_id
CreatedAt = timestamp() # re-evaluated on every plan; acceptable for short-lived test infra
}
}
// In the Go test: generate a unique ID per run and pass it to the module
uniqueID := random.UniqueId()
opts := &terraform.Options{
	TerraformDir: "../",
	Vars: map[string]interface{}{
		"test_run_id": uniqueID,
	},
}
Gotcha: Conftest Policy Has No Effect¶
Rule: Conftest silently succeeds if it finds no input to check. Verify you're pointing at the right input format.
# This looks like it works, but it reports nothing if plan.json is in the wrong format
conftest test plan.json --policy policy/
# Debug: check what conftest sees as input
conftest parse plan.json
Debug clue: When Conftest policies silently pass, run opa eval --data policy/main.rego --input plan.json 'data.main.deny' directly. If the result is an empty array, your Rego rule conditions are not matching. The most common cause: Terraform plan JSON nests resource attributes under resource_changes[_].change.after, not directly under the resource.
# Evaluate the deny rule directly against the plan
opa eval --data policy/main.rego --input plan.json 'data.main.deny'
# If deny is empty array, policies aren't matching — check your input structure
terraform show -json plan.tfplan | jq '.resource_changes[0]'
# Compare to what your Rego expects: input.resource_changes[_]
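To make the nesting concrete, here is a minimal Go sketch that walks the same structure a Rego rule has to walk. It assumes the plan JSON has the resource_changes[_].change.after layout described above; the plan type and the instanceTypes helper are hypothetical names that cover only the fields used here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan mirrors only the parts of `terraform show -json` output used below.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			Actions []string               `json:"actions"`
			After   map[string]interface{} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// instanceTypes maps each aws_instance address to its planned instance_type.
func instanceTypes(planJSON []byte) (map[string]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	out := map[string]string{}
	for _, rc := range p.ResourceChanges {
		if rc.Type != "aws_instance" {
			continue
		}
		// Planned attributes live under change.after, not on the resource itself.
		// For a delete, "after" is null and the lookup simply finds nothing.
		if it, ok := rc.Change.After["instance_type"].(string); ok {
			out[rc.Address] = it
		}
	}
	return out, nil
}

func main() {
	sample := []byte(`{"resource_changes":[{"address":"aws_instance.web","type":"aws_instance",
		"change":{"actions":["create"],"after":{"instance_type":"t3.micro"}}}]}`)
	types, err := instanceTypes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(types)
}
```

If a helper like this finds nothing in your real plan.json, the Rego rule will not match either, which points at the input rather than the policy.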
Pattern: CI Pipeline — Plan → Policy → Apply¶
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Format Check
        run: terraform fmt -check -recursive infra/
      - name: Terraform Init
        # Current Terraform releases reject a positional directory argument; use -chdir.
        # Backend settings are passed with -backend-config, not an environment variable.
        run: terraform -chdir=infra init -backend-config="bucket=my-tfstate"
      - name: Terraform Validate
        run: terraform -chdir=infra validate
      - name: Terraform Plan
        run: |
          # The plan file is written relative to infra/; plan.json lands in the workspace root
          terraform -chdir=infra plan -out=plan.tfplan
          terraform -chdir=infra show -json plan.tfplan > plan.json
      - name: Run Conftest Policies
        run: |
          # conftest exits non-zero when a deny rule fires; pipefail keeps tee from masking it
          set -o pipefail
          conftest test plan.json --policy infra/policy/ --output json | tee conftest-results.json
      - name: Terraform Apply (main only)
        if: github.ref == 'refs/heads/main'
        run: terraform -chdir=infra apply plan.tfplan
  terratest:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'
      - name: Run Terratest
        run: go test -v -timeout 60m ./infra/modules/.../test/...
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: us-east-1
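If you would rather gate the pipeline on the saved results file than on an exit code, the check can be sketched in Go. This is an illustration, not Conftest's own API: checkResult and failureMessages are hypothetical names, and the "msg" key inside each failure is an assumption about the --output json layout (the filename and failures keys are the ones the jq filters in this document already use).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkResult mirrors the fields of `conftest test --output json` used here.
// The "msg" key is an assumption about the output layout.
type checkResult struct {
	Filename string `json:"filename"`
	Failures []struct {
		Msg string `json:"msg"`
	} `json:"failures"`
}

// failureMessages returns one "file: message" line per triggered deny rule.
// Unparseable input is an error, so a broken results file cannot pass the gate.
func failureMessages(raw []byte) ([]string, error) {
	var results []checkResult
	if err := json.Unmarshal(raw, &results); err != nil {
		return nil, fmt.Errorf("conftest output is not valid JSON: %w", err)
	}
	var msgs []string
	for _, r := range results {
		for _, f := range r.Failures {
			msgs = append(msgs, r.Filename+": "+f.Msg)
		}
	}
	return msgs, nil
}

func main() {
	sample := []byte(`[{"filename":"plan.json","failures":[{"msg":"Instance 'aws_instance.gpu' uses prohibited instance type 'p3.16xlarge'"}]}]`)
	msgs, err := failureMessages(sample)
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

The design point is that an empty or truncated results file is treated as a failure, where a jq filter followed by "|| true" would let it through.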
Pattern: InSpec Profile for Kubernetes Nodes¶
# profiles/k8s-node/controls/cis_k8s.rb
control 'k8s-node-01' do
  impact 1.0
  title 'Kubelet should not allow anonymous auth'
  # The kubelet config file is YAML, so read it with the yaml resource
  describe yaml('/var/lib/kubelet/config.yaml') do
    its(['authentication', 'anonymous', 'enabled']) { should eq false }
  end
end
control 'k8s-node-02' do
  impact 1.0
  title 'Kubelet authorization mode should not be AlwaysAllow'
  describe yaml('/var/lib/kubelet/config.yaml') do
    its(['authorization', 'mode']) { should_not eq 'AlwaysAllow' }
  end
end
control 'k8s-node-03' do
  impact 0.7
  title 'Protect kernel defaults'
  {
    'net.bridge.bridge-nf-call-iptables' => '1',
    'net.ipv4.ip_forward' => '1',
    'net.bridge.bridge-nf-call-ip6tables' => '1',
  }.each do |param, value|
    describe kernel_parameter(param) do
      its('value') { should eq value.to_i }
    end
  end
end
# Run against all nodes using SSH
mkdir -p results
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo "=== Checking $node ==="
  inspec exec profiles/k8s-node/ \
    -t "ssh://ubuntu@${node}" \
    --key-files ~/.ssh/cluster-key \
    --reporter cli "json:results/${node}.json"
done
# Aggregate results: per control, the number of nodes with at least one failed result
# (any() counts a control once per node, even when several of its results failed)
jq -s 'map(.profiles[].controls[]) | group_by(.id) | map({id: .[0].id, failures: (map(select(any(.results[]; .status == "failed"))) | length)})' results/*.json
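The same aggregation can be sketched in Go for cases where the results feed a larger tool. The report type covers only the fields the jq filter above touches (profiles, controls, id, results, status); failedNodesPerControl is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report mirrors the parts of the InSpec JSON reporter output used below.
type report struct {
	Profiles []struct {
		Controls []struct {
			ID      string `json:"id"`
			Results []struct {
				Status string `json:"status"`
			} `json:"results"`
		} `json:"controls"`
	} `json:"profiles"`
}

// failedNodesPerControl counts, per control ID, how many control entries across
// the given reports contain at least one result with status "failed".
// Controls that never fail still appear, with a count of 0.
func failedNodesPerControl(reports [][]byte) (map[string]int, error) {
	counts := map[string]int{}
	for _, raw := range reports {
		var r report
		if err := json.Unmarshal(raw, &r); err != nil {
			return nil, err
		}
		for _, p := range r.Profiles {
			for _, c := range p.Controls {
				if _, seen := counts[c.ID]; !seen {
					counts[c.ID] = 0
				}
				for _, res := range c.Results {
					if res.Status == "failed" {
						counts[c.ID]++
						break // count this control once, however many results failed
					}
				}
			}
		}
	}
	return counts, nil
}

func main() {
	nodeA := []byte(`{"profiles":[{"controls":[{"id":"k8s-node-01","results":[{"status":"failed"},{"status":"failed"}]}]}]}`)
	nodeB := []byte(`{"profiles":[{"controls":[{"id":"k8s-node-01","results":[{"status":"passed"}]}]}]}`)
	counts, err := failedNodesPerControl([][]byte{nodeA, nodeB})
	if err != nil {
		panic(err)
	}
	fmt.Println(counts)
}
```

With one report per node, as in the loop above, each count is the number of nodes failing that control.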
Scenario: Writing a Policy to Catch Cost Blowouts¶
# policy/terraform/expensive_resources.rego
package main
# Deny instance types larger than allowed
denied_instance_types := {
"p4d.24xlarge", "p3.16xlarge", "x1e.32xlarge",
"u-6tb1.metal", "u-9tb1.metal", "u-12tb1.metal"
}
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_instance"
resource.change.after.instance_type == denied_instance_types[_]
msg := sprintf(
"Instance '%s' uses prohibited instance type '%s' — get approval first",
[resource.address, resource.change.after.instance_type]
)
}
# Warn about resources without a cost-center tag
warn[msg] {
resource := input.resource_changes[_]
resource.change.actions[_] == "create"
not resource.change.after.tags["cost-center"]
msg := sprintf(
"Resource '%s' will be created without a cost-center tag",
[resource.address]
)
}
# Deny new db.r5-class RDS instances that lack an approved-by tag
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_db_instance"
resource.change.actions[_] == "create"
startswith(resource.change.after.instance_class, "db.r5.")
not resource.change.after.tags["approved-by"]
msg := sprintf(
"Large RDS instance '%s' requires 'approved-by' tag",
[resource.address]
)
}
# Test the policy
conftest test plan.json --policy policy/terraform/ --output json | \
jq '.[] | {filename: .filename, failures: .failures, warnings: .warnings}'
Emergency: Leaked Test Resources in AWS¶
# Find all resources tagged by terratest
aws resourcegroupstaggingapi get-resources \
--tag-filters Key=ManagedBy,Values=terratest \
--query 'ResourceTagMappingList[*].{ARN: ResourceARN, Tags: Tags}' \
--output json
# List test VPCs with their CreatedAt tag (compare it to the current time to spot ones older than 2 hours)
aws ec2 describe-vpcs \
--filters "Name=tag:ManagedBy,Values=terratest" \
--query 'Vpcs[*].{VpcId: VpcId, Name: Tags[?Key==`Name`].Value | [0], Created: Tags[?Key==`CreatedAt`].Value | [0]}' \
--output table
War story: A team ran Terratest in CI without the test_run_id tagging pattern. Over three months, hundreds of orphaned VPCs, subnets, and NAT gateways accumulated in their test account. The monthly bill grew from $200 to $4,800 before anyone noticed. The cleanup took two days because VPCs with dependencies (ENIs, endpoints, subnets) cannot be deleted in a single command. Tag your test resources from day one.
# Delete the VPCs that carry a specific test run tag
# WARNING: destructive; verify the list first
# delete-vpc fails while subnets, ENIs, or endpoints still exist; remove those first
LEAKED_RUN_ID="abcd1234"
aws ec2 describe-vpcs \
  --filters "Name=tag:TestRunId,Values=${LEAKED_RUN_ID}" \
  --query 'Vpcs[*].VpcId' --output text | \
  xargs -n1 aws ec2 delete-vpc --vpc-id
# Use aws-nuke for comprehensive cleanup (careful: very destructive)
# https://github.com/rebuy-de/aws-nuke
# aws-nuke -c nuke-config.yaml --no-dry-run
Useful One-Liners¶
# Run a specific Terratest test by name
go test -v -run TestSpecificModuleByName -timeout 30m ./test/
# List all Conftest policies and their packages
grep -rH --include='*.rego' '^package ' policy/
# Test Conftest policy against a local YAML file without Terraform
conftest test k8s/deployment.yaml --policy policy/k8s/
# Validate a profile, then list the control IDs it defines
inspec check profiles/my-baseline/ && inspec json profiles/my-baseline/ | jq -r '.controls[].id'
# Validate all Rego files parse correctly
for f in $(find policy/ -name "*.rego"); do opa check "$f" && echo "OK: $f" || echo "FAIL: $f"; done
# kubeconform with schema for Argo Rollouts CRD
kubeconform \
-schema-location default \
-schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
-summary k8s/
# Run InSpec and output only failed controls
inspec exec profiles/baseline/ --reporter json 2>/dev/null | \
  jq '.profiles[].controls[] | select(any(.results[]; .status == "failed")) | {id: .id, title: .title}'
# Get Terraform plan JSON and check for any destroy actions
terraform show -json plan.tfplan | jq '.resource_changes[] | select(.change.actions | contains(["delete"])) | {address: .address, actions: .change.actions}'