Infrastructure Testing Footguns
Mistakes that give you false confidence in your infra, leak cloud resources, or produce flaky tests.
1. Not registering defer terraform.Destroy() immediately
You apply infrastructure, then do some setup work, then register the defer. If the setup panics, destroy never runs. Cloud resources leak. You find them days later on the billing report.
Fix: Register the defer immediately after init, before apply. If you need to pass opts to both init and destroy, init first, register defer, then apply.
func TestMyModule(t *testing.T) {
    opts := &terraform.Options{TerraformDir: "../"}
    terraform.Init(t, opts)
    defer terraform.Destroy(t, opts) // register immediately after init
    terraform.Apply(t, opts)         // now apply; destroy will always run
}
2. Using shared Terraform state between parallel tests
You have 5 Terratest tests running in parallel. They all use the same S3 state file key. Test A's apply overwrites Test B's state. Both tests see unexpected resource state. One test destroys resources that belong to the other.
Fix: Use a unique state key per test run. Generate a unique ID and pass it as the backend config key.
uniqueID := random.UniqueId()
opts := &terraform.Options{
    TerraformDir: "../",
    BackendConfig: map[string]interface{}{
        "key": fmt.Sprintf("test-runs/%s/terraform.tfstate", uniqueID),
    },
    Vars: map[string]interface{}{
        "name_prefix": fmt.Sprintf("test-%s", uniqueID),
    },
}
3. Conftest policies that match the wrong JSON path
You write a policy targeting input.resource_changes[_].type. But your input is a Terraform plan from an older format, or you're passing a different JSON structure (raw Terraform config, not a plan). Conftest evaluates with zero matches and exits 0 — all clear. You think your policies passed; they never evaluated.
Fix: Always verify your Rego evaluates against actual data. Use opa eval to test:
opa eval --data policy/main.rego --input plan.json 'data.main.deny'
# If this returns [], your rules aren't matching — debug the input structure:
cat plan.json | jq '.resource_changes[0]'
# Compare what you see to what your Rego expects
Add a test to assert that at least one resource was checked:
# meta-policy: ensure we checked something
# (count() of a missing key is undefined, not zero, so cover both cases)
warn[msg] {
    not input.resource_changes
    msg := "Input has no resource_changes key; is this really plan JSON?"
}
warn[msg] {
    count(input.resource_changes) == 0
    msg := "No resource changes in plan; did you pass the right file?"
}
4. Not handling transient AWS API errors in Terratest
Your test fails with RequestError: send request failed at 2am in CI. You investigate — it's a transient AWS API hiccup. The test was clean; the failure was noise. Your CI pipeline is now blocked.
Fix: Configure RetryableTerraformErrors in your options. Terratest will retry on known transient errors.
opts := &terraform.Options{
    TerraformDir: "../",
    RetryableTerraformErrors: map[string]string{
        "RequestError: send request failed":      "Transient AWS API",
        "ResourceNotFoundException":              "Resource not yet visible",
        "ThrottlingException":                    "AWS throttling",
        "Error: timeout while waiting for state": "Eventual consistency",
    },
    MaxRetries:         3,
    TimeBetweenRetries: 10 * time.Second,
}
5. InSpec controls with overly broad resource matching
Your InSpec control checks "all S3 buckets must have encryption." But your AWS account has 400 buckets, including other teams' legacy buckets that you can't change. Every run fails on those 15 legacy buckets. The noise makes people ignore the control.
Fix: Scope your controls to resources you own. Use tags, name prefixes, or explicit lists.
# BAD: checks all buckets in the account
aws_s3_buckets.bucket_names.each do |bucket|
  describe aws_s3_bucket(bucket) do
    it { should have_default_encryption_enabled }
  end
end

# GOOD: only check buckets tagged as managed by this team
# (uses the resource's `tags` hash; `has_tag?` is not an inspec-aws matcher)
aws_s3_buckets.bucket_names
  .select { |b| aws_s3_bucket(b).tags["ManagedBy"] == "platform-team" }
  .each do |bucket|
    describe aws_s3_bucket(bucket) do
      it { should have_default_encryption_enabled }
    end
  end
6. Writing Conftest policies against Terraform source instead of plan JSON
You write a policy that parses .tf files as JSON (they're HCL, not JSON). You use conftest test main.tf. It parses incorrectly or not at all. Your policy never catches the misconfiguration.
Fix: Conftest for Terraform must operate on the plan JSON output, not source files. The plan JSON contains the resolved values (after variables, data sources, etc. are evaluated).
# WRONG: source HCL is not valid JSON for Conftest
conftest test main.tf
# RIGHT: use plan JSON
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
conftest test plan.json --policy policy/
7. Not testing module outputs
You test that your Terraform module creates resources. But you don't test that the outputs are correct. Downstream modules that consume these outputs get garbage values. The VPC module outputs the wrong subnet IDs. The EKS cluster module consumes those and deploys to the wrong subnets. Everything looks fine until your pods can't route traffic.
Fix: Explicitly assert on every output that downstream consumers use.
// Don't just check that resources exist; check that outputs are correct
vpcID := terraform.Output(t, opts, "vpc_id")
assert.Regexp(t, `^vpc-[a-f0-9]+$`, vpcID, "vpc_id should be a valid VPC ID")

privateSubnets := terraform.OutputList(t, opts, "private_subnet_ids")
assert.Len(t, privateSubnets, 3, "should have 3 private subnets")
for _, subnet := range privateSubnets {
    assert.Regexp(t, `^subnet-[a-f0-9]+$`, subnet)
}
8. Treating terraform validate as a correctness test
terraform validate checks syntax and type correctness. It doesn't catch: wrong CIDR ranges, missing tags, overly permissive security groups, resources in the wrong region, or anything requiring real API calls. A perfectly valid Terraform config can deploy completely broken infrastructure.
Fix: Use Terratest (apply + validate behavior), InSpec (check resulting state), and Conftest (policy gates on plan) in combination. validate is a syntax check, not a correctness check.
9. Hardcoding test resources that conflict across runs
Your test creates an S3 bucket named my-test-bucket. The first run creates it and destroys it. The second run (concurrent with the first) tries to create the same name. AWS says the bucket already exists. Both tests fail.
Fix: Always generate unique names with random.UniqueId(). Never hardcode resource names in tests.
// BAD: hardcoded name collides across concurrent runs
Vars: map[string]interface{}{
    "bucket_name": "my-test-bucket",
},

// GOOD: unique, lowercase name per run (S3 bucket names must be lowercase)
uniqueID := strings.ToLower(random.UniqueId())
Vars: map[string]interface{}{
    "bucket_name": fmt.Sprintf("test-bucket-%s", uniqueID),
},
10. Skipping infra tests in CI because they're "too slow"
Terratest is slow (10–30 minutes per test). So teams run it locally only, skip it in CI, and merge PRs without testing. Then a Terraform module change breaks a downstream module in production. "We would have caught that if we ran Terratest."
Fix: Run Terratest in CI on merge to main (not on PRs, to save time/cost). Keep tests focused on a single module. Use t.Parallel() to run multiple tests concurrently. Use Go test caching for stable tests.
# Only run expensive infra tests on main branch
terratest:
  if: github.ref == 'refs/heads/main'
  runs-on: ubuntu-latest
  timeout-minutes: 60
  steps:
    - run: go test -v -timeout 45m -parallel 4 ./infra/modules/.../test/
11. Not cleaning up InSpec target credentials
After running InSpec against production hosts with an SSH key or IAM credentials, you leave the credentials in CI environment variables indefinitely. Or the InSpec result file (with detailed system info) is uploaded as a CI artifact and is publicly accessible.
Fix: Rotate test credentials after each use. Mark InSpec result artifacts as private. Use temporary IAM credentials (AWS STS assume-role) that expire automatically.
# Use temporary credentials (expire after 1 hour)
CREDS=$(aws sts assume-role \
--role-arn arn:aws:iam::123456789:role/InSpecAuditRole \
--role-session-name "inspec-$(date +%s)" \
--duration-seconds 3600)
# export so the inspec child process inherits the credentials
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')
inspec exec profiles/aws-baseline/ -t aws://us-east-1
12. Conftest policies that allow empty resource lists
Your policy: "deny if any EC2 instance has no tags." If the plan creates zero EC2 instances, the deny rule never fires. Conftest exits 0. You think "policy passed." But you're applying an RDS cluster (different resource type) and the policy you wanted doesn't apply to it.
Fix: Write explicit policies for each resource type you care about. Don't rely on "nothing was denied" as proof of compliance.
# BETTER: an explicit rule per resource type and concern, so a pass
# means "this rule evaluated", not "nothing happened to match"
deny[msg] {
    # every production resource being created must carry a cost-center tag
    resource := input.resource_changes[_]
    resource.change.actions[_] == "create"
    resource.change.after.tags.environment == "production"
    not resource.change.after.tags["cost-center"]
    msg := sprintf("Production resource '%s' missing cost-center tag", [resource.address])
}