
Anti-Primer: Terraform

Everything that can go wrong, will — and in this story, it does.

The Setup

A cloud team is migrating from manually provisioned AWS resources to Terraform. Friday afternoon, two engineers are both working on the same Terraform state to meet a Monday deadline for the new VPC setup.

The Timeline

Hour 0: Apply Without Reading Plan

One engineer runs terraform apply -auto-approve in CI without reviewing the plan output. The deadline is looming, and this seems like the fastest path forward. The result: Terraform replaces the production RDS instance, triggering a full database rebuild.

Footgun #1: Apply Without Reading Plan. Running terraform apply -auto-approve in CI without reviewing the plan output lets Terraform replace the production RDS instance, triggering a full database rebuild.
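One way to avoid this footgun is to split CI into a plan stage and an apply stage. The following is a minimal sketch, assuming a POSIX shell with Terraform on the PATH; the function names and the tfplan file name are illustrative and not taken from the team's pipeline.

```shell
# Stage 1: save the plan to a file and print it so a human can review it.
# "tfplan" is an arbitrary file name chosen for this sketch.
plan_for_review() {
  terraform plan -input=false -out=tfplan || return 1
  terraform show -no-color tfplan
}

# Succeeds if the plan text on stdin mentions a replacement or a destroy.
# This matches wording from Terraform's human-readable plan output, which
# may vary between versions; treat it as a rough tripwire, not a guarantee.
plan_is_destructive() {
  grep -Eq 'must be replaced|will be destroyed'
}

# Stage 2: apply exactly the saved plan, after it has been reviewed.
apply_reviewed_plan() {
  terraform apply -input=false tfplan
}
```

A pipeline might pipe the output of plan_for_review into plan_is_destructive and require manual approval whenever it succeeds. Applying a saved plan file executes only the changes recorded in that file, so what was reviewed is what runs.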

Nobody notices yet. The engineer moves on to the next task.

Hour 1: State File in Git

Someone commits terraform.tfstate to the repository because 'the whole team needs access'. Under time pressure, the team chooses speed over caution. The result: the AWS access keys and database passwords recorded in the state file are now in git history forever.

Footgun #2: State File in Git. Committing terraform.tfstate because 'the whole team needs access' puts AWS access keys and database passwords in git history forever.
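As a stopgap while a remote backend is being set up, local state can at least be kept out of the repository. This is a minimal sketch, assuming a POSIX shell run from the repository root; the function names are hypothetical.

```shell
# Add Terraform state patterns to .gitignore, skipping any already present.
ignore_terraform_state() {
  for pattern in '.terraform/' '*.tfstate' '*.tfstate.*'; do
    grep -qxF -- "$pattern" .gitignore 2>/dev/null ||
      printf '%s\n' "$pattern" >> .gitignore
  done
}

# Stop tracking state files that were already committed, keeping local copies.
untrack_state() {
  git rm --cached --ignore-unmatch terraform.tfstate terraform.tfstate.backup
}
```

Neither function rewrites history. Any credentials that were already committed should be treated as exposed and rotated.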

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: No State Locking

With no state locking configured, two engineers run terraform apply simultaneously against the same state. Nobody pushes back because the shortcut looks harmless in the moment. The result: state corruption, and orphaned resources that Terraform no longer tracks.

Footgun #3: No State Locking. Two engineers running apply simultaneously against the same state causes state corruption and leaves orphaned resources that Terraform no longer tracks.
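A minimal sketch of initializing an encrypted S3 backend with DynamoDB locking, assuming the configuration already declares an empty backend "s3" block and that the lock table has a string partition key named LockID. The bucket, key, region, and table names are hypothetical placeholders, as are the function names.

```shell
# Initialize the S3 backend with encryption and a DynamoDB lock table.
# Every value below is a placeholder; override via environment variables.
init_locked_backend() {
  terraform init -input=false \
    -backend-config="bucket=${STATE_BUCKET:-example-terraform-state}" \
    -backend-config="key=${STATE_KEY:-vpc/terraform.tfstate}" \
    -backend-config="region=${STATE_REGION:-us-east-1}" \
    -backend-config="encrypt=true" \
    -backend-config="dynamodb_table=${LOCK_TABLE:-example-terraform-locks}"
}

# Wait up to five minutes for a colleague's lock instead of failing at once.
# Locking stays enabled; this sketch never passes -lock=false.
apply_with_lock_wait() {
  terraform apply -lock-timeout=5m "$@"
}
```

With locking in place, the second engineer's apply waits for, or fails on, the first engineer's lock rather than writing to the same state concurrently.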

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Destroy on Wrong Workspace

An engineer runs terraform destroy thinking they are in dev, but the selected workspace is prod. The team has gotten away with similar shortcuts before, so nobody raises a flag. The result: the production VPC, subnets, and security groups are deleted.

Footgun #4: Destroy on Wrong Workspace. Running terraform destroy while believing the workspace is dev, when it is actually prod, deletes the production VPC, subnets, and security groups.
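A workspace confirmation check can be as small as a wrapper that compares the selected workspace with the one the engineer says they intend to destroy. This is a minimal sketch, assuming a POSIX shell; guarded_destroy is a hypothetical helper, not a Terraform command.

```shell
# Usage: guarded_destroy <expected-workspace> [extra terraform destroy args]
# Refuses to run unless the selected workspace matches the expected name.
guarded_destroy() {
  expected="$1"
  shift
  current="$(terraform workspace show)" || return 1
  if [ "$current" != "$expected" ]; then
    echo "Refusing to destroy: workspace is '$current', expected '$expected'." >&2
    return 1
  fi
  terraform destroy "$@"
}
```

Calling guarded_destroy dev while prod is selected exits with an error before terraform destroy is ever invoked. The check complements, and does not replace, keeping separate state files per environment.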

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Apply Without Reading Plan | Terraform replaces the production RDS instance, triggering a full database rebuild | Primer: Always review plan output; use plan files in CI |
| 2 | State File in Git | AWS access keys and database passwords are now in git history forever | Primer: Remote state backend with encryption |
| 3 | No State Locking | State corruption; orphaned resources that Terraform no longer tracks | Primer: DynamoDB state locking for S3 backend |
| 4 | Destroy on Wrong Workspace | Production VPC, subnets, and security groups are deleted | Primer: Workspace confirmation checks and separate state files per environment |

Damage Report

  • Downtime: 2-6 hours of infrastructure instability or drift
  • Data loss: Risk of data loss if stateful resources are replaced
  • Customer impact: Dependent services may experience outages or degraded performance
  • Engineering time to remediate: 12-24 engineer-hours for state recovery and drift remediation
  • Reputation cost: Infrastructure team credibility damaged; manual intervention required for future changes

What the Primer Teaches

  • Footgun #1: The primer's section on applying without reading the plan would have taught the engineer to always review plan output and to use plan files in CI.
  • Footgun #2: The primer's section on state files in git would have taught the team to use a remote state backend with encryption.
  • Footgun #3: The primer's section on state locking would have taught the team to use DynamoDB state locking with the S3 backend.
  • Footgun #4: The primer's section on destroying the wrong workspace would have taught the engineer to use workspace confirmation checks and separate state files per environment.

Cross-References