Anti-Primer: Terraform
Everything that can go wrong, will — and in this story, it does.
The Setup
A cloud team is migrating from manually provisioned AWS resources to Terraform. Friday afternoon, two engineers are both working on the same Terraform state to meet a Monday deadline for the new VPC setup.
The Timeline
Hour 0: Apply Without Reading Plan
An engineer runs `terraform apply -auto-approve` in CI without reviewing the plan output. The deadline is looming, and this seems like the fastest path forward. Terraform replaces the production RDS instance, triggering a full database rebuild.
Footgun #1: Apply Without Reading Plan. Running `terraform apply -auto-approve` in CI without reviewing the plan output lets Terraform replace the production RDS instance, triggering a full database rebuild.
Nobody notices yet. The engineer moves on to the next task.
Hour 1: State File in Git
An engineer commits `terraform.tfstate` because 'the whole team needs access'. Under time pressure, the team chooses speed over caution. The state file stores sensitive attribute values in plain text, so AWS access keys and database passwords are now in git history, where they persist even after the file is deleted.
Footgun #2: State File in Git. Committing `terraform.tfstate` because 'the whole team needs access' puts AWS access keys and database passwords in git history forever.
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: No State Locking
Two engineers run `terraform apply` simultaneously against the same state. Nobody pushes back because the shortcut looks harmless in the moment. With no lock to serialize the two runs, the result is state corruption and orphaned resources that Terraform no longer tracks.
Footgun #3: No State Locking. Two engineers running `terraform apply` simultaneously against the same state leads to state corruption and orphaned resources that Terraform no longer tracks.
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Destroy on Wrong Workspace
An engineer runs `terraform destroy` thinking they are in dev, but the selected workspace is prod. The team had gotten away with similar shortcuts before, so nobody raised a flag. The production VPC, subnets, and security groups are deleted.
Footgun #4: Destroy on Wrong Workspace. Running `terraform destroy` while believing the workspace is dev, when it is actually prod, deletes the production VPC, subnets, and security groups.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Apply Without Reading Plan | Terraform replaces the production RDS instance, triggering a full database rebuild | Primer: Always review plan output; use plan files in CI |
| 2 | State File in Git | AWS access keys and database passwords are now in git history forever | Primer: Remote state backend with encryption |
| 3 | No State Locking | State corruption; orphaned resources that Terraform no longer tracks | Primer: DynamoDB state locking for S3 backend |
| 4 | Destroy on Wrong Workspace | Production VPC, subnets, and security groups are deleted | Primer: Workspace confirmation checks and separate state files per environment |
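
The prevention in row 1, using plan files in CI, can be sketched in shell roughly as follows. This is a minimal sketch, not the primer's own pipeline: it assumes `terraform` is on the PATH, and the function names `count_destructive_changes` and `gated_apply` and the default file name `tfplan` are illustrative. For brevity it matches the human-readable phrases `must be replaced` and `will be destroyed` in the rendered plan; a stricter gate would inspect the output of `terraform show -json` instead.

```shell
# Sketch: apply only a saved plan, and stop if it destroys or replaces anything.
# Assumes terraform is on the PATH; function and file names are illustrative.

# Count plan lines announcing a destroy or a replacement (text read from stdin).
count_destructive_changes() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -cE 'must be replaced|will be destroyed' || true
}

gated_apply() {
  plan_file="${1:-tfplan}"

  # Save the plan so the apply step runs exactly what was inspected.
  terraform plan -input=false -out="$plan_file" || return 1

  # Render the saved plan; bail out if it cannot be read.
  plan_text="$(terraform show -no-color "$plan_file")" || return 1
  destructive="$(printf '%s\n' "$plan_text" | count_destructive_changes)"

  if [ "$destructive" -gt 0 ]; then
    echo "Plan destroys or replaces $destructive resource(s); needs human review." >&2
    return 2
  fi

  terraform apply -input=false "$plan_file"
}
```

In the Hour 0 scenario, the RDS replacement would appear in the plan as a `must be replaced` line, so `gated_apply` would return 2 and leave the apply to someone who has read the plan.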
Damage Report
- Downtime: 2-6 hours of infrastructure instability or drift
- Data loss: Risk of data loss if stateful resources are replaced
- Customer impact: Dependent services may experience outages or degraded performance
- Engineering time to remediate: 12-24 engineer-hours for state recovery and drift remediation
- Reputation cost: Infrastructure team credibility damaged; manual intervention required for future changes
What the Primer Teaches
- Footgun #1: The primer's section on applying without reading the plan teaches: always review plan output, and use plan files in CI.
- Footgun #2: The primer's section on state files in git teaches: use a remote state backend with encryption.
- Footgun #3: The primer's section on state locking teaches: use DynamoDB state locking with the S3 backend.
- Footgun #4: The primer's section on destroying the wrong workspace teaches: add workspace confirmation checks, and keep separate state files per environment.
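
The workspace confirmation check from Footgun #4 might look like the wrapper below. It is a sketch under assumptions: `guarded_destroy` is a hypothetical name, the protected workspace names `prod` and `production` are guesses at a naming scheme, and `terraform` is assumed to be on the PATH. The operator names the workspace they believe they are in, and the wrapper compares that with `terraform workspace show` before doing anything.

```shell
# Sketch: refuse to destroy unless the active workspace is the one the
# operator named, and never destroy a protected workspace from this wrapper.
# The function name and the protected workspace names are assumptions.
guarded_destroy() {
  expected_ws="$1"
  if [ -z "$expected_ws" ]; then
    echo "usage: guarded_destroy <workspace>" >&2
    return 64
  fi

  # Ask Terraform which workspace is actually selected.
  actual_ws="$(terraform workspace show)" || return 1

  if [ "$actual_ws" != "$expected_ws" ]; then
    echo "Refusing: active workspace is '$actual_ws', expected '$expected_ws'." >&2
    return 2
  fi

  case "$actual_ws" in
    prod|production)
      echo "Refusing: '$actual_ws' is protected from destroy in this wrapper." >&2
      return 3
      ;;
  esac

  # No -auto-approve here: Terraform still prompts for confirmation.
  terraform destroy
}
```

Called as `guarded_destroy dev` while the active workspace is prod, the wrapper stops with exit status 2 instead of deleting the production VPC. It complements, rather than replaces, keeping separate state files per environment.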
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice