The Terraform State Disaster
Category: The Migration
Domains: terraform, infrastructure-as-code
Read time: ~5 min
Setting the Scene
I was the platform engineer at a fintech startup that had grown from 5 to 80 engineers in two years. Our AWS infrastructure had been built entirely through the console — click-ops at scale: 3 VPCs, 14 subnets, 8 RDS instances, 22 EC2 instances, 6 ALBs, and about 40 security groups with names like "sg-quick-fix-prod" and "allow-everything-temp" (that one was 18 months old). The mandate was clear: get everything into Terraform so we could stop being terrified of the AWS console.
I estimated 3 weeks. It took 9 weeks and I almost deleted a production database along the way.
What Happened
Week 1 — I wrote Terraform configurations for the VPCs, subnets, and route tables. Used terraform import to bring them into state. VPC imports went fine. I got cocky. Started importing resources in batches of 10-15 at a time using a bash loop: for id in $(cat resource_ids.txt); do terraform import aws_security_group.sg[$id] $id; done. This was a mistake.
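Worth noting for anyone repeating this today: Terraform 1.5 and later support declarative import blocks, which surface every pending import in terraform plan before state is touched, so each one can be reviewed like any other change. A sketch with placeholder names and IDs (not my real resources):

```hcl
# Sketch only — resource name and IDs are placeholders.
# The import block stages the import; `terraform plan` shows it for
# review, and nothing enters state until `terraform apply`.
import {
  to = aws_security_group.app
  id = "sg-0example"
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = "vpc-0example" # placeholder
}
```

Reviewing each planned import before applying is exactly the discipline the bash loop skipped.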
Week 2 — I imported the RDS instances. One of them, our production primary, had a parameter group with 47 custom settings. I wrote the Terraform aws_db_parameter_group resource but missed 3 parameters. When I ran terraform plan, it showed those 3 parameters as "will be destroyed." I almost ran terraform apply on autopilot. I caught it because the plan was 400 lines long and I was scrolling slowly. If I'd applied, it would have restarted the production database to change the parameter group. During business hours. On a Tuesday.
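The shape of the problem, as a hedged sketch (name, engine family, and parameter values are invented for illustration): any custom setting missing from the resource shows up in terraform plan as a removal, which resets it to the engine default — and applying that change can restart the instance.

```hcl
# Illustrative sketch — not the real parameter group. Every custom
# setting on the live group must be declared here; anything omitted
# is planned as "will be destroyed" (reset to the engine default).
resource "aws_db_parameter_group" "prod_primary" {
  name   = "prod-primary"   # placeholder name
  family = "postgres13"     # placeholder engine family

  parameter {
    name  = "max_connections" # one of the 47 custom settings
    value = "500"             # invented value
  }

  # ...the other 46 custom parameters must all be listed as well
}
```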
Week 3 — State file corruption. I was running imports from my laptop (no remote state yet, because I was "going to set that up after the import"). My laptop crashed mid-import. The local terraform.tfstate file was half-written. terraform state list returned a partial list. Some resources were in state but their attributes were empty. I had a backup from the night before, but two days of imports were lost. I redid them — one at a time, this time.
Week 4 — I set up remote state in S3 with DynamoDB locking. Migrated the local state with terraform state push. Immediately felt safer. Then a colleague, who was also doing imports in a different module, forgot about the lock and ran terraform import against a different state file. We now had two state files claiming to manage the same ALB. I discovered this when terraform plan in one module showed the ALB as "needs to be created" and the other showed it as "up to date."
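The Week 4 backend configuration looked roughly like this (bucket, table, and region are placeholders, not our real names). The DynamoDB table provides the lock that would have prevented the Week 3 corruption — but a lock can't stop two modules with different state keys from both importing the same resource:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state" # placeholder bucket
    key            = "networking/terraform.tfstate"
    region         = "us-east-1"        # placeholder region
    dynamodb_table = "terraform-locks"  # placeholder lock table
    encrypt        = true
  }
}
```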
Week 5-6 — Untangling. I used terraform state rm in the wrong module and terraform import in the right one. For each of the 6 duplicated resources, I had to figure out which module should own it. I built a spreadsheet — Resource ID, current state file, correct state file, import command, verification command. Very manual. Very slow.
Week 7-8 — The security groups. Forty of them, with rules referencing each other in circular dependencies. Terraform doesn't handle circular security group references natively. I had to use aws_security_group_rule as separate resources instead of inline rules, which meant rewriting 200 lines of HCL. Every terraform plan was terrifying because a wrong diff meant opening or closing ports on production.
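The rewrite from inline rules to standalone rule resources looks roughly like this (group names and ports are illustrative). Inline ingress blocks that reference each other create a cycle Terraform can't order; standalone aws_security_group_rule resources break it, because both groups are created first and the rules attach afterwards:

```hcl
# Illustrative sketch — names and ports are not the real ones.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = "vpc-0example" # placeholder
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = "vpc-0example" # placeholder
}

# db accepts Postgres traffic from app...
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}

# ...and app accepts traffic back from db — circular, but now resolvable.
resource "aws_security_group_rule" "app_from_db" {
  type                     = "ingress"
  from_port                = 8443 # illustrative port
  to_port                  = 8443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.db.id
}
```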
Week 9 — Final validation. I wrote a Python script that compared every AWS resource (via boto3) against the Terraform state (via terraform show -json). Found 4 resources that existed in AWS but not in state, and 2 in state that no longer existed in AWS. Cleaned up both. Ran terraform plan one last time. "No changes. Your infrastructure matches the configuration." I almost cried.
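The core of that reconciliation script is just set arithmetic. A minimal sketch — the boto3 calls and the terraform show -json parsing that would actually produce the two ID sets are omitted, and the function name is my own:

```python
def reconcile(aws_ids, state_ids):
    """Compare resource IDs seen in AWS against IDs tracked in Terraform state.

    aws_ids:   IDs collected from AWS (e.g. via boto3 describe_* calls)
    state_ids: IDs extracted from `terraform show -json` output
    """
    aws_ids, state_ids = set(aws_ids), set(state_ids)
    return {
        # exists in AWS but unmanaged: candidates for `terraform import`
        "in_aws_not_in_state": sorted(aws_ids - state_ids),
        # tracked but gone from AWS: candidates for `terraform state rm`
        "in_state_not_in_aws": sorted(state_ids - aws_ids),
    }

drift = reconcile({"sg-aaa", "sg-bbb", "vpc-ccc"}, {"sg-aaa", "vpc-ccc", "i-ddd"})
# drift["in_aws_not_in_state"] == ["sg-bbb"], drift["in_state_not_in_aws"] == ["i-ddd"]
```

Running something like this after every batch — rather than once at the end — is the cheapest possible drift detector.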
The Moment of Truth
Week 3, staring at a corrupted state file on my laptop, knowing that if I'd been running against remote state with locking, this wouldn't have happened. The state file is Terraform's brain. Without it, Terraform doesn't know what's real and what isn't. And I'd been treating it like a disposable local file.
The Aftermath
Nine weeks later, 100% of our infrastructure was in Terraform. We had 4 state files (networking, compute, data, security), all in S3 with DynamoDB locking. terraform plan ran in CI on every PR. Nobody touched the AWS console for infrastructure changes anymore. The security group named "allow-everything-temp" got properly scoped. We found 3 other "temporary" rules that had been open for over a year.
The Lessons
- Import one resource at a time: Batch imports are fast but dangerous. Each import needs a terraform plan review before the next one. One wrong attribute in a batch can cascade.
- Lock your state file: Remote state with locking should be the FIRST thing you set up, not the last. Local state on a laptop is a disaster waiting for a power failure.
- Have a manual inventory as backup: A spreadsheet mapping AWS Resource ID to Terraform resource address saved me twice. When state gets confusing, a human-readable inventory is your safety net.
What I'd Do Differently
I'd set up remote state with locking before importing a single resource. I'd use terraform plan -target=<resource> after every import to verify just that resource. And I'd write the reconciliation script (comparing AWS reality to Terraform state) on day 1, not week 9. Knowing what's drifted at any point during the import process would have saved a week of detective work.
The Quote
"Terraform state is not a file. It's a contract between you and reality. Treat it like one."
Cross-References
- Topic Packs: Terraform, Terraform Deep Dive, AWS IAM