
The Terraform Plan That Would Have Destroyed Prod

Category: The Close Call · Domains: terraform, cloud-ops · Read time: ~5 min


Setting the Scene

We were running Terraform 1.5 managing about 280 AWS resources: VPCs, RDS instances, ECS clusters, ALBs, S3 buckets — the full stack for a B2B SaaS product with 2,000 paying customers. Our CI pipeline ran terraform plan on every PR and posted the output as a comment. The rule was simple: review the plan, approve the PR, and the merge triggers terraform apply.

Except nobody was really reading the plans anymore. They were long, repetitive, and usually just said "No changes."

What Happened

On a Thursday morning, I opened a PR to add a new CloudWatch alarm. One resource. Five lines of HCL. Totally routine.

The CI bot posted the plan output. It was 847 lines long. I almost scrolled past it. But the summary line at the bottom caught my eye: "Plan: 43 to add, 0 to change, 43 to destroy."

Forty-three resources to destroy. For a PR that added one CloudWatch alarm.

I opened the full plan and my blood ran cold. Terraform wanted to destroy and recreate our production RDS cluster (3TB of customer data), both ECS services, the primary ALB, and 38 other resources. Every resource that someone had modified through the AWS console in the last six months was slated for destroy-and-recreate, because the live infrastructure no longer matched the configuration in the repo.

The drift had accumulated silently. A senior engineer had manually scaled up the RDS instance class during a performance incident three months ago — db.r5.large to db.r5.2xlarge — and never updated the Terraform code. Someone else had added a security group rule through the console during an emergency. Another person had changed an S3 bucket policy directly. Each change was small and justified in the moment. Together, they were a time bomb.
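Drift like this is easy to reason about as a diff between what the code declares and what is actually running. A minimal sketch in Python (the attribute names and values are hypothetical examples, not real Terraform internals):

```python
# Minimal illustration of configuration drift: the values the code declares
# vs. what the infrastructure actually looks like after months of console
# changes. Attribute names and values here are hypothetical examples.

def find_drift(declared: dict, actual: dict) -> dict:
    """Return {attribute: (declared, actual)} for every mismatch."""
    return {
        key: (declared.get(key), actual.get(key))
        for key in declared.keys() | actual.keys()
        if declared.get(key) != actual.get(key)
    }

declared = {"instance_class": "db.r5.large", "allocated_storage": 3000}
actual = {"instance_class": "db.r5.2xlarge", "allocated_storage": 3000}

print(find_drift(declared, actual))
# {'instance_class': ('db.r5.large', 'db.r5.2xlarge')}
```

Terraform resolves exactly this kind of mismatch at plan time, and when a changed attribute forces replacement, the resolution is a destroy-and-recreate.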

If I'd done what we usually did — glance at the plan, approve, merge — Terraform would have destroyed and recreated our production database. Even with RDS snapshots, recovery would have taken 4-6 hours. During business hours.

The Moment of Truth

I caught it because I read one line: the summary. That's it. If the plan had said "1 to add, 0 to change, 0 to destroy," I would have approved it in 30 seconds. The only thing between us and a full production outage was the fact that I happened to notice a number that didn't match my expectations.

The Aftermath

We spent the next two weeks reconciling every piece of drift: running terraform import for resources that had been created by hand, updating HCL to match the real infrastructure, and re-running terraform plan until it showed zero changes. Then we implemented three guardrails:

  1. CI now fails any terraform plan whose destroy count is greater than zero, unless the PR carries an explicit ALLOW_DESTROY=true label.
  2. AWS Config rules detect console changes to Terraform-managed resources.
  3. Spacelift's drift detection runs hourly and alerts on any state mismatch.
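The destroy-count gate can be built on terraform show -json, which emits the plan in a machine-readable form with an actions list per resource change. A hedged sketch of the CI check in Python (the sample plan is abbreviated; a real one has many more fields per change):

```python
import json

def count_destroys(plan_json: str) -> int:
    """Count planned changes whose action list includes a delete.

    Catches plain destroys (["delete"]) as well as destroy-and-recreate
    (["delete", "create"] / ["create", "delete"]).
    """
    plan = json.loads(plan_json)
    return sum(
        1
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    )

# Abbreviated example of what `terraform show -json tfplan` emits:
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_cloudwatch_metric_alarm.new",
         "change": {"actions": ["create"]}},
        {"address": "aws_db_instance.prod",
         "change": {"actions": ["delete", "create"]}},
    ]
})

destroys = count_destroys(sample)
print(destroys)  # 1

# The CI step simply fails when destroys > 0 and the PR lacks the label:
if destroys > 0:
    print("Blocked: add the ALLOW_DESTROY=true label to proceed.")
```

The same summary numbers Terraform prints ("43 to add, 0 to change, 43 to destroy") are derivable from this JSON, which is what makes the gate mechanical instead of relying on a human scrolling to line 847.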

The Lessons

  1. Always read the plan: terraform plan exists because terraform apply is irreversible for stateful resources. If you're not reading the plan, you're not using Terraform — you're playing Russian roulette with infrastructure.
  2. Never auto-apply: Any pipeline that runs terraform apply without a human reviewing the plan output is an outage waiting to happen. Auto-apply is for demos, not production.
  3. Drift detection should be automated: Manual console changes are inevitable during incidents. The question isn't whether drift will happen — it's whether you'll detect it before it causes a destroy/recreate cycle.

What I'd Do Differently

I'd set up OPA (Open Policy Agent) policies in the pipeline from day one. A policy like "no plan may destroy more than 3 resources without VP approval" would have caught this mechanically instead of relying on a human reading line 847 of a CI comment.
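For flavor, here is the shape of that rule sketched in Python rather than Rego (the real guard would be an OPA policy evaluated against the JSON plan; the threshold of 3 comes from the rule above, while the vp_approved flag and function name are illustrative):

```python
def policy_violations(plan: dict, vp_approved: bool,
                      max_destroys: int = 3) -> list[str]:
    """Mirror of the described OPA rule: deny any plan that destroys more
    than max_destroys resources unless explicit approval is attached."""
    destroys = [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]
    if len(destroys) > max_destroys and not vp_approved:
        return [f"plan destroys {len(destroys)} resources "
                f"(limit {max_destroys}): {', '.join(destroys)}"]
    return []

# Five planned destroys, no approval -> one violation; with approval -> none.
plan = {"resource_changes": [
    {"address": f"aws_instance.web[{i}]", "change": {"actions": ["delete"]}}
    for i in range(5)
]}
print(policy_violations(plan, vp_approved=False))
print(policy_violations(plan, vp_approved=True))   # []
```

The point is that the policy runs on every plan, every time, with no dependence on anyone's attention span.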

The Quote

"Terraform plan is not a formality. It's the last conversation you have with your infrastructure before you change it forever."
