Postmortem: No Review Gate on Terraform Destroy Leads to Wrong Account Teardown

| Field | Value |
| --- | --- |
| ID | PM-012 |
| Date | 2025-05-13 |
| Severity | SEV-2 |
| Duration | 1h 46m (detection to resolution) |
| Time to Detect | 6m |
| Time to Mitigate | 1h 46m |
| Customer Impact | Zero direct customer impact (staging environment); all 3 internal teams sharing staging lost environment access for roughly 1 hour 52 minutes |
| Revenue Impact | None (staging only); indirect: ~$18,000 in engineering time for manual infrastructure reconstruction |
| Teams Involved | Infrastructure Engineering, Platform Engineering, Data Platform, QA Engineering, Security |
| Postmortem Author | Priya Nambiar |
| Postmortem Date | 2025-05-16 |

Executive Summary

On May 13, 2025, an infrastructure engineer ran terraform destroy to tear down a personal dev environment while AWS_PROFILE was set to staging-us-east rather than dev-us-east. The command destroyed the VPC, all subnets, NAT gateways, and internet gateway of the shared staging account used by three internal teams. Staging services lost network connectivity within about a minute; monitoring detected the outage six minutes after the destroy began, once existing connections had drained. Recreating the networking, including re-applying Terraform for all dependent resources, took roughly 1 hour 52 minutes from destruction to full recovery. The incident revealed that no policy or process gate required plan review before a destroy operation, and that nearly identical profile names for dev and staging created a chronic confusion risk; the same naming pattern existed for the production account.

Timeline (All times UTC)

| Time | Event |
| --- | --- |
| 14:02:17 | Engineer Sebastián Morales runs terraform destroy -auto-approve from the infra/envs/dev-network directory while logged in with AWS_PROFILE=staging-us-east |
| 14:02:18 | Terraform begins destroying resources; first targets are route table associations (no immediate impact) |
| 14:02:45 | NAT gateways destroyed; private-subnet services in staging begin losing outbound connectivity |
| 14:03:10 | Internet gateway destroyed; public-facing staging load balancers lose inbound connectivity |
| 14:03:22 | VPC and all 6 subnets destroyed; all staging resources (EC2, RDS, EKS node groups) are now network-isolated |
| 14:03:30 | Sebastián observes Terraform destroying resources he does not recognize; realizes the wrong account is targeted |
| 14:03:35 | Sebastián presses Ctrl+C, but destruction is already complete (all 47 resources destroyed in 78 seconds) |
| 14:08:00 | QA Engineering's CI pipeline begins failing: "Unable to reach staging API endpoint" |
| 14:08:44 | PagerDuty alert fires: "staging environment health check failing" |
| 14:09:10 | On-call SRE Fatima Al-Rashidi acknowledges; confirms staging is fully unreachable |
| 14:10:00 | Sebastián reports to Slack #incidents channel with the root cause: wrong AWS profile |
| 14:11:30 | Infrastructure Engineering lead Marcus Chen joins the war room; assesses scope of destruction |
| 14:15:00 | Sebastián's full Terraform destroy log shared; team confirms all 47 networking resources are gone |
| 14:18:00 | Security team notified; begins reviewing CloudTrail for any lateral actions taken during the 78-second window |
| 14:22:00 | Decision: recreate networking from scratch using Terraform apply from the canonical staging-network module |
| 14:25:00 | Marcus checks Terraform state: staging-network.tfstate in S3 is intact (the destroy updated it correctly) |
| 14:27:00 | terraform apply begins for infra/envs/staging-network; estimated 15–20 minutes for NAT gateways |
| 14:44:30 | VPC, subnets, internet gateway, and route tables recreated; NAT gateway provisioning still in progress |
| 15:05:00 | NAT gateways fully provisioned and associated; route tables updated |
| 15:08:00 | EKS nodes begin reconnecting; some pods in CrashLoopBackOff due to lost connection state |
| 15:15:00 | terraform apply run for all dependent stacks (EKS add-ons, RDS security groups, ALB) to reconcile state |
| 15:55:00 | All staging services healthy; CI pipelines unblocked; incident declared resolved |
| 16:30:00 | Security confirms no unauthorized actions observed in CloudTrail during the incident window |

Impact

Customer Impact

No production customer traffic was affected. This was entirely contained to the staging environment.

Internal Impact

  • QA Engineering: staging CI/CD pipeline blocked for roughly 1 hour 52 minutes; 4 engineers unable to run integration tests; 3 release candidates blocked from staging promotion
  • Data Platform: staging data pipeline runs failed; 2 engineers spent roughly 40 minutes diagnosing failures before the incident was communicated to them
  • Platform Engineering: staging feature deployments blocked; 1 sprint story missed its staging validation deadline
  • Infrastructure Engineering: Marcus Chen + 2 engineers spent 3.5 hours on reconstruction and validation
  • Total engineering cost estimate: ~24 engineering-hours (reconstruction + downstream team disruption)

Data Impact

No data loss in staging databases. RDS instances survived because they are not dependent on VPC for persistence — they became network-isolated but their underlying data was intact. All staging data was accessible again once networking was restored.

Root Cause

What Happened (Technical)

Sebastián Morales was cleaning up a dev environment he had provisioned two weeks earlier. His local ~/.aws/credentials file contained four profiles: dev-us-east, staging-us-east, prod-us-east, and dev-eu-west. He intended to run the destroy against dev-us-east but had set AWS_PROFILE=staging-us-east in his shell session earlier that day when investigating a staging issue, and had not reset it.

The Terraform working directory (infra/envs/dev-network) had a backend.tf pointing to an S3 bucket for state, but that bucket is shared across environments and the key (network/terraform.tfstate) was the same string used in both dev and staging module configurations. Because Sebastián had run terraform init in this directory several weeks prior when his profile was correctly set to dev-us-east, the local .terraform/ directory cached the correct state backend reference. However, the AWS provider used the current AWS_PROFILE environment variable, not the cached init profile — meaning it targeted the staging account for the actual API calls while reading what appeared to be the correct dev state.

Critically, terraform destroy -auto-approve skips the confirmation prompt entirely. The -auto-approve flag was used because Sebastián had automated the destroy step in a local cleanup script, and he ran the script without inspecting which profile was active.

The destroy completed in 78 seconds because Terraform parallelizes resource deletion aggressively. By the time Sebastián recognized that the wrong account was targeted (at approximately T+73 seconds) and interrupted the run, all 47 resources had already been destroyed. No Terraform Sentinel policy or AWS Service Control Policy (SCP) existed to block or require approval for VPC or subnet deletion in the staging account.
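A pre-flight identity check of the kind action item PM-012-06 proposes would have caught the mismatch before any API call was made. A minimal sketch, assuming a wrapper script convention; the function name, account IDs, and the idea of a checked-in expected-account file are illustrative, not existing tooling:

```shell
#!/usr/bin/env bash
# Sketch of a pre-flight account guard for Terraform wrapper scripts.
# Account IDs, the function name, and the expected-account convention
# are illustrative assumptions, not existing tooling.
set -euo pipefail

verify_account() {
  # Refuse to proceed unless the active credentials resolve to the
  # account this working directory is expected to manage.
  local expected="$1" actual="$2"
  if [ "$expected" != "$actual" ]; then
    echo "ABORT: active account $actual does not match expected $expected" >&2
    return 1
  fi
}

# In real use, the actual account would come from STS, e.g.:
#   actual=$(aws sts get-caller-identity --query Account --output text)
# and the expected value from a file committed next to the Terraform code.
```

Because the check compares account IDs rather than profile names, it fails fast even when a stale AWS_PROFILE is silently inherited from an earlier shell session, which is exactly the failure mode seen here.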

Contributing Factors

  1. Similar AWS profile names without account-level safeguards: dev-us-east and staging-us-east differ by only one word and are easy to confuse in a shell session that persists for hours. No AWS account alias was configured that would have displayed the account name clearly in Terraform's output before destruction began. AWS STS get-caller-identity output was not required or checked before running the destroy.

  2. -auto-approve in a cleanup script removes the last human checkpoint: The terraform destroy confirmation prompt exists as a deliberate speed bump. Sebastián's cleanup script bypassed this prompt unconditionally. No policy or code review requirement existed for local cleanup scripts, so this pattern had spread to several engineers.

  3. No Terraform Sentinel or OPA policy preventing VPC destruction in non-dev accounts: Terraform Cloud Sentinel policies or Open Policy Agent rules could have detected that the targeted resources (VPC, subnets) are classified as shared infrastructure and required an additional approval step or blocked the operation outright. No such policies existed for the staging account. The same gap exists for production.
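Contributing factor 1 and action item PM-012-01 both point at an account-level policy layer. A minimal sketch of the kind of SCP involved, assuming the infra-network-admin principal tag named in the action items; the Sid and tag value are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNetworkTeardownWithoutAdminTag",
      "Effect": "Deny",
      "Action": [
        "ec2:DeleteVpc",
        "ec2:DeleteSubnet",
        "ec2:DeleteInternetGateway",
        "ec2:DeleteNatGateway"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": { "aws:PrincipalTag/infra-network-admin": "true" }
      }
    }
  ]
}
```

Attached to the staging account, a policy like this would likely have stopped the run with an explicit deny at the first NAT gateway deletion, leaving the VPC and subnets intact regardless of which profile was active.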

What We Got Lucky About

  1. This was the staging account and not production. The exact same naming confusion (staging-us-east vs prod-us-east) exists in the same credentials file. If Sebastián had set his profile to prod-us-east during a previous session, the impact would have been a full production networking outage affecting all customers.
  2. The Terraform state file in S3 was updated correctly by the destroy operation. This meant the reconstruction terraform apply had accurate state to work from and did not require manual state surgery. If the destroy had been interrupted mid-execution (which Sebastián attempted but failed to achieve), the state file would have been partially updated and reconstruction would have been significantly more complex.

Detection

How We Detected

The incident was detected by a synthetic monitoring check that issues HTTP GET requests to the staging environment's load balancer every 30 seconds. The check began failing at 14:08:00 (approximately 5 minutes after the destroy completed) and triggered a PagerDuty alert at 14:08:44.

Why We Didn't Detect Sooner

The 5-minute detection gap occurred because established TCP connections survived the gateway deletions: services with open connections kept responding until those connections drained, so the synthetic check did not begin failing immediately. The monitor's 30-second polling interval added at most a further half minute of delay. Additionally, Sebastián did not report immediately; he spent roughly six minutes reviewing the destroy log and confirming what had happened before posting to #incidents.

Response

What Went Well

  1. Sebastián self-reported to #incidents quickly and with accurate root cause information, which saved the on-call SRE from spending time on root cause analysis.
  2. The Terraform state file was intact and accurate, enabling straightforward reconstruction via terraform apply rather than manual resource creation.
  3. The Security team was notified proactively and completed their CloudTrail review within 2 hours, confirming no unauthorized activity had occurred during the incident window.

What Went Poorly

  1. The Data Platform team was not notified of the incident for approximately 40 minutes after the detection alert fired, causing two engineers to spend significant time debugging their own pipeline failures without knowing the root cause. The incident communication checklist did not include downstream teams using the staging environment.
  2. NAT gateway provisioning takes 15–20 minutes and cannot be meaningfully accelerated; this is an AWS-side constraint. It was not known to most engineers on the call, leading to repeated "is it done yet?" interruptions and confusion about whether the apply had stalled.
  3. No runbook existed for "staging networking destroyed — how to reconstruct." Marcus Chen had to work from memory and the Terraform module documentation, adding unnecessary cognitive load during a high-stress recovery.

Action Items

| ID | Action | Priority | Owner | Status | Due Date |
| --- | --- | --- | --- | --- | --- |
| PM-012-01 | Add AWS SCP to staging and production accounts: deny ec2:DeleteVpc, ec2:DeleteSubnet, ec2:DeleteInternetGateway, ec2:DeleteNatGateway unless caller has the infra-network-admin IAM tag | P0 | Infrastructure Engineering (Marcus Chen) | In Progress | 2025-05-20 |
| PM-012-02 | Implement Terraform Sentinel policy requiring manual approval (no -auto-approve) for destroy operations targeting resources tagged tier=shared or environment!=dev | P0 | Infrastructure Engineering (Marcus Chen) | In Progress | 2025-05-23 |
| PM-012-03 | Rename AWS profiles to include account IDs (dev-123456789, staging-987654321); update all team documentation and onboarding guides | P1 | Platform Engineering (Priya Nambiar) | Planned | 2025-05-30 |
| PM-012-04 | Write runbook runbook-staging-network-reconstruction.md documenting step-by-step recovery from partial or complete VPC destruction | P1 | Infrastructure Engineering (Sebastián Morales) | Planned | 2025-05-23 |
| PM-012-05 | Add staging environment stakeholders to the PagerDuty incident notification policy; all SEV-2+ incidents in staging should notify QA and Data Platform leads | P2 | SRE (Fatima Al-Rashidi) | Planned | 2025-05-21 |
| PM-012-06 | Add aws sts get-caller-identity output validation to all Terraform wrapper scripts and CI jobs; fail fast if the account ID does not match the expected value | P1 | Infrastructure Engineering (Marcus Chen) | Planned | 2025-05-30 |

Lessons Learned

  1. Profile name similarity is a chronic risk that account-level controls must cover: Human attention is unreliable under time pressure. Profile names that differ by one word will eventually be confused. The correct defense is a policy layer (SCP, Sentinel) that enforces intent independent of which credentials are active — not relying on engineers to check before acting.

  2. -auto-approve is a footgun that belongs only in CI pipelines: The confirmation prompt in terraform destroy exists to give engineers a final moment of verification. Removing it from local workflows eliminates the only remaining safety net after a credentials mistake. Local scripts that wrap Terraform should require explicit confirmation or read account IDs from a config file and validate against the active profile.

  3. Recovery time for cloud networking is bounded by AWS provisioning, not by team speed: NAT gateways take 15–20 minutes to provision regardless of how many engineers are working the incident. Teams should have realistic expectations for infrastructure reconstruction timelines, and runbooks should include these estimates so incident commanders can set accurate ETAs.
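One way to restore the checkpoint that lesson 2 describes, without reintroducing a reflexive "y" keystroke, is to make the operator retype the target environment name. A hypothetical sketch for a local wrapper; the function name and environment strings are illustrative:

```shell
#!/usr/bin/env bash
# Illustrative confirmation gate for local destroy wrappers: the operator
# must retype the environment name exactly before the destroy proceeds.
# Function name and environment strings are assumptions for this sketch.
set -euo pipefail

confirm_destroy() {
  local target="$1" answer
  read -r -p "Type '$target' to confirm destroy: " answer
  [ "$answer" = "$target" ]
}

# Example wiring (the terraform invocation is shown for context only):
#   confirm_destroy "dev-network" && terraform destroy
```

Typing the environment name forces the operator to state which environment they believe they are destroying, which is precisely the belief that was wrong in this incident; a generic yes/no prompt would not have surfaced the mismatch.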

Cross-References

  • Failure Pattern: Wrong Environment Targeting; Missing Policy Gate on Destructive Operations
  • Topic Packs: terraform-operations (destroy safety, Sentinel policies, state management), aws-iam (SCPs, account organization), incident-response (communication, stakeholder notification)
  • Runbook: runbook-staging-network-reconstruction.md (to be created per PM-012-04), runbook-aws-profile-verification.md
  • Decision Tree: Staging unreachable → check load balancer health → check VPC/subnet existence → if networking destroyed, initiate Terraform reconstruction from canonical module; notify all staging stakeholders immediately