Postmortem: Production Database Deleted by Terraform Apply on Wrong Workspace

ID: PM-001
Date: 2025-03-14
Severity: SEV-1
Duration: 2h 0m (detection at 14:22 to resolution at 16:22)
Time to Detect: 3m
Time to Mitigate: 2h 0m
Customer Impact: 100% of write operations failed from 14:23 until services began reconnecting at 15:58 (all traffic nominal by 16:22); ~47,000 active users could not place orders, update profiles, or complete checkout. Read operations degraded after read replicas began returning stale data at T+16m and went fully unavailable at T+33m.
Revenue Impact: ~$150,000 estimated (2.0h × avg $74k/h transaction volume; excludes SLA credits)
Teams Involved: Platform Engineering, Database Reliability (DRE), Site Reliability (SRE), Customer Success, Leadership
Postmortem Author: Priya Anand (Staff SRE)
Postmortem Date: 2025-03-18

Executive Summary

On 2025-03-14 at 14:22 UTC, a Platform Engineering engineer ran terraform apply in the production Terraform workspace while intending to apply changes to the development workspace. The apply destroyed the primary PostgreSQL RDS cluster (nexusdb-prod-primary) along with its associated parameter groups and subnet groups. Write traffic immediately began failing across all services backed by that cluster. Although a snapshot existed from 47 minutes prior, restoring the cluster and re-pointing all dependent services took two hours. The incident exposed several systemic gaps: no workspace indicator in the terminal prompt, no approval gate for production applies, and RDS deletion protection that had been disabled during a maintenance window two weeks earlier and never re-enabled.


Timeline (All times UTC)

14:19 Rafael Ochoa (Platform Engineering) opens two terminal tabs — one for tf-dev, one for tf-prod — to compare outputs. Begins intended terraform plan in dev.
14:21 Rafael switches tabs, believing he is in tf-dev. Runs terraform apply. Terminal prompt shows ~/infra/terraform with no workspace indicator.
14:22 Terraform begins destroying nexusdb-prod-primary (PostgreSQL 14.8, 12xlarge, Multi-AZ). No confirmation dialog for production workspaces. Apply completes in 41 seconds.
14:22 PagerDuty fires: [CRITICAL] RDS cluster nexusdb-prod-primary in state DELETING. On-call SRE (Tomoko Watase) acknowledges.
14:23 Application error rate climbs to 100% on all write endpoints. Order service, user service, and inventory service emit could not connect to server: Connection refused.
14:25 Tomoko pages DRE on-call (Marcus Ellroy). SEV-1 declared. War room opened in Slack #incident-live.
14:27 Rafael realizes the mistake, reports to war room. Terraform state is confirmed: nexusdb-prod-primary destroy is complete. RDS cluster is gone.
14:29 Marcus begins search for most recent automated snapshot. AWS Console shows snapshot rds:nexusdb-prod-primary-2025-03-14-1335 (taken at 13:35 UTC — 47 minutes prior).
14:33 DRE initiates restore from snapshot to nexusdb-prod-primary-restored. Estimated restore time: 35–45 minutes for the 4.2 TB volume.
14:38 Read replicas (nexusdb-prod-replica-1, -2, -3) begin returning stale data. Services receive replication lag warnings but continue serving reads.
14:55 Read replicas lose cluster membership and begin rejecting connections because their replication source is gone. Read operations now also failing. Full database blackout.
15:01 Engineering leadership (CTO Amara Osei) joins war room. Customer Success begins composing customer notification.
15:09 Attempted workaround rejected: routing prod traffic to the dev DB (a read-only copy) fails on schema version mismatch (dev is 3 migrations behind prod).
15:22 Snapshot restore at 60% complete per AWS Console estimate. DRE prepares connection string update for all services.
15:44 Restore completes. Cluster endpoint: nexusdb-prod-primary-restored.cluster-cxxxxxx.us-east-1.rds.amazonaws.com.
15:47 DRE updates SSM Parameter Store with new endpoint. Begins rolling restart of order service (18 pods).
15:58 Order service reconnects successfully. Writes confirmed. Other services begin rolling restarts in sequence: user service, inventory service, payment service.
16:02 All write endpoints recovering. Error rate drops from 100% → 12% as pods cycle.
16:09 DRE re-enables RDS deletion protection on restored cluster. Platform Engineering revokes terraform apply IAM permissions for prod workspace pending policy review.
16:22 Read replicas re-provisioned and replication lag reaches zero. All traffic nominal. Incident resolved.
16:34 Customer notification sent. Post-incident monitoring watch begins (24h).

Impact

Customer Impact

  • 47,000 active sessions unable to complete write operations for up to 2h 0m
  • ~12,400 abandoned checkout attempts recorded in session logs (incomplete; metric pipeline also degraded)
  • Read operations degraded from T+16m; full read blackout from T+33m until replicas were re-provisioned at T+2h 0m (16:22)
  • Mobile app users experienced hard crashes on stale session reconnect (the iOS app does not gracefully handle the 500 responses caused by database unavailability)
  • 3 enterprise customers (Veldtman Logistics, Harker Supply Co., Orinoco Retail Group) opened P1 tickets with Customer Success

Internal Impact

  • DRE: 2 engineers × 4.5h = 9 engineer-hours on restore
  • SRE: 3 engineers × 4.5h = 13.5 engineer-hours on response coordination
  • Platform Engineering: 2 engineers × 3h = 6 engineer-hours on triage + IAM rollback
  • Customer Success: 4 agents × 2h = 8 person-hours on customer communication
  • Planned Q1 platform hardening sprint delayed 1 week while teams focus on action items

Data Impact

  • 47 minutes of write data permanently lost (gap between last snapshot at 13:35 UTC and incident at 14:22 UTC)
  • Transactions during the gap (~3,100 order state updates, ~8,900 session events) were not recoverable; customers were notified individually by Customer Success
  • No payment data lost (payment records stored in a separate cluster with its own RDS instance)

Root Cause

What Happened (Technical)

The engineer had two terminal tabs open in VS Code's integrated terminal, both with working directory ~/infra/terraform. Terraform workspaces were set differently in each tab: tf-dev in Tab 1 and tf-prod in Tab 2. The VS Code terminal title showed only the directory path, not the active Terraform workspace. When the engineer switched tabs after reviewing the dev plan, they issued terraform apply in the prod tab. The shell prompt, configured via oh-my-zsh's agnoster theme, displayed the git branch and directory but had no Terraform workspace indicator.

Terraform's default behavior on apply is to require typing yes at a confirmation prompt. The engineer typed yes reflexively, having done so moments before in the dev tab. Because the prod workspace was configured identically in directory structure and the plan output was not scrutinized — it was assumed to be a repeat of the dev plan — the confirmation was given without noticing the production resource names in the diff.
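A local guardrail can make this reflexive confirmation harmless. The sketch below is illustrative (the tf_guard/tf_apply names and the CI escape hatch are assumptions, not existing team tooling): it refuses an interactive apply whenever the resolved workspace is tf-prod.

```shell
# Sketch of a local pre-apply guard. tf_guard / tf_apply and the CI
# escape hatch are illustrative assumptions, not existing team tooling.
tf_guard() {
  # Allow the apply unless the target workspace ($1) is tf-prod and we
  # are running interactively (CI unset, i.e. outside an approved pipeline).
  if [[ "$1" == "tf-prod" && -z "${CI:-}" ]]; then
    echo "Refusing interactive apply against workspace '$1'." >&2
    return 1
  fi
  return 0
}

# Wrapper: resolve the active workspace, then gate terraform apply on it.
tf_apply() {
  tf_guard "$(terraform workspace show)" && terraform apply "$@"
}
```

Aliasing terraform apply to a wrapper like this in shared dotfiles would have turned this incident into a refused command.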

The PostgreSQL RDS cluster was destroyed in 41 seconds. AWS RDS deletion protection is a flag on the RDS resource that, when enabled, prevents terraform destroy and AWS Console deletion from proceeding. This protection had been explicitly disabled on 2025-02-28 during a scheduled maintenance window to allow a parameter group replacement. The Jira ticket for that maintenance (PLAT-4421) noted "re-enable deletion protection after maintenance" in the description but not as a formal follow-up task. It was never re-enabled.
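For reference, re-enabling and verifying the flag from the CLI looks roughly like the fragment below (cluster identifier from this incident; these are not the exact commands the team ran, and the durable fix belongs in Terraform config per the action items):

```shell
# Re-enable deletion protection on the cluster, then verify the flag.
aws rds modify-db-cluster \
  --db-cluster-identifier nexusdb-prod-primary \
  --deletion-protection \
  --apply-immediately

aws rds describe-db-clusters \
  --db-cluster-identifier nexusdb-prod-primary \
  --query 'DBClusters[0].DeletionProtection'   # should report true
```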

The snapshot schedule — a separate AWS Backup policy — ran independently of the deletion protection flag and continued to create snapshots on the configured 30-minute interval. The most recent completed snapshot at the time of the incident was 47 minutes old, and it became the sole recovery artifact.

Because the RDS cluster was destroyed (not just stopped), all read replicas lost their source cluster and transitioned to an error state within approximately 33 minutes. At that point all read traffic also failed, elevating the incident from a write outage to a full database blackout.

Contributing Factors

  1. No workspace indicator in terminal or prompt: The shell prompt and VS Code terminal title gave no indication of which Terraform workspace was active. Engineers working across multiple workspaces had no ambient signal that they were targeting production.

  2. No apply approval gate for production workspace: Terraform Cloud's team-level approval workflow was configured for the tf-prod workspace but was not enforced for applies initiated from local CLI. The CLI bypass was intentional (for break-glass scenarios) but was not restricted to on-call engineers. Any engineer with AWS credentials and Terraform state access could run a production apply unilaterally.

  3. RDS deletion protection disabled and not tracked as a remediation task: The maintenance window that disabled deletion protection created no ticketed follow-up. The protection flag was not surfaced in any compliance or drift-detection scan. The team had discussed a Terraform drift alerting integration in Q4 but it was not yet implemented.

What We Got Lucky About

  1. The automated snapshot schedule ran independently of the deletion protection flag and was untouched by the incident. The 47-minute-old snapshot was the sole recovery path and it worked without corruption.
  2. Payment records are stored in a separate RDS cluster (nexuspay-prod-primary) not referenced in the destroyed Terraform state. No payment data was affected and no bank reconciliation was required.
  3. The data lost (47 minutes of order state and session events) was in a recoverable business category — Customer Success was able to contact affected users individually rather than facing irreversible financial or compliance data loss.

Detection

How We Detected

PagerDuty fired a critical alert 38 seconds after terraform apply was invoked, triggered by an AWS CloudWatch event rule watching for RDS cluster state changes to DELETING. The on-call SRE acknowledged within 90 seconds. Application-layer error rate alerts (threshold: >5% 5xx for 60 seconds) fired at T+1m as connection pools began exhausting.
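The team's rule was already in place; for teams reproducing this setup, one plausible shape for such a rule is sketched below (the rule name, pattern, and target wiring are assumptions, not the team's actual configuration):

```shell
# Illustrative EventBridge rule: match RDS cluster events in the
# "deletion" category. A real setup would add a target (e.g. an SNS
# topic wired to PagerDuty); names and pattern here are assumptions.
aws events put-rule \
  --name rds-cluster-deletion-alert \
  --event-pattern '{
    "source": ["aws.rds"],
    "detail-type": ["RDS DB Cluster Event"],
    "detail": { "EventCategories": ["deletion"] }
  }'
```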

Why We Didn't Detect Sooner

Detection was essentially immediate (3 minutes to full war room activation) — the CloudWatch event rule was well-configured. The gap is entirely in prevention, not detection. There was no pre-apply gate that would have surfaced a plan summary for human review before destructive changes were executed against production.


Response

What Went Well

  1. The CloudWatch event rule for RDS state changes fired within seconds and produced a clear, actionable alert. On-call SRE was in the war room within 3 minutes of the apply completing.
  2. The automated snapshot schedule continued running independently of deletion protection, providing a viable recovery artifact. The DRE team's familiarity with RDS snapshot restore procedures meant the restore was initiated within 7 minutes of confirmation that the cluster was gone.
  3. Rafael's immediate self-report to the war room eliminated ambiguity about root cause and allowed the team to focus entirely on recovery rather than spending time on hypothesis testing.
  4. SSM Parameter Store as the canonical source of DB connection strings allowed DRE to re-point all services by updating a single parameter rather than redeploying application config.

What Went Poorly

  1. The 47-minute RPO gap was not surfaced during incident response until Marcus ran the snapshot search — there was no documented, regularly tested RDS recovery runbook that would have prompted the team to immediately quantify data loss.
  2. The failed workaround (routing to dev DB) cost 13 minutes. A schema version check should be a documented pre-step in any emergency DB failover procedure.
  3. Read replica behavior after cluster deletion was not well understood by the SRE team. The team assumed replicas would continue serving stale reads indefinitely; they did not anticipate the hard failure at T+33m, which escalated the incident unexpectedly.
  4. Customer notification was delayed by 37 minutes because Customer Success needed legal sign-off on the language describing data loss.
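The RPO-quantification step from item 1 is cheap to script. A sketch for the future runbook (the helper name is illustrative; the commented query assumes automated cluster snapshots and GNU date):

```shell
# Minutes of potential data loss between a snapshot timestamp and "now",
# both given as epoch seconds. Helper name is illustrative.
rpo_minutes() {
  echo $(( ($2 - $1) / 60 ))
}

# Runbook usage sketch (requires AWS credentials; identifier from this
# incident; assumes GNU date for -d):
#   latest=$(aws rds describe-db-cluster-snapshots \
#     --db-cluster-identifier nexusdb-prod-primary \
#     --snapshot-type automated \
#     --query 'max_by(DBClusterSnapshots, &SnapshotCreateTime).SnapshotCreateTime' \
#     --output text)
#   rpo_minutes "$(date -d "$latest" +%s)" "$(date +%s)"
```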

Action Items

  • AI-001 (P0, owner: Rafael Ochoa, In Progress, due 2025-03-21): Add the Terraform workspace name to shell PS1 and the VS Code terminal title via a TF_WORKSPACE env hook; enforce via the onboarding checklist.
  • AI-002 (P0, owner: Platform Engineering Lead Soo-Jin Park, Not Started, due 2025-03-24): Require Terraform Cloud run approval for all prod workspace applies, including CLI-initiated runs; remove the break-glass CLI override or restrict it to the on-call principal only.
  • AI-003 (P0, owner: Marcus Ellroy, DRE, Not Started, due 2025-03-28): Re-enable RDS deletion protection via Terraform resource config (deletion_protection = true) and add an OPA policy check to CI to block any PR that sets it to false without a corresponding maintenance-window label.
  • AI-004 (P1, owner: Tomoko Watase, SRE, Not Started, due 2025-04-11): Write and rehearse an RDS point-in-time restore runbook, including an explicit step for quantifying the RPO gap before beginning the restore; schedule a quarterly GameDay drill.
  • AI-005 (P1, owner: Platform Engineering, Not Started, due 2025-04-25): Implement Terraform drift detection (Atlantis or Terraform Cloud drift runs) to alert on any production resource whose configuration diverges from state, including deletion protection flags.
  • AI-006 (P2, owner: Bianca Ferrara, Customer Success, Not Started, due 2025-04-04): Create a pre-approved data-loss notification template with legal sign-off so Customer Success can notify customers within 10 minutes of SEV-1 data loss confirmation.
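AI-003's CI check could take several forms; one sketch (resource types and the jq filter are assumptions) inspects the plan JSON and fails the pipeline when any RDS resource would end up with deletion protection off:

```shell
# Sketch of a CI gate for AI-003: block the pipeline if the planned
# state leaves deletion_protection = false on an RDS resource.
# Resource types and jq filter are illustrative assumptions.
terraform plan -out=tfplan
terraform show -json tfplan | jq -e '
  [.resource_changes[]?
    | select(.type == "aws_rds_cluster" or .type == "aws_db_instance")
    | select(.change.after.deletion_protection == false)]
  | length == 0' >/dev/null || {
    echo "Blocked: plan sets deletion_protection = false on an RDS resource." >&2
    exit 1
  }
```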

Lessons Learned

  1. Ambient workspace signals prevent tab-confusion errors. Engineers routinely work across multiple terminal sessions targeting different environments. Without a persistent, visible indicator of which environment is active, tab-switching errors are a predictable human failure mode — not an individual mistake. Shell prompt and terminal title configuration should be a standard enforced by tooling, not a personal preference.

  2. Maintenance-window cleanup tasks must be tracked as first-class tickets, not prose comments. "Re-enable deletion protection after maintenance" in a ticket description is not a task. It will not appear in sprint planning, it will not be assigned, and it will not be done. Any state change made for a maintenance window that must be reversed requires a linked follow-up ticket created before the maintenance begins.

  3. Recovery runbooks must be tested to be trusted. The team discovered two gaps in the recovery path (RPO quantification, schema version mismatch check) only under incident pressure. Runbooks that have never been executed in a drill contain unknown failure modes. A quarterly GameDay that simulates RDS loss would have exposed both gaps in a low-pressure environment.


Cross-References

  • Failure Pattern: Human Error — Wrong Environment Target; Configuration Drift — Maintenance Cleanup Not Tracked
  • Topic Packs: terraform-workspaces, rds-backup-restore, incident-response-runbooks, iam-least-privilege
  • Runbook: runbooks/database/rds-cluster-restore-from-snapshot.md (to be created per AI-004)
  • Decision Tree: Triage path — "DB connection errors on all write endpoints" → check RDS cluster state in CloudWatch → if DELETING/DELETED, immediately initiate snapshot restore and page DRE; do not attempt application-layer workarounds before confirming DB state