# Postmortem: S3 Bucket Policy Change Nearly Deletes All Backup Archives
| Field | Value |
|---|---|
| ID | PM-029 |
| Date | 2025-09-03 |
| Severity | Near-Miss |
| Duration | 0m (no customer impact) |
| Time to Detect | 7m (caught during plan review) |
| Time to Mitigate | 2h 14m (state reconciliation) |
| Customer Impact | None |
| Revenue Impact | None |
| Teams Involved | Infrastructure Engineering, Data Engineering, Compliance, Security |
| Postmortem Author | Nadia Ferrara |
| Postmortem Date | 2025-09-05 |
## Executive Summary
On 2025-09-03 at 11:18 UTC, Infrastructure Engineer Dariusz Kowalczyk was running a terraform plan for a routine S3 bucket encryption upgrade when he noticed an unexpected change in the plan output: the lifecycle policy governing transition to Glacier storage and retention of backup archives was being deleted. The root cause was Terraform state drift — someone had manually modified the bucket's lifecycle policy through the AWS Console, causing Terraform's recorded state to diverge from the actual AWS resource. When Terraform reconciled the in-code configuration against the drifted state, it generated a plan that would have removed the lifecycle policy entirely. Without that policy, the bucket's default 7-day object expiration rule would have applied to all 2.3TB of Glacier-tiered backup archives — deleting 18 months of database snapshots, audit logs, and compliance-required data after 7 days. Dariusz caught the destructive change because he read the full plan output rather than scrolling to the resource he intended to change.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 08:30 | Unknown actor (determined later to be Data Engineering on-call) manually modifies the S3 lifecycle policy on vantage-backups-prod via AWS Console, changing the Glacier transition threshold from 90 days to 60 days; does not update Terraform state or open a change ticket |
| 11:11 | Dariusz Kowalczyk opens Terraform PR for s3-encryption-upgrade branch; runs terraform plan targeting module.s3_backups |
| 11:14 | Terraform plan output is 284 lines; Dariusz begins reviewing from the top |
| 11:18 | Dariusz reaches line 241 of the plan output: - lifecycle_rule block marked for deletion; reads the full block and recognizes it as the backup retention policy |
| 11:19 | Dariusz runs terraform state show aws_s3_bucket_lifecycle_configuration.backups and sees the state does not match the real AWS resource; concludes state drift |
| 11:21 | Dariusz does NOT run terraform apply; opens Slack thread in #infra-incidents: "Found state drift in S3 backup bucket lifecycle policy — plan would delete it. Holding apply. Need eyes on this." |
| 11:25 | Infrastructure lead Nadia Ferrara joins thread; confirms the severity (7-day expiry would apply to Glacier archives) |
| 11:30 | Data Engineering on-call Brendan Okafor identifies himself as the person who made the manual console change; explains he was testing a retention policy optimization and intended to update Terraform afterward |
| 11:45 | Nadia opens change ticket CHG-0892 to formally track the remediation |
| 12:10 | Dariusz runs terraform import to pull the real lifecycle configuration into Terraform state |
| 12:55 | Terraform code updated in s3-encryption-upgrade branch to reflect the intended 60-day transition; new terraform plan shows only the encryption change (zero lifecycle changes) |
| 13:32 | Plan reviewed and approved by Nadia and Security lead Olamide Adeyemi |
| 13:35 | terraform apply executed; S3 bucket encryption upgraded; lifecycle policy preserved and correct |
| 14:00 | Postmortem opened; change ticket CHG-0892 marked resolved |
| 14:30 | Compliance team (Radhika Sharma) notified of the near-miss; regulatory retention scope confirmed: 7-year retention required for audit logs and compliance snapshots |
## Impact
### Customer Impact
None — terraform apply was not executed until the plan was corrected. No AWS resources were modified during the detection and remediation window.
### Internal Impact
- Dariusz Kowalczyk (Infrastructure Engineering): ~3 hours (plan review, investigation, state import, code update, apply)
- Nadia Ferrara (Infrastructure lead): ~2 hours (incident coordination, plan review, approval)
- Brendan Okafor (Data Engineering): ~1 hour (explaining the console change, reviewing corrected plan)
- Olamide Adeyemi (Security): ~1 hour (plan approval, process review)
- Radhika Sharma (Compliance): ~1 hour (regulatory scope assessment, process notification)
- Total: approximately 8 engineering-hours
### Data Impact
None. No data was deleted, modified, or moved. The 2.3TB of Glacier-tiered archives remain intact and correctly governed by the lifecycle policy.
## What Would Have Happened
If Dariusz had run terraform apply without reading the full plan output — a common practice when a plan is long and the engineer is confident in their narrow change — the aws_s3_bucket_lifecycle_configuration.backups resource would have been deleted. This would have left the vantage-backups-prod bucket with only its baseline configuration, which includes a 7-day object expiration rule applied to all storage tiers.
AWS Glacier objects subject to an S3 object expiration rule are permanently deleted when the rule fires. There is no Glacier "undelete" operation and no AWS support path to recover objects after expiration. The 2.3TB of affected data includes: 18 months of nightly PostgreSQL RDS snapshots (the primary recovery point for Vantage Analytics' production database); 14 months of application audit logs (user actions, data access events); 6 months of SOC 2 compliance evidence packages; and 2 years of customer data export archives required for data subject request (DSR) fulfillment under GDPR Article 20.
The compliance scope is the most severe dimension of this scenario. Under SOC 2 Type II, audit logs must be retained for a minimum of 12 months (and Vantage's MSA with enterprise customers requires 7 years). Under GDPR Article 5(1)(e) and the company's data retention policy, DSR archives must be retained for 5 years. Deletion of these records would have been a material audit finding, potentially resulting in loss of SOC 2 certification, breach of enterprise customer MSAs, and GDPR regulatory action. Recovery would have been impossible — Glacier deletion is permanent — and the only remediation would have been an attestation of data loss to regulators and affected customers.
Beyond the compliance dimension, losing 18 months of RDS snapshots would have reduced the operational recovery capability to only the most recent automated backup (AWS automated backups, 35-day maximum retention). Any incident requiring point-in-time recovery beyond 35 days would have had no recovery option.
## Root Cause
### What Happened (Technical)
Terraform tracks AWS resource configurations in a state file (terraform.tfstate). When an engineer makes a change directly via the AWS Console or CLI — bypassing Terraform — the real AWS resource diverges from what Terraform has recorded in state. This is called state drift. On the next terraform plan that targets the affected resource, Terraform compares its in-code configuration against the resource as it now actually exists, then proposes changes to bring the real resource in line with the code. Crucially: Terraform cannot distinguish an intentional out-of-band change from unwanted drift. It sees only that the code and the real resource disagree, and it plans to make the real resource match the code.
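This reconciliation logic can be sketched in miniature. The following is a hypothetical toy model, not Terraform's actual implementation: given the desired configuration (code) and the actual resource, it plans whatever actions make the resource match the code, so an unrecorded console change surfaces only as something to revert.

```python
# Toy model of Terraform-style reconciliation (illustration only, not
# Terraform internals): plan the actions that make the real resource
# match the code. The intent behind an out-of-band change is invisible;
# it simply shows up as a change to revert.

def plan(code: dict, real: dict) -> list[str]:
    """Return planned actions, in Terraform's +/~/- notation."""
    actions = []
    for key in sorted(code.keys() | real.keys()):
        if key not in real:
            actions.append(f"+ create {key} = {code[key]}")
        elif key not in code:
            actions.append(f"- destroy {key} (present in AWS, absent in code)")
        elif code[key] != real[key]:
            actions.append(f"~ change {key}: {real[key]} -> {code[key]}")
    return actions

# The code still says 90 days; the console change set the live value to 60.
code = {"transition_in_days": 90}
real = {"transition_in_days": 60}
print(plan(code, real))  # → ['~ change transition_in_days: 60 -> 90']
```

Note that the plan "reverts" the manual change back to 90 days: the code, not the console, is treated as the source of truth.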
In this case, Brendan had manually changed the lifecycle policy's Glacier transition threshold from 90 days to 60 days via the AWS Console. The Terraform code still had transition_in_days = 90. Terraform's state still reflected the old value (90 days). The real AWS resource now had 60 days. When Dariusz ran terraform plan to add encryption to the bucket, Terraform evaluated the full module.s3_backups module — including the lifecycle resource. It saw code saying 90 days, state saying 90 days (matching), and correctly determined no change was needed for that field.
However, Brendan's console modification had also re-created the lifecycle rule under a slightly different API representation. The lifecycle_rule block's id field — set implicitly by the console — did not match the id value in Terraform's code. Terraform interpreted the unrecognized id as a new, unmanaged rule, and the absence of the expected id as a rule that had been deleted outside of Terraform. Its plan therefore showed: destroy the console-created lifecycle rule, and create the rule defined in code with its 90-day threshold. The net effect of applying this plan would have been a window — brief if the apply completed in one pass, potentially longer if it failed between the delete and the create — during which the bucket had no lifecycle rule at all, followed by a rule with the wrong threshold (90 days instead of the intended 60).
More critically, deleting the live lifecycle rule would have removed the configuration that superseded the bucket's default 7-day expiration rule, and depending on the bucket's default settings, that expiration rule might have fired on Glacier objects during the no-lifecycle window.
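The id-mismatch mechanism can be illustrated with the same kind of toy model (again hypothetical, matching rules by their id field as the plan output suggested): a rule recreated under a different id looks like one deletion plus one creation, not an in-place update.

```python
# Toy illustration (not Terraform internals): lifecycle rules are matched
# by their "id" field. A console-created rule with an unexpected id makes
# the coded rule look missing (=> create) and the live rule look unmanaged
# (=> destroy), instead of producing a single in-place update.

def plan_rules(coded: list[dict], live: list[dict]) -> list[str]:
    coded_ids = {r["id"] for r in coded}
    live_ids = {r["id"] for r in live}
    actions = []
    for rule_id in sorted(live_ids - coded_ids):
        actions.append(f"- destroy lifecycle rule {rule_id!r}")
    for rule_id in sorted(coded_ids - live_ids):
        actions.append(f"+ create lifecycle rule {rule_id!r}")
    return actions

coded = [{"id": "backup-retention", "transition_in_days": 90}]
live = [{"id": "console-rule-1", "transition_in_days": 60}]  # console-set id
print(plan_rules(coded, live))  # a destroy + create pair, not an update
```

The id values here are placeholders; the real ids were not recorded in this postmortem.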
### Contributing Factors
- Manual console change without Terraform state update: Brendan made a console change without creating a change ticket, updating Terraform code, or running `terraform import` to sync the state file. This is a well-understood anti-pattern ("ClickOps in a Terraform-managed environment") but had not been enforced via guardrails. The organization had a policy against it but no technical controls.
- Long plan output reduces effective review probability: The `terraform plan` output was 284 lines because the full module was targeted. The lifecycle policy deletion appeared on line 241. Engineers reviewing long plans under time pressure tend to focus on the top (their intended change) and assume the rest is noise. Dariusz read the full output, but this was above-average diligence, not a reliable process.
- Destructive changes (resource deletions) not highlighted or flagged separately: Terraform's default output presents a `-` resource deletion in the same stream as a `~` attribute change. There was no CI step or workflow step that extracted and separately surfaced all destructive operations (deletions, replacements) for focused human review.
### What We Got Lucky About
- Dariusz read 241 lines of plan output. The lifecycle policy deletion appeared on line 241 of a 284-line plan. The encryption change Dariusz was actually there to apply appeared in the first 40 lines. The normal behavior — reviewing the top of the output where the intended change is, confirming it looks right, and running apply — would have missed the lifecycle deletion entirely. Dariusz's habit of reading full plan output before applying is not universal on the team and is not enforced by any process.
- The console change happened 2.5 hours before the Terraform run. If Brendan had made the console change and Dariusz had run the Terraform plan on the same day within minutes, the incident investigation would have been significantly harder — the CloudTrail event would have been buried deeper. The 2.5-hour gap made the correlation straightforward once Brendan identified himself.
## Detection
### How We Detected
Dariusz caught the lifecycle policy deletion by reading the complete terraform plan output before running terraform apply. The deletion block appeared roughly 200 lines below the intended encryption change. No automated tooling flagged it.
### Why This Almost Wasn't Caught
There was no automated or process control that separately surfaced destructive operations in Terraform plans. The standard workflow on the team was to run terraform plan, review the output quickly for the expected change, and apply. A 284-line plan with a destructive change buried in line 241 is unlikely to be caught by an engineer who is focused on a specific resource. This near-miss was prevented by one engineer's individual diligence, which is not a reliable control.
## Response
### What Went Well
- Dariusz immediately stopped the apply workflow when he identified the destructive change and escalated to `#infra-incidents` before doing anything further. This is the correct instinct and prevented any accidental application.
- Brendan proactively identified himself as the source of the console change within 9 minutes of the Slack thread, which accelerated the root cause analysis significantly and avoided a lengthy CloudTrail audit.
- The corrected plan was reviewed by both an Infrastructure lead and the Security lead before apply. This multi-eyes review step for corrected plans was the right call given what was at stake.
### What Could Have Gone Better
- The console change was made without a change ticket, Slack notification, or Terraform state update. There is an existing policy against untracked manual changes to Terraform-managed resources, but it is not enforced technically. The policy existed on paper only.
- The `terraform plan` workflow had no step that extracted and separately highlighted destructive operations. A one-line `grep` for `# aws_` entries with `will be destroyed` or `must be replaced` would have surfaced the lifecycle deletion immediately, before a human read 241 lines.
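A minimal version of that guardrail can be sketched as follows. This is a sketch, assuming the standard phrasing of Terraform's human-readable plan output ("will be destroyed" / "must be replaced" comment lines); a production version would run in CI and post the findings to the PR.

```python
import re

# Extract destructive operations from `terraform plan` text output.
# Assumes Terraform's standard comment phrasing:
#   "# <address> will be destroyed" and "# <address> must be replaced".
DESTRUCTIVE = re.compile(r"^\s*#\s+(\S+)\s+(will be destroyed|must be replaced)")

def destructive_ops(plan_text: str) -> list[tuple[str, str]]:
    """Return (resource_address, reason) for every destructive plan entry."""
    ops = []
    for line in plan_text.splitlines():
        m = DESTRUCTIVE.match(line)
        if m:
            ops.append((m.group(1), m.group(2)))
    return ops

sample = """\
  # aws_s3_bucket.backups will be updated in-place
  # aws_s3_bucket_lifecycle_configuration.backups will be destroyed
"""
print(destructive_ops(sample))
# → [('aws_s3_bucket_lifecycle_configuration.backups', 'will be destroyed')]
```

In-place updates pass through silently; only deletions and replacements are surfaced for explicit human review.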
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM029-01 | Add CI step to all Terraform plan pipelines that extracts and posts all destructive operations (destroy, replace) as a separate summary comment on the PR, blocked from merge until a second engineer explicitly approves the destruction | P0 | Infrastructure Engineering | In Progress | 2025-09-10 |
| PM029-02 | Enable AWS Config rules and CloudTrail alerts for manual changes to S3 lifecycle policies on backup buckets; route alerts to #infra-incidents and open a drift-detection ticket automatically | P0 | Nadia Ferrara | Open | 2025-09-12 |
| PM029-03 | Enable S3 Object Lock (compliance mode) on vantage-backups-prod for objects in Glacier, with a retention period matching the regulatory minimum (7 years for compliance data) | P0 | Dariusz Kowalczyk | Open | 2025-09-17 |
| PM029-04 | Run terraform plan in drift-detection mode across all managed buckets weekly; alert on any unexpected state divergence | P1 | Infrastructure Engineering | Open | 2025-09-24 |
| PM029-05 | Publish and enforce "no console changes to Terraform-managed resources" policy with technical guardrail: SCP denying S3 lifecycle modifications outside the infrastructure CI/CD role | P1 | Security / Olamide Adeyemi | Open | 2025-09-30 |
| PM029-06 | Audit all backup and compliance S3 buckets for Object Lock enablement status and retention period compliance; report findings to Compliance team | P1 | Radhika Sharma + Infrastructure | Open | 2025-09-19 |
## Lessons Learned
- Reading the full Terraform plan is not optional for production changes. Long plan outputs bury important information. The team cannot rely on individual engineers having the diligence to read 284-line outputs. The fix is a process that programmatically surfaces destructive operations — not a norm that requires heroic attention to detail.
- "ClickOps" in a Terraform-managed environment creates invisible time bombs. Manual console changes do not cause immediate harm, but they create state drift that can manifest as destructive operations in the next Terraform run — potentially weeks or months later, and in a context completely unrelated to the original change. Technical guardrails (SCPs, IAM deny policies) are more reliable than behavioral policies.
- For irreplaceable data, object-level immutability is the last line of defense. S3 Object Lock (compliance mode) would have made this scenario impossible: the lifecycle rule could have been deleted from Terraform, but the underlying Glacier objects would have been protected from deletion by a WORM retention policy enforced by S3 itself, regardless of what the bucket configuration said. Compliance data should be protected at the object level, not just at the bucket policy level.
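The guardrail named in action item PM029-05 (an SCP denying S3 lifecycle modifications outside the infrastructure CI/CD role) could take roughly this shape. This is a hypothetical sketch; the role ARN pattern `infra-cicd-*` is a placeholder, not the organization's actual role name.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLifecycleChangesOutsideCICD",
      "Effect": "Deny",
      "Action": "s3:PutLifecycleConfiguration",
      "Resource": "arn:aws:s3:::vantage-backups-prod",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/infra-cicd-*"
        }
      }
    }
  ]
}
```

With a policy like this attached, a console or CLI lifecycle change made by any principal other than the CI/CD role is denied outright, turning the paper policy into a technical control.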
## Cross-References
- Failure Pattern: Terraform State Drift / Destructive Plan Operation / Manual Infrastructure Change
- Topic Packs: Terraform State Management, S3 Data Lifecycle, AWS Object Lock, Infrastructure as Code Governance
- Runbook: INFRA-RB-007 — Terraform State Drift Recovery; INFRA-RB-012 — S3 Backup Integrity Verification
- Decision Tree: Infrastructure Triage → Unexpected Terraform Plan Change → Is the change destructive? → Yes → Stop, investigate state drift before applying