
Postmortem: S3 Bucket Policy Change Nearly Deletes All Backup Archives

ID: PM-029
Date: 2025-09-03
Severity: Near-Miss
Duration: 0m (no customer impact)
Time to Detect: 7m (caught during plan review)
Time to Mitigate: 2h 14m (state reconciliation)
Customer Impact: None
Revenue Impact: None
Teams Involved: Infrastructure Engineering, Data Engineering, Compliance, Security
Postmortem Author: Nadia Ferrara
Postmortem Date: 2025-09-05

Executive Summary

On 2025-09-03 at 11:18 UTC, Infrastructure Engineer Dariusz Kowalczyk was running a terraform plan for a routine S3 bucket encryption upgrade when he noticed an unexpected change in the plan output: the lifecycle policy governing transition to Glacier storage and retention of backup archives was being deleted. The root cause was Terraform state drift — someone had manually modified the bucket's lifecycle policy through the AWS Console, causing Terraform's recorded state to diverge from the actual AWS resource. When Terraform reconciled the in-code configuration against the drifted state, it generated a plan that would have removed the lifecycle policy entirely. Without that policy, the bucket's default 7-day object expiration rule would have applied to all 2.3TB of Glacier-tiered backup archives — deleting 18 months of database snapshots, audit logs, and compliance-required data after 7 days. Dariusz caught the destructive change because he read the full plan output rather than scrolling to the resource he intended to change.

Timeline (All times UTC)

Time Event
08:30 Unknown actor (determined later to be Data Engineering on-call) manually modifies the S3 lifecycle policy on vantage-backups-prod via AWS Console, changing the Glacier transition threshold from 90 days to 60 days; does not update Terraform state or open a change ticket
11:11 Dariusz Kowalczyk opens Terraform PR for s3-encryption-upgrade branch; runs terraform plan targeting module.s3_backups
11:14 Terraform plan output is 284 lines; Dariusz begins reviewing from the top
11:18 Dariusz reaches line 241 of the plan output: - lifecycle_rule block marked for deletion; reads the full block and recognizes it as the backup retention policy
11:19 Dariusz runs terraform state show aws_s3_bucket_lifecycle_configuration.backups and sees the state does not match the real AWS resource; concludes state drift
11:21 Dariusz does NOT run terraform apply; opens Slack thread in #infra-incidents: "Found state drift in S3 backup bucket lifecycle policy — plan would delete it. Holding apply. Need eyes on this."
11:25 Infrastructure lead Nadia Ferrara joins thread; confirms the severity (7-day expiry would apply to Glacier archives)
11:30 Data Engineering on-call Brendan Okafor identifies himself as the person who made the manual console change; explains he was testing a retention policy optimization and intended to update Terraform afterward
11:45 Nadia opens change ticket CHG-0892 to formally track the remediation
12:10 Dariusz runs terraform import to pull the real lifecycle configuration into Terraform state
12:55 Terraform code updated in s3-encryption-upgrade branch to reflect the intended 60-day transition; new terraform plan shows only the encryption change (zero lifecycle changes)
13:32 Plan reviewed and approved by Nadia and Security lead Olamide Adeyemi
13:35 terraform apply executed; S3 bucket encryption upgraded; lifecycle policy preserved and correct
14:00 Postmortem opened; change ticket CHG-0892 marked resolved
14:30 Compliance team (Radhika Sharma) notified of the near-miss; regulatory retention scope confirmed: 7-year retention required for audit logs and compliance snapshots

Impact

Customer Impact

None — terraform apply was not executed until the plan was corrected. No AWS resources were modified during the detection and remediation window.

Internal Impact

  • Dariusz Kowalczyk (Infrastructure Engineering): ~3 hours (plan review, investigation, state import, code update, apply)
  • Nadia Ferrara (Infrastructure lead): ~2 hours (incident coordination, plan review, approval)
  • Brendan Okafor (Data Engineering): ~1 hour (explaining the console change, reviewing corrected plan)
  • Olamide Adeyemi (Security): ~1 hour (plan approval, process review)
  • Radhika Sharma (Compliance): ~1 hour (regulatory scope assessment, process notification)
  • Total: approximately 8 engineering-hours

Data Impact

None. No data was deleted, modified, or moved. The 2.3TB of Glacier-tiered archives remain intact and correctly governed by the lifecycle policy.

What Would Have Happened

If Dariusz had run terraform apply without reading the full plan output — a common practice when a plan is long and the engineer is confident in their narrow change — the aws_s3_bucket_lifecycle_configuration.backups resource would have been deleted. This would have left the vantage-backups-prod bucket with only its baseline configuration, which includes a 7-day object expiration rule applied to all storage tiers.

AWS Glacier objects subject to an S3 object expiration rule are permanently deleted when the rule fires. There is no Glacier "undelete" operation and no AWS support path to recover objects after expiration. The 2.3TB of affected data includes: 18 months of nightly PostgreSQL RDS snapshots (the primary recovery point for Vantage Analytics' production database); 14 months of application audit logs (user actions, data access events); 6 months of SOC 2 compliance evidence packages; and 2 years of customer data export archives required for data subject request (DSR) fulfillment under GDPR Article 20.

The compliance scope is the most severe dimension of this scenario. Under SOC 2 Type II, audit logs must be retained for a minimum of 12 months (and Vantage's MSA with enterprise customers requires 7 years). Under GDPR Article 5(1)(e) and the company's data retention policy, DSR archives must be retained for 5 years. Deletion of these records would have been a material audit finding, potentially resulting in loss of SOC 2 certification, breach of enterprise customer MSAs, and GDPR regulatory action. Recovery would have been impossible — Glacier deletion is permanent — and the only remediation would have been an attestation of data loss to regulators and affected customers.

Beyond the compliance dimension, losing 18 months of RDS snapshots would have reduced the operational recovery capability to only the most recent automated backup (AWS automated backups, 35-day maximum retention). Any incident requiring point-in-time recovery beyond 35 days would have had no recovery option.

Root Cause

What Happened (Technical)

Terraform tracks AWS resource configurations in a state file (terraform.tfstate). When an engineer makes a change directly via the AWS Console or CLI, bypassing Terraform, the real AWS resource diverges from what Terraform has recorded in state. This is called state drift. On the next terraform plan that targets the affected resource, Terraform refreshes its picture of the live resource, diffs it against the in-code configuration, and proposes changes to bring the live resource back in line with the code. Crucially, Terraform has no way to know the manual change was intentional: it sees only that code and reality disagree, and it plans to make reality match the code.

In this case, Brendan had manually changed the lifecycle policy's Glacier transition threshold from 90 days to 60 days via the AWS Console. The Terraform code still specified transition_in_days = 90, and Terraform's recorded state still held the old value as well. When Dariusz ran terraform plan to add encryption to the bucket, Terraform refreshed its view of the live resource and evaluated the full module.s3_backups module, including the lifecycle resource.

The refresh surfaced a subtler problem than the changed threshold. The console modification had written the lifecycle rule back with a slightly different API representation: the lifecycle_rule block's id field, set implicitly by the console, no longer matched the id value in Terraform's code. Because Terraform matches lifecycle rules by their id, it treated the console-written rule (unrecognized id) as an unmanaged rule to be removed, and the code-defined rule (expected id, no longer present in the live configuration) as a rule to be created. Its plan therefore showed: delete the lifecycle rule currently live in AWS, and create a replacement with the code's 90-day threshold. Applying this plan would have produced a window, potentially days if the apply was interrupted between the delete and the create, during which the bucket had no lifecycle rule at all, followed by a rule with the wrong threshold (90 days instead of the intended 60).
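The id-matching failure mode described above can be sketched with a toy diff. This is an illustration only, not Terraform's actual planner; the rule ids (backup-retention, console-rule-1) are hypothetical stand-ins for the code-defined and console-written ids.

```python
def plan_lifecycle_rules(code_rules: dict, live_rules: dict) -> list[str]:
    """Toy sketch of id-keyed rule matching (not Terraform's real planner).

    Rules are matched purely by their `id` key: a live rule whose id the
    code does not declare is planned for deletion, and a code rule whose
    id is not live is planned for creation, even if the two rules are
    otherwise near-identical.
    """
    actions = []
    for rule_id in live_rules.keys() - code_rules.keys():
        actions.append(f"- destroy lifecycle_rule {rule_id!r}")
    for rule_id in code_rules.keys() - live_rules.keys():
        actions.append(f"+ create lifecycle_rule {rule_id!r}")
    return sorted(actions)

# The code still declares the original rule id with a 90-day transition...
code = {"backup-retention": {"transition_in_days": 90}}
# ...but the console rewrite stored the rule under a different implicit id.
live = {"console-rule-1": {"transition_in_days": 60}}

for action in plan_lifecycle_rules(code, live):
    print(action)
```

Real Terraform compares far more than ids, but the shape of the outcome is the same: a rule whose identity changed out-of-band shows up as one destroy plus one create, not an in-place update.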

More critically, deleting the live lifecycle rule would have removed the override of the bucket's baseline 7-day object expiration; depending on the bucket's baseline settings, that expiration rule might have fired on Glacier objects during the no-lifecycle window.

Contributing Factors

  1. Manual console change without Terraform state update: Brendan made the console change without creating a change ticket, updating the Terraform code, or running terraform import to sync the state file. This anti-pattern ("ClickOps" in a Terraform-managed environment) is well understood, but the organization had only a written policy against it, with no technical guardrails to enforce it.
  2. Long plan output reduces effective review probability: The terraform plan output was 284 lines because the full module was targeted. The lifecycle policy deletion appeared on line 241. Engineers reviewing long plans under time pressure tend to focus on the top (their intended change) and assume the rest is noise. Dariusz read the full output, but this was above-average diligence, not a reliable process.
  3. Destructive changes (resource deletion) not highlighted or flagged separately: Terraform's default output treats a - resource_block deletion the same as a ~ attribute_change. There was no CI step or workflow step that extracted and separately surfaced all destructive operations (deletions, replacements) for focused human review.

What We Got Lucky About

  1. Dariusz read 241 lines of plan output. The lifecycle policy deletion appeared on line 241 of a 284-line plan. The encryption change Dariusz was actually there to apply appeared in the first 40 lines. The normal behavior — reviewing the top of the output where the intended change is, confirming it looks right, and running apply — would have missed the lifecycle deletion entirely. Dariusz's habit of reading full plan output before applying is not universal on the team and is not enforced by any process.
  2. The console change happened only 2.5 hours before the Terraform run. Had Brendan made the console change weeks or months earlier, the investigation would have been significantly harder: the relevant CloudTrail event would have been buried under thousands of subsequent events. The short gap made the correlation straightforward once Brendan identified himself.

Detection

How We Detected

Dariusz caught the lifecycle policy deletion by reading the complete terraform plan output before running terraform apply. The deletion block appeared roughly 200 lines below the intended encryption change. No automated tooling flagged it.

Why This Almost Wasn't Caught

There was no automated or process control that separately surfaced destructive operations in Terraform plans. The standard workflow on the team was to run terraform plan, scan the output quickly for the expected change, and apply. A 284-line plan with a destructive change buried at line 241 is unlikely to be caught by an engineer focused on a specific resource. This near-miss was prevented by one engineer's individual diligence, which is not a reliable control.

Response

What Went Well

  1. Dariusz immediately stopped the apply workflow when he identified the destructive change and escalated to #infra-incidents before doing anything further. This is the correct instinct and prevented any accidental application.
  2. Brendan proactively identified himself as the source of the console change within 9 minutes of the Slack thread, which accelerated the root cause analysis significantly and avoided a lengthy CloudTrail audit.
  3. The corrected plan was reviewed by both an Infrastructure lead and the Security lead before apply. This multi-eyes review step for corrected plans was the right call given what was at stake.

What Could Have Gone Better

  1. The console change was made without a change ticket, Slack notification, or Terraform state update. There is an existing policy against untracked manual changes to Terraform-managed resources, but it is not enforced technically. The policy existed on paper only.
  2. The terraform plan workflow had no step that extracted and separately highlighted destructive operations. A one-line grep over the plan output for "will be destroyed" or "must be replaced" would have surfaced the lifecycle deletion immediately, before a human had to read 241 lines.
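The proposed check is small enough to sketch. The two marker phrases are the wording Terraform's human-readable plan output uses for destructive operations; the script around them is a hypothetical sketch of the CI step (PM029-01), not the implemented pipeline, and the resource names in the sample are placeholders.

```python
import re

# The phrases Terraform's human-readable plan output uses to announce
# destructive operations on a resource.
DESTRUCTIVE_MARKERS = re.compile(r"will be destroyed|must be replaced")

def destructive_lines(plan_text: str) -> list[str]:
    """Return every plan line that announces a destroy or a replace."""
    return [
        line.strip()
        for line in plan_text.splitlines()
        if DESTRUCTIVE_MARKERS.search(line)
    ]

# Abbreviated, hypothetical plan output: the intended change near the top,
# the destructive surprise far below it.
sample_plan = """\
  # aws_s3_bucket.backups will be updated in-place
      + server_side_encryption_configuration { ... }
  # aws_s3_bucket_lifecycle_configuration.backups will be destroyed
      - lifecycle_rule { ... }
"""

for line in destructive_lines(sample_plan):
    print(line)  # a CI step would post these on the PR and block the merge
```

Run against a real plan, this would be fed from `terraform plan -no-color` output; the point is that the destructive lines surface in a two-line summary instead of at line 241 of 284.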

Action Items

  • PM029-01 (P0, Infrastructure Engineering, In Progress, due 2025-09-10): Add a CI step to all Terraform plan pipelines that extracts and posts all destructive operations (destroy, replace) as a separate summary comment on the PR, blocked from merge until a second engineer explicitly approves the destruction.
  • PM029-02 (P0, Nadia Ferrara, Open, due 2025-09-12): Enable AWS Config rules and CloudTrail alerts for manual changes to S3 lifecycle policies on backup buckets; route alerts to #infra-incidents and open a drift-detection ticket automatically.
  • PM029-03 (P0, Dariusz Kowalczyk, Open, due 2025-09-17): Enable S3 Object Lock (compliance mode) on vantage-backups-prod for objects in Glacier, with a retention period matching the regulatory minimum (7 years for compliance data).
  • PM029-04 (P1, Infrastructure Engineering, Open, due 2025-09-24): Run terraform plan in drift-detection mode across all managed buckets weekly; alert on any unexpected state divergence.
  • PM029-05 (P1, Security / Olamide Adeyemi, Open, due 2025-09-30): Publish and enforce the "no console changes to Terraform-managed resources" policy with a technical guardrail: an SCP denying S3 lifecycle modifications outside the infrastructure CI/CD role.
  • PM029-06 (P1, Radhika Sharma + Infrastructure, Open, due 2025-09-19): Audit all backup and compliance S3 buckets for Object Lock enablement status and retention period compliance; report findings to the Compliance team.
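PM029-04's weekly drift check can be built on terraform plan -detailed-exitcode, which exits 0 for an empty plan, 1 on error, and 2 when the plan contains changes. Below is a minimal sketch of the interpretation logic; the runner invocation (commented out) is hypothetical, including the module path.

```python
def classify_plan_exit(returncode: int) -> str:
    """Map a `terraform plan -detailed-exitcode` result to a drift verdict.

    Documented exit codes: 0 = empty plan (no drift), 1 = plan failed,
    2 = non-empty plan (code, state, and reality disagree somewhere).
    """
    if returncode == 0:
        return "clean"
    if returncode == 2:
        return "drift-detected"  # would alert #infra-incidents and open a ticket
    return "plan-error"

# Hypothetical weekly runner, one Terraform module per managed bucket:
# import subprocess
# rc = subprocess.run(
#     ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
#     cwd="modules/s3_backups",
# ).returncode
# print(classify_plan_exit(rc))
```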

Lessons Learned

  1. Reading the full Terraform plan is not optional for production changes. Long plan outputs bury important information. The team cannot rely on individual engineers having the diligence to read 284-line outputs. The fix is a process that programmatically surfaces destructive operations — not a norm that requires heroic attention to detail.
  2. "ClickOps" in a Terraform-managed environment creates invisible time bombs. Manual console changes do not cause immediate harm, but they create state drift that can manifest as destructive operations in the next Terraform run — potentially weeks or months later, and in a context completely unrelated to the original change. Technical guardrails (SCPs, IAM deny policies) are more reliable than behavioral policies.
  3. For irreplaceable data, object-level immutability is the last line of defense. S3 Object Lock (compliance mode) would have made this scenario impossible: the lifecycle rule could still have been deleted from Terraform, but the underlying Glacier objects would have been protected from deletion by a write-once-read-many (WORM) retention policy enforced by S3 itself, regardless of what the bucket configuration said. Compliance data should be protected at the object level, not just at the bucket policy level.
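For concreteness, the object-level protection in lesson 3 (and action item PM029-03) comes down to a bucket default retention setting. The payload below follows the shape of the S3 PutObjectLockConfiguration API (as exposed by, for example, boto3's put_object_lock_configuration); it is a sketch of the intended setting, not a change that has been applied.

```python
# Bucket-level default retention in the shape the S3
# PutObjectLockConfiguration API expects. In COMPLIANCE mode, no
# principal (including the account root) can shorten or remove the
# retention period, so a deleted lifecycle policy cannot expire objects.
object_lock_configuration = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
        "DefaultRetention": {
            "Mode": "COMPLIANCE",
            "Years": 7,  # regulatory minimum for audit logs and compliance snapshots
        }
    },
}
```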

Cross-References

  • Failure Pattern: Terraform State Drift / Destructive Plan Operation / Manual Infrastructure Change
  • Topic Packs: Terraform State Management, S3 Data Lifecycle, AWS Object Lock, Infrastructure as Code Governance
  • Runbook: INFRA-RB-007 — Terraform State Drift Recovery; INFRA-RB-012 — S3 Backup Integrity Verification
  • Decision Tree: Infrastructure Triage → Unexpected Terraform Plan Change → Is the change destructive? → Yes → Stop, investigate state drift before applying