Anti-Primer: AWS S3 Deep Dive
Everything that can go wrong, will — and in this story, it does.
The Setup
A data engineering team is building a data lake on S3. They need to store sensitive customer records alongside public marketing assets. The engineer uses a single bucket with prefix-based access control to save time.
The Timeline
Hour 0: Public Bucket Policy
The engineer sets a public bucket policy intended for the marketing prefix, but the policy applies to the whole bucket. The deadline was looming, and this seemed like the fastest path forward. The result: customer PII is publicly accessible, and a security researcher discovers it and tweets about it.
Footgun #1: Public Bucket Policy. A public policy meant for the marketing prefix applies to the whole bucket, which exposes customer PII to anyone on the internet.
Nobody notices yet. The engineer moves on to the next task.
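What a prefix-scoped policy looks like can be sketched as follows. This is a minimal illustration, not the team's actual policy: the bucket name `example-data-lake` and both helper functions are hypothetical, and the check only recognises the simplest wildcard forms. Separate buckets plus S3 Block Public Access remain the safer design; this only shows how the `Resource` ARN decides what a public statement exposes.

```python
import json

BUCKET = "example-data-lake"  # hypothetical bucket name


def public_read_policy(bucket: str, prefix: str) -> dict:
    """Allow anonymous s3:GetObject only for keys under one prefix."""
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("refusing to build a bucket-wide public policy")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadOnePrefix",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            # The prefix in the ARN is what keeps other keys private.
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }


def exposes_whole_bucket(policy: dict, bucket: str) -> bool:
    """True if a public Allow statement names every object in the bucket.

    Simplified check: it does not cover every wildcard or condition form.
    """
    everything = f"arn:aws:s3:::{bucket}/*"
    for stmt in policy.get("Statement", []):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        is_public = stmt.get("Principal") in ("*", {"AWS": "*"})
        if stmt.get("Effect") == "Allow" and is_public and everything in resources:
            return True
    return False


scoped = public_read_policy(BUCKET, "marketing/")
print(json.dumps(scoped, indent=2))
print(exposes_whole_bucket(scoped, BUCKET))
```

A check like `exposes_whole_bucket` could run in review or CI before a policy document is ever applied to the bucket.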
Hour 1: No Versioning on Critical Data
The engineer leaves versioning off to reduce storage costs for the customer data prefix. Versioning is a bucket-level setting, so the whole bucket stays unversioned. Under time pressure, the team chose speed over caution. The result: a bulk delete job with the wrong prefix wipes 6 months of customer records with no recovery.
Footgun #2: No Versioning on Critical Data. Versioning is skipped to save on storage, so a bulk delete job with the wrong prefix wipes 6 months of customer records with no way to recover them.
The first mistake is still invisible, making the next shortcut feel justified.
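Two cheap guards against this failure can be sketched as follows. The `VERSIONING_ENABLED` dict has the shape that boto3's `put_bucket_versioning` takes as `VersioningConfiguration`. The `plan_bulk_delete` dry-run helper, its `max_deletes` limit, and the sample keys are hypothetical additions for illustration; the story's team had nothing like them.

```python
# Shape accepted by boto3's put_bucket_versioning(VersioningConfiguration=...).
VERSIONING_ENABLED = {"Status": "Enabled"}


def plan_bulk_delete(keys, prefix, max_deletes=1000):
    """Return the keys a delete job would remove, without deleting anything."""
    # An empty prefix matches every key, so refuse it outright.
    if not prefix or not prefix.endswith("/"):
        raise ValueError("prefix must be non-empty and end with '/'")
    doomed = [k for k in keys if k.startswith(prefix)]
    # A surprisingly large match count is a hint that the prefix is wrong.
    if len(doomed) > max_deletes:
        raise RuntimeError(
            f"{len(doomed)} keys match {prefix!r}; limit is {max_deletes}"
        )
    return doomed


keys = ["customers/2024/a.json", "customers/2024/b.json", "tmp/scratch.csv"]
print(plan_bulk_delete(keys, "tmp/"))  # ['tmp/scratch.csv']
```

With versioning enabled, a delete leaves a delete marker and the earlier versions stay recoverable; the dry run simply makes a wrong prefix visible before anything is removed.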
Hour 2: Lifecycle Rule Deletes Too Soon
The engineer creates a lifecycle rule that transitions objects to Glacier after 30 days and deletes them after 90. Nobody pushed back because the shortcut looked harmless in the moment. The result: compliance requires 7-year retention, but data is deleted after 90 days and the audit fails.
Footgun #3: Lifecycle Rule Deletes Too Soon. A lifecycle rule deletes objects after 90 days even though compliance requires 7-year retention, so the audit fails.
Pressure is mounting. The team is behind schedule and cutting more corners.
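The mismatch between the rule and the retention requirement is mechanical enough to check before the rule goes live. The sketch below assumes the rule is expressed in the dict shape that boto3's `put_bucket_lifecycle_configuration` accepts under `Rules`; the rule ID, the `customers/` prefix, and the `retention_violations` helper are hypothetical, and 7 years is approximated as 7 × 365 days.

```python
RETENTION_DAYS = 7 * 365  # 7-year requirement, ignoring leap days

# The rule from the story, in the shape used by boto3 lifecycle rules.
story_rule = {
    "ID": "archive-then-delete",  # hypothetical rule name
    "Filter": {"Prefix": "customers/"},  # hypothetical prefix
    "Status": "Enabled",
    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 90},
}


def retention_violations(rules, min_days):
    """Return IDs of enabled rules that expire objects before min_days."""
    bad = []
    for rule in rules:
        days = rule.get("Expiration", {}).get("Days")
        if rule.get("Status") == "Enabled" and days is not None and days < min_days:
            bad.append(rule.get("ID", "<unnamed>"))
    return bad


print(retention_violations([story_rule], RETENTION_DAYS))
```

Only day-based expiration is checked here; rules that expire on a fixed date or that target noncurrent versions would need their own handling.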
Hour 3: No Server-Side Encryption Default
The engineer relies on clients to encrypt uploads, and some services skip encryption. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: 40% of objects are unencrypted, and a compliance scan flags the entire bucket.
Footgun #4: No Server-Side Encryption Default. Encryption is left to each client and some services skip it, so 40% of objects are unencrypted and a compliance scan flags the entire bucket.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
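Bucket-level default encryption removes the dependency on every client getting it right. The sketch below builds the configuration in the shape boto3's `put_bucket_encryption` takes as `ServerSideEncryptionConfiguration`, and measures how many objects lack encryption from `head_object`-style metadata. Both helper functions and the sample keys are hypothetical, and the sample metadata is invented to mirror the 40% figure in the story.

```python
def default_encryption_config(kms_key_id=None):
    """Build a default-encryption config: SSE-KMS if a key is given, else SSE-S3."""
    if kms_key_id:
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        default = {"SSEAlgorithm": "AES256"}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}


def unencrypted_share(heads):
    """Fraction of objects whose metadata has no ServerSideEncryption field.

    heads maps object key -> a head_object-style metadata dict.
    """
    if not heads:
        return 0.0
    missing = [k for k, meta in heads.items() if not meta.get("ServerSideEncryption")]
    return len(missing) / len(heads)


# Invented sample: 2 of 5 objects were uploaded without encryption.
sample = {
    "customers/a.json": {"ServerSideEncryption": "AES256"},
    "customers/b.json": {},
    "customers/c.json": {"ServerSideEncryption": "aws:kms"},
    "customers/d.json": {},
    "marketing/logo.png": {"ServerSideEncryption": "AES256"},
}
print(default_encryption_config())
print(unencrypted_share(sample))
```

A bucket default only applies to objects written after it is set, so objects that are already unencrypted would still need to be found and rewritten.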
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Public Bucket Policy | Customer PII is publicly accessible; discovered by a security researcher who tweets about it | Primer: Separate buckets for public and private data; use S3 Block Public Access |
| 2 | No Versioning on Critical Data | A bulk delete job with a wrong prefix wipes 6 months of customer records with no recovery | Primer: Enable versioning on all buckets with important data |
| 3 | Lifecycle Rule Deletes Too Soon | Compliance requires 7-year retention; data is deleted after 90 days and the audit fails | Primer: Align lifecycle rules with compliance requirements before enabling |
| 4 | No Server-Side Encryption Default | 40% of objects are unencrypted; compliance scan flags the entire bucket | Primer: Default encryption at the bucket level (SSE-S3 or SSE-KMS) |
Damage Report
- Downtime: 3-6 hours of degraded or unavailable cloud services
- Data loss: 6 months of customer records wiped with no recovery, plus any data expired by the 90-day lifecycle rule
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
What the Primer Teaches
- Footgun #1: The primer's section on public bucket policies would have taught the engineer to separate buckets for public and private data and to use S3 Block Public Access.
- Footgun #2: The primer's section on versioning would have taught the engineer to enable versioning on all buckets with important data.
- Footgun #3: The primer's section on lifecycle rules would have taught the engineer to align lifecycle rules with compliance requirements before enabling them.
- Footgun #4: The primer's section on server-side encryption would have taught the engineer to set default encryption at the bucket level (SSE-S3 or SSE-KMS).
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice