Anti-Primer: S3 Object Storage¶
Everything that can go wrong, will — and in this story, it does.
The Setup¶
A storage engineer is configuring S3 Object Storage for a production workload that requires high durability and availability. The data is business-critical with regulatory retention requirements. The engineer is under pressure to complete the setup before a data migration next week.
The Timeline¶
Hour 0: No Replication Configured¶
Skips replication configuration to simplify the initial setup. The deadline was looming, and this seemed like the fastest path forward. The result: a single disk failure causes data loss, followed by a 72-hour recovery effort from off-site backups.
Footgun #1: No Replication Configured — skips replication configuration to simplify the initial setup, leading to data loss from a single disk failure and a 72-hour recovery effort from off-site backups.
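Replication is typically a few lines of configuration, not a project. A minimal sketch of what an S3-style replication rule looks like, built as the dictionary the `PutBucketReplication` API expects (the bucket names, endpoint, and role ARN below are placeholders, not values from this incident):

```python
def replication_config(destination_bucket_arn: str, role_arn: str) -> dict:
    """Build a bucket replication configuration that mirrors every new object."""
    return {
        "Role": role_arn,  # IAM role the storage service assumes to replicate
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate all objects
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }

# Hypothetical ARNs -- substitute your own buckets and role.
cfg = replication_config(
    "arn:aws:s3:::prod-data-replica",
    "arn:aws:iam::123456789012:role/replication",
)

# Applying it with boto3 (requires credentials, and versioning enabled
# on BOTH source and destination buckets):
#   s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
#   s3.put_bucket_replication(Bucket="prod-data", ReplicationConfiguration=cfg)
```

Note that S3-style replication only applies to objects written after the rule is enabled, which is exactly why it belongs in the day-one setup rather than a later optimization pass.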
Nobody notices yet. The engineer moves on to the next task.
Hour 1: Backup Not Tested¶
Sets up automated backups but never tests a restore. Under time pressure, the team chose speed over caution. The result: when a real disaster strikes, the restore fails because the backup format is incompatible with the current version.
Footgun #2: Backup Not Tested — sets up automated backups but never tests a restore, leading to a failed restore during a real disaster because the backup format is incompatible with the current version.
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Capacity Planning Ignored¶
Provisions storage based on current needs without growth projections. Nobody pushed back because the shortcut looked harmless in the moment. The result: storage fills to 95% within six months, performance degrades severely, and expansion happens as an emergency under pressure.
Footgun #3: Capacity Planning Ignored — provisions storage based on current needs without growth projections, leading to storage filling to 95% in six months, severe performance degradation, and an emergency expansion under pressure.
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Wrong RAID Level for Workload¶
Chooses RAID 5 for a write-heavy database workload. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: write amplification degrades performance by 60%, and the rebuild after a disk failure takes 18 hours.
Footgun #4: Wrong RAID Level for Workload — chooses RAID 5 for a write-heavy database workload, leading to write amplification that degrades performance by 60% and an 18-hour rebuild after a disk failure.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem¶
Root Cause Chain¶
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | No Replication Configured | A single disk failure causes data loss; 72-hour recovery effort from off-site backups | Primer: Configure replication from day one; treat it as a prerequisite, not an optimization |
| 2 | Backup Not Tested | During a real disaster, the restore fails because the backup format is incompatible with the current version | Primer: Regular restore tests; automated backup verification; document the restore procedure |
| 3 | Capacity Planning Ignored | Storage fills to 95% in 6 months; performance degrades severely; emergency expansion under pressure | Primer: Monitor growth rate; alert at 80% capacity; plan expansion before reaching 85% |
| 4 | Wrong RAID Level for Workload | Write amplification degrades performance by 60%; rebuild after a disk failure takes 18 hours | Primer: Match RAID level to workload characteristics; RAID 10 for write-heavy, RAID 5/6 for read-heavy |
Damage Report¶
- Downtime: 4-12 hours of storage unavailability or degraded I/O
- Data loss: High risk; loss is possible if replication or backups are inadequate
- Customer impact: All services dependent on the storage layer experience errors or outages
- Engineering time to remediate: 16-40 engineer-hours for recovery, data verification, and capacity planning
- Reputation cost: Severe: data durability concerns; possible compliance implications
What the Primer Teaches¶
- Footgun #1: If the engineer had read the primer's section on replication, they would have learned: configure replication from day one; treat it as a prerequisite, not an optimization.
- Footgun #2: If the engineer had read the primer's section on backup testing, they would have learned: run regular restore tests, verify backups automatically, and document the restore procedure.
- Footgun #3: If the engineer had read the primer's section on capacity planning, they would have learned: monitor the growth rate, alert at 80% capacity, and plan expansion before reaching 85%.
- Footgun #4: If the engineer had read the primer's section on RAID levels, they would have learned: match the RAID level to workload characteristics — RAID 10 for write-heavy, RAID 5/6 for read-heavy.
Cross-References¶
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice