Anti-Primer: AWS EC2
Everything that can go wrong, will — and in this story, it does.
The Setup
A team is migrating a legacy monolith to EC2 instances behind an ALB. The architect specified instance types months ago, and the engineer launching them trusts the spec without validating current pricing or availability.
The Timeline
Hour 0: No Termination Protection
The engineer launches production instances without termination protection enabled. The deadline was looming, and this seemed like the fastest path forward. The result: a cleanup script meant for dev terminates 3 production instances.
Footgun #1: No Termination Protection. Launching production instances without termination protection lets a cleanup script meant for dev terminate 3 production instances.
Nobody notices yet. The engineer moves on to the next task.
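The guard that was missing in Hour 0 can be sketched in a few lines. This is a minimal illustration, not the team's actual cleanup script: it assumes instances carry an `Environment` tag (a naming convention invented here) and works on dictionaries shaped like a boto3 `describe_instances` response. The `enable_termination_protection` helper assumes it is handed a boto3 EC2 client.

```python
def is_safe_to_terminate(instance, allowed_env="dev"):
    """Return True only if the instance is explicitly tagged with allowed_env.

    The "Environment" tag key is an assumed convention, not an AWS default.
    Untagged instances are treated as unsafe to terminate.
    """
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    return tags.get("Environment") == allowed_env


def select_cleanup_targets(reservations, allowed_env="dev"):
    """Collect the instance IDs a dev cleanup script is allowed to terminate."""
    targets = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            if is_safe_to_terminate(instance, allowed_env):
                targets.append(instance["InstanceId"])
    return targets


def enable_termination_protection(ec2_client, instance_id):
    """Turn on API termination protection for one instance.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )


# Hypothetical sample data in the shape of describe_instances()["Reservations"].
sample_reservations = [
    {
        "Instances": [
            {"InstanceId": "i-dev1", "Tags": [{"Key": "Environment", "Value": "dev"}]},
            {"InstanceId": "i-prod1", "Tags": [{"Key": "Environment", "Value": "prod"}]},
            {"InstanceId": "i-untagged"},
        ]
    }
]
print(select_cleanup_targets(sample_reservations))  # ['i-dev1']
```

The tag filter and termination protection are complementary: the filter keeps the script from selecting production instances, and termination protection makes the API reject the call if the filter is ever wrong.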
Hour 1: Wrong Security Group
The engineer attaches a wide-open security group (0.0.0.0/0 on all ports) copied from dev. Under time pressure, the team chose speed over caution. The result: port-scanning bots find the exposed Redis and Postgres ports within hours.
Footgun #2: Wrong Security Group. Attaching a wide-open security group (0.0.0.0/0 on all ports) copied from dev lets port-scanning bots find the exposed Redis and Postgres ports within hours.
The first mistake is still invisible, making the next shortcut feel justified.
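A pre-attach check for Footgun #2 could look like the sketch below. It assumes the security group is available as a dictionary shaped like one entry of a boto3 `describe_security_groups` response; the group IDs, the port 8080 backend, and the `allowed_ports` parameter are hypothetical and exist only for illustration.

```python
# CIDR blocks that mean "anyone on the internet" (IPv4 and IPv6).
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def find_world_open_rules(security_group, allowed_ports=frozenset()):
    """Return (protocol, from_port, to_port, cidr) for each world-open ingress rule.

    A rule is tolerated only when it exposes exactly one port listed in
    allowed_ports (a hypothetical allow-list, e.g. {443} for a public listener).
    """
    findings = []
    for perm in security_group.get("IpPermissions", []):
        from_port, to_port = perm.get("FromPort"), perm.get("ToPort")
        if from_port is not None and from_port == to_port and from_port in allowed_ports:
            continue
        cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            if cidr in OPEN_CIDRS:
                findings.append((perm.get("IpProtocol"), from_port, to_port, cidr))
    return findings


# Hypothetical groups: the copied dev group allows all protocols from anywhere
# (IpProtocol "-1" carries no port range); the locked-down group only accepts
# traffic from the load balancer's security group.
copied_dev_sg = {
    "GroupId": "sg-hypothetical-dev",
    "IpPermissions": [{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
}
locked_down_sg = {
    "GroupId": "sg-hypothetical-prod",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 8080,
            "ToPort": 8080,
            "IpRanges": [],
            "UserIdGroupPairs": [{"GroupId": "sg-hypothetical-alb"}],
        }
    ],
}
print(find_world_open_rules(copied_dev_sg))   # [('-1', None, None, '0.0.0.0/0')]
print(find_world_open_rules(locked_down_sg))  # []
```

Since the instances in this story sit behind an ALB, referencing the load balancer's security group as the only ingress source is one way to keep Redis and Postgres off the public internet entirely.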
Hour 2: EBS Volume Not Encrypted
The engineer creates instances with default unencrypted EBS volumes. Nobody pushed back because the shortcut looked harmless in the moment. The result: a compliance audit fails, and customer data on unencrypted disks requires a full migration.
Footgun #3: EBS Volume Not Encrypted. Creating instances with default unencrypted EBS volumes leads to a failed compliance audit, and customer data on unencrypted disks requires a full migration.
Pressure is mounting. The team is behind schedule and cutting more corners.
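For Footgun #3, two small checks would have surfaced the problem before the audit did. The sketch below assumes volumes are given as dictionaries shaped like a boto3 `describe_volumes` response and that `ec2_client` is a boto3 EC2 client; the volume IDs are made up. Note that default encryption is a per-region setting and only affects volumes created after it is enabled, which is why existing unencrypted volumes still need a migration.

```python
def unencrypted_volume_ids(volumes):
    """Return the IDs of volumes whose Encrypted flag is missing or False."""
    return [v["VolumeId"] for v in volumes if not v.get("Encrypted", False)]


def ensure_default_encryption(ec2_client):
    """Enable default EBS encryption in the client's region if it is off.

    ec2_client is assumed to be a boto3 EC2 client. Returns True if the
    setting was changed and False if it was already enabled.
    """
    status = ec2_client.get_ebs_encryption_by_default()
    if status.get("EbsEncryptionByDefault", False):
        return False
    ec2_client.enable_ebs_encryption_by_default()
    return True


# Hypothetical sample data in the shape of describe_volumes()["Volumes"].
sample_volumes = [
    {"VolumeId": "vol-aaa", "Encrypted": True},
    {"VolumeId": "vol-bbb", "Encrypted": False},
    {"VolumeId": "vol-ccc"},
]
print(unencrypted_volume_ids(sample_volumes))  # ['vol-bbb', 'vol-ccc']
```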
Hour 3: Instance Store Data Loss
The engineer stores application data on instance store volumes, assuming they are persistent. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: an instance stop/start wipes all cached data, and service cold starts take 2 hours.
Footgun #4: Instance Store Data Loss. Storing application data on instance store volumes, on the assumption that they are persistent, means an instance stop/start wipes all cached data, and service cold starts take 2 hours.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
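A deploy-time check for Footgun #4 might be sketched as follows. The first helper reads the `InstanceStorageSupported` field from dictionaries shaped like a boto3 `describe_instance_types` response. The second works on a mount plan, which is a purely hypothetical structure (path mapped to `"ebs"` or `"instance-store"`) invented here to illustrate the idea; the paths are made up as well.

```python
def instance_store_types(instance_types):
    """Return the names of instance types that come with instance store volumes."""
    return sorted(
        t["InstanceType"] for t in instance_types if t.get("InstanceStorageSupported")
    )


def paths_at_risk(mount_plan, must_persist):
    """Return paths that must survive a stop/start but are placed on instance store.

    mount_plan is a hypothetical mapping of path -> "ebs" or "instance-store".
    """
    return sorted(p for p in must_persist if mount_plan.get(p) == "instance-store")


# Hypothetical sample data in the shape of describe_instance_types()["InstanceTypes"].
sample_types = [
    {"InstanceType": "m5.large", "InstanceStorageSupported": False},
    {"InstanceType": "m5d.large", "InstanceStorageSupported": True},
]
# Hypothetical mount plan for the migrated service.
mount_plan = {
    "/var/lib/app": "instance-store",
    "/var/cache/app": "instance-store",
    "/var/log/app": "ebs",
}
print(instance_store_types(sample_types))                          # ['m5d.large']
print(paths_at_risk(mount_plan, ["/var/lib/app", "/var/log/app"]))  # ['/var/lib/app']
```

Keeping a rebuildable cache on instance store is fine as long as losing it is cheap; the 2-hour cold start in this story suggests that the cache was effectively persistent state and belonged on EBS.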
The Postmortem
Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | No Termination Protection | A cleanup script meant for dev terminates 3 production instances | Primer: Enable termination protection for all production instances |
| 2 | Wrong Security Group | Port scan bots find exposed Redis and Postgres within hours | Primer: Least-privilege security groups; never copy dev SGs to prod |
| 3 | EBS Volume Not Encrypted | Compliance audit fails; customer data on unencrypted disks requires full migration | Primer: Account-level default encryption for EBS |
| 4 | Instance Store Data Loss | Instance stop/start wipes all cached data; service cold-starts take 2 hours | Primer: Use EBS for persistent data; treat instance store as ephemeral |
Damage Report
- Downtime: 3-6 hours of degraded or unavailable cloud services
- Data loss: Possible if storage or database resources were affected
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
What the Primer Teaches
- Footgun #1: The primer's section on termination protection would have taught the engineer to enable termination protection for all production instances.
- Footgun #2: The primer's section on security groups would have taught the engineer to use least-privilege security groups and never copy dev SGs to prod.
- Footgun #3: The primer's section on EBS encryption would have taught the engineer to enable account-level default encryption for EBS.
- Footgun #4: The primer's section on instance store would have taught the engineer to use EBS for persistent data and treat instance store as ephemeral.
Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice