# Anti-Primer: AWS ECS
Everything that can go wrong, will — and in this story, it does.
## The Setup
A team is deploying their first ECS Fargate service for a customer-facing API. They are adapting a docker-compose setup and assume ECS works the same way. Launch is tomorrow.
## The Timeline

### Hour 0: Task Role vs Execution Role Confusion
The engineer puts S3 permissions on the execution role instead of the task role. The deadline was looming, and this seemed like the fastest path forward. The result: the container can pull its image but cannot access S3 at runtime, and the API returns 500s.
**Footgun #1: Task Role vs Execution Role Confusion** — S3 permissions land on the execution role instead of the task role, so the container can pull images but cannot access S3 at runtime; the API returns 500s.
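In a task definition, the two roles are separate top-level fields. A minimal sketch of the corrected split (the account ID, role names, and family name are illustrative):

```json
{
  "family": "customer-api",
  "taskRoleArn": "arn:aws:iam::123456789012:role/customer-api-task-role",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc"
}
```

The S3 read policy belongs on `customer-api-task-role`, which the application's SDK calls assume at runtime. The execution role only needs what the ECS agent itself uses: ECR image pulls and CloudWatch log writes.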
Nobody notices yet. The engineer moves on to the next task.
### Hour 1: No Health Check Grace Period
ALB health checks start immediately, but the app takes 30 seconds to initialize. Under time pressure, the team chose speed over caution. The result: ECS kills and restarts tasks in an infinite loop.
**Footgun #2: No Health Check Grace Period** — ALB health checks start immediately while the app takes 30 seconds to initialize, so ECS keeps killing and restarting tasks in an infinite loop.
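The grace period is a service-level setting, not part of the task definition. A sketch of the relevant `create-service` input (cluster and service names are illustrative), giving the app twice its 30-second startup time before failing health checks count:

```json
{
  "cluster": "prod-cluster",
  "serviceName": "customer-api",
  "taskDefinition": "customer-api",
  "desiredCount": 2,
  "healthCheckGracePeriodSeconds": 60
}
```

During the grace period, ECS ignores failing ALB health checks instead of replacing the task, which breaks the kill-and-restart loop for slow-starting apps.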
The first mistake is still invisible, making the next shortcut feel justified.
### Hour 2: Hard-coded Task Count
The team sets the desired count to 2 with no auto-scaling; traffic spikes 10x during a promotion. Nobody pushed back because the shortcut looked harmless at the time. The result: both tasks saturate, response times hit 30 seconds, and customers abandon.
**Footgun #3: Hard-coded Task Count** — the desired count is fixed at 2 with no auto-scaling, so a 10x traffic spike during a promotion saturates both tasks; response times hit 30 seconds and customers abandon.
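Target tracking lives in Application Auto Scaling, not in the ECS service itself. A sketch of a target-tracking policy on the service's desired count (cluster name, service name, and target value are illustrative):

```json
{
  "PolicyName": "customer-api-cpu-target",
  "ServiceNamespace": "ecs",
  "ResourceId": "service/prod-cluster/customer-api",
  "ScalableDimension": "ecs:service:DesiredCount",
  "PolicyType": "TargetTrackingScaling",
  "TargetTrackingScalingPolicyConfiguration": {
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }
}
```

The service must first be registered as a scalable target with min and max capacity; ECS then adjusts the desired count to hold average CPU near the target, so a 10x spike adds tasks instead of saturating the fixed pair.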
Pressure is mounting. The team is behind schedule and cutting more corners.
### Hour 3: Logging Driver Misconfigured
The team forgets to configure the awslogs driver, so container stdout goes nowhere. They had gotten away with similar shortcuts before, so nobody raised a flag. The result: when the service fails, there are zero logs to diagnose the issue.
**Footgun #4: Logging Driver Misconfigured** — the awslogs driver is never configured, container stdout goes nowhere, and when the service fails there are zero logs to diagnose the issue.
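Log routing is declared per container in the task definition. A sketch of an `awslogs` configuration (group, region, and prefix are illustrative):

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/customer-api",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "api"
  }
}
```

The log group must already exist, or `awslogs-create-group` must be set to `"true"` and the execution role granted permission to create it; otherwise tasks can fail to start, again with no logs to explain why.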
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
## The Postmortem

### Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Task Role vs Execution Role Confusion | Container can pull images but cannot access S3 at runtime; API returns 500s | Primer: Understand task role (app permissions) vs execution role (ECS agent permissions) |
| 2 | No Health Check Grace Period | ECS keeps killing and restarting tasks in an infinite loop | Primer: Set health check grace period longer than app startup time |
| 3 | Hard-coded Task Count | Both tasks are saturated; response times hit 30 seconds; customers abandon | Primer: Configure ECS service auto-scaling with target tracking |
| 4 | Logging Driver Misconfigured | When the service fails, there are zero logs to diagnose the issue | Primer: Always configure log driver and verify log group exists |
### Damage Report
- Downtime: 3-6 hours of degraded or unavailable service
- Data loss: Possible if storage or database resources were affected
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
## What the Primer Teaches
- Footgun #1: Had the engineer read the primer's section on task role vs execution role confusion, they would have learned to put app permissions on the task role and ECS agent permissions on the execution role.
- Footgun #2: Had the engineer read the primer's section on health check grace periods, they would have learned to set the grace period longer than the app's startup time.
- Footgun #3: Had the engineer read the primer's section on hard-coded task counts, they would have learned to configure ECS service auto-scaling with target tracking.
- Footgun #4: Had the engineer read the primer's section on logging configuration, they would have learned to always configure a log driver and verify the log group exists.
## Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice