
Anti-Primer: AWS Troubleshooting

Everything that can go wrong, will — and in this story, it does.

The Setup

A production outage hits at 2 AM. The on-call engineer is troubleshooting an intermittent 503 from the ALB. CloudWatch dashboards are sparse, and the team has no runbooks for this failure mode.

The Timeline

Hour 0: Checking the Wrong Region

The engineer spends 30 minutes looking at us-east-1 metrics when the service actually runs in us-west-2. The deadline was looming, and this seemed like the fastest path forward. The result: the real problem grows worse while the investigation targets the wrong region.

Footgun #1: Checking the Wrong Region — 30 minutes spent on us-east-1 metrics while the service runs in us-west-2; the actual problem grows worse while the engineer investigates the wrong region.

Nobody notices yet. The engineer moves on to the next task.
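The primer's first lesson can be enforced mechanically. Below is a hypothetical pre-flight helper (the function name and error message are illustrative, not from any AWS SDK); it relies only on the standard AWS_REGION / AWS_DEFAULT_REGION environment variables that the AWS CLI and SDKs consult:

```python
import os

def verify_region(expected_region):
    """Fail fast if the current session's region does not match the
    region the service actually runs in."""
    # The AWS CLI and SDKs resolve the region from these env vars
    # (AWS_REGION takes precedence over AWS_DEFAULT_REGION).
    actual = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
    if actual != expected_region:
        raise RuntimeError(
            f"Region mismatch: session resolves to {actual!r}, "
            f"but the service runs in {expected_region!r}"
        )

os.environ["AWS_REGION"] = "us-west-2"
verify_region("us-west-2")  # matches: no exception raised
```

Running this check at the top of an investigation script would have surfaced the mixup in seconds instead of half an hour.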

Hour 1: Restarting Instead of Diagnosing

Restarts the ECS service to 'fix' the 503s without capturing any diagnostic data. Under time pressure, the team chose speed over caution. The result: the problem recurs two hours later, with no evidence left to determine the root cause.

Footgun #2: Restarting Instead of Diagnosing — the ECS service is restarted to 'fix' the 503s without any diagnostic capture; the problem recurs two hours later with no evidence for root-cause analysis.

The first mistake is still invisible, making the next shortcut feel justified.
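The capture-before-acting discipline is easy to sketch. In a real incident the state dict would be filled from calls such as ECS describe-services output, recent log lines, and key CloudWatch metrics; this hypothetical helper only shows the habit of writing a timestamped snapshot to disk before touching anything:

```python
import json
import time

def snapshot_diagnostics(service_state, out_path):
    """Persist a timestamped snapshot of service state BEFORE any
    remediation (restart, scale-up, config change), so evidence
    survives even if the restart makes the symptom disappear."""
    record = {
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "state": service_state,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return out_path

# Illustrative stand-in for real describe-services / log / metric output:
snapshot_diagnostics({"runningCount": 3, "desiredCount": 4},
                     "pre_restart_snapshot.json")
```

Two hours later, when the 503s return, that file is the difference between a root-cause analysis and a guessing game.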

Hour 2: Ignoring Service Limits

Does not check AWS service quotas when scaling up during the incident. Nobody pushed back, because the shortcut looked harmless in the moment. The result: auto-scaling hits the EC2 instance limit, and new instances silently fail to launch.

Footgun #3: Ignoring Service Limits — AWS service quotas go unchecked during an incident scale-up; auto-scaling hits the EC2 instance limit and new instances fail to launch silently.

Pressure is mounting. The team is behind schedule and cutting more corners.
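A headroom check before scaling is a one-liner. The helper below is illustrative: in practice the current usage and the limit would come from EC2 and the Service Quotas API rather than hard-coded numbers, but the arithmetic is the whole lesson:

```python
def has_quota_headroom(current_usage, quota_limit, requested):
    """True only if scaling up by `requested` instances stays within quota."""
    return current_usage + requested <= quota_limit

# 18 instances running against a quota of 20: asking for 5 more
# will silently stall, while asking for 2 is fine.
assert not has_quota_headroom(18, 20, 5)
assert has_quota_headroom(18, 20, 2)
```

Wiring this check into the scale-up path turns a silent launch failure into an explicit, actionable error.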

Hour 3: Changing Multiple Things at Once

Modifies security groups, target group settings, and instance count simultaneously. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: the 503s stop, but the team cannot determine which change fixed them; the real fix remains unknown.

Footgun #4: Changing Multiple Things at Once — security groups, target group settings, and instance count are all modified simultaneously; the 503s stop, but nobody can tell which change fixed them.
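The one-change-at-a-time rule can be expressed as a loop: apply each candidate fix, re-check health, and stop at the first change that restores service. This is a hypothetical sketch (the change list and health check are stand-ins for real AWS operations):

```python
def apply_changes_one_at_a_time(changes, is_healthy):
    """Apply candidate fixes sequentially, re-checking health after each,
    so the team knows exactly which change resolved the incident."""
    for name, apply_fix in changes:
        apply_fix()           # one change...
        if is_healthy():      # ...then observe the effect
            return name       # this change fixed it
    return None               # none of the candidates helped

# Illustrative stand-ins for real remediation steps:
state = {"target_group_fixed": False, "instances_added": False}
changes = [
    ("fix target group", lambda: state.update(target_group_fixed=True)),
    ("add instances",    lambda: state.update(instances_added=True)),
]
culprit = apply_changes_one_at_a_time(changes, lambda: state["target_group_fixed"])
```

Because the loop stops at the first successful change, the later candidates are never applied, and the postmortem can name the actual fix instead of guessing among three.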

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Checking the Wrong Region | The actual problem grows worse while the engineer investigates the wrong region | Primer: always verify the region before starting any investigation |
| 2 | Restarting Instead of Diagnosing | Problem recurs 2 hours later with no evidence to determine root cause | Primer: capture logs, metrics, and state before any remediation action |
| 3 | Ignoring Service Limits | Auto-scaling hits the EC2 instance limit; new instances fail to launch silently | Primer: monitor service quotas and request increases proactively |
| 4 | Changing Multiple Things at Once | 503s stop but the team cannot determine which change fixed it; the real fix is unknown | Primer: change one thing at a time and observe the effect |

Damage Report

  • Downtime: 3-6 hours of degraded or unavailable cloud services
  • Data loss: Possible if storage or database resources were affected
  • Customer impact: API errors, failed transactions, or service unavailability for end users
  • Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
  • Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on checking the wrong region, they would have learned: always verify the region before starting any investigation.
  • Footgun #2: Had they read the section on restarting instead of diagnosing: capture logs, metrics, and state before any remediation action.
  • Footgun #3: Had they read the section on ignoring service limits: monitor service quotas and request increases proactively.
  • Footgun #4: Had they read the section on changing multiple things at once: change one thing at a time and observe the effect.

Cross-References