
Anti-Primer: AWS Troubleshooting

Everything that can go wrong, will — and in this story, it does.

The Setup

A production outage hits at 2 AM. The on-call engineer is troubleshooting an intermittent 503 from the ALB. CloudWatch dashboards are sparse, and the team has no runbooks for this failure mode.

The Timeline

Hour 0: Checking the Wrong Region

The engineer spends 30 minutes looking at us-east-1 metrics when the service actually runs in us-west-2. The deadline was looming, and this seemed like the fastest path forward. The result: the real problem grows worse while the investigation targets the wrong region.

Footgun #1: Checking the Wrong Region — 30 minutes spent on us-east-1 metrics while the service runs in us-west-2; the actual problem grows worse while the engineer investigates the wrong region.

Nobody notices yet. The engineer moves on to the next task.
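The primer's first lesson can be enforced mechanically. Below is a hypothetical pre-flight helper (the function name and error message are illustrative, not from any AWS SDK); it relies only on the standard AWS_REGION / AWS_DEFAULT_REGION environment variables that the AWS CLI and SDKs consult:

```python
import os

def verify_region(expected_region):
    """Fail fast if the current session's region does not match the
    region the service actually runs in."""
    # The AWS CLI and SDKs resolve the region from these env vars
    # (AWS_REGION takes precedence over AWS_DEFAULT_REGION).
    actual = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
    if actual != expected_region:
        raise RuntimeError(
            f"Region mismatch: session resolves to {actual!r}, "
            f"but the service runs in {expected_region!r}"
        )

os.environ["AWS_REGION"] = "us-west-2"
verify_region("us-west-2")  # matches: no exception raised
```

Running this check at the top of an investigation script would have surfaced the mixup in seconds instead of half an hour.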

Hour 1: Restarting Instead of Diagnosing

Restarts the ECS service to 'fix' the 503s without capturing any diagnostic data. Under time pressure, the team chose speed over caution. The result: the problem recurs two hours later, with no evidence left to determine the root cause.

Footgun #2: Restarting Instead of Diagnosing — the ECS service is restarted to 'fix' the 503s without any diagnostic capture; the problem recurs two hours later with no evidence for root-cause analysis.

The first mistake is still invisible, making the next shortcut feel justified.
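The capture-before-acting discipline is easy to sketch. In a real incident the state dict would be filled from calls such as ECS describe-services output, recent log lines, and key CloudWatch metrics; this hypothetical helper only shows the habit of writing a timestamped snapshot to disk before touching anything:

```python
import json
import time

def snapshot_diagnostics(service_state, out_path):
    """Persist a timestamped snapshot of service state BEFORE any
    remediation (restart, scale-up, config change), so evidence
    survives even if the restart makes the symptom disappear."""
    record = {
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "state": service_state,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return out_path

# Illustrative stand-in for real describe-services / log / metric output:
snapshot_diagnostics({"runningCount": 3, "desiredCount": 4},
                     "pre_restart_snapshot.json")
```

Two hours later, when the 503s return, that file is the difference between a root-cause analysis and a guessing game.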

Hour 2: Ignoring Service Limits

Does not check AWS service quotas when scaling up during the incident. Nobody pushed back, because the shortcut looked harmless in the moment. The result: auto-scaling hits the EC2 instance limit, and new instances silently fail to launch.

Footgun #3: Ignoring Service Limits — AWS service quotas go unchecked during an incident scale-up; auto-scaling hits the EC2 instance limit and new instances fail to launch silently.

Pressure is mounting. The team is behind schedule and cutting more corners.
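A headroom check before scaling is a one-liner. The helper below is illustrative: in practice the current usage and the limit would come from EC2 and the Service Quotas API rather than hard-coded numbers, but the arithmetic is the whole lesson:

```python
def has_quota_headroom(current_usage, quota_limit, requested):
    """True only if scaling up by `requested` instances stays within quota."""
    return current_usage + requested <= quota_limit

# 18 instances running against a quota of 20: asking for 5 more
# will silently stall, while asking for 2 is fine.
assert not has_quota_headroom(18, 20, 5)
assert has_quota_headroom(18, 20, 2)
```

Wiring this check into the scale-up path turns a silent launch failure into an explicit, actionable error.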

Hour 3: Changing Multiple Things at Once

Modifies security groups, target group settings, and instance count simultaneously. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: the 503s stop, but the team cannot determine which change fixed them; the real fix remains unknown.

Footgun #4: Changing Multiple Things at Once — security groups, target group settings, and instance count are all modified simultaneously; the 503s stop, but nobody can tell which change fixed them.
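The one-change-at-a-time rule can be expressed as a loop: apply each candidate fix, re-check health, and stop at the first change that restores service. This is a hypothetical sketch (the change list and health check are stand-ins for real AWS operations):

```python
def apply_changes_one_at_a_time(changes, is_healthy):
    """Apply candidate fixes sequentially, re-checking health after each,
    so the team knows exactly which change resolved the incident."""
    for name, apply_fix in changes:
        apply_fix()           # one change...
        if is_healthy():      # ...then observe the effect
            return name       # this change fixed it
    return None               # none of the candidates helped

# Illustrative stand-ins for real remediation steps:
state = {"target_group_fixed": False, "instances_added": False}
changes = [
    ("fix target group", lambda: state.update(target_group_fixed=True)),
    ("add instances",    lambda: state.update(instances_added=True)),
]
culprit = apply_changes_one_at_a_time(changes, lambda: state["target_group_fixed"])
```

Because the loop stops at the first successful change, the later candidates are never applied, and the postmortem can name the actual fix instead of guessing among three.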

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Checking the Wrong Region | The actual problem grows worse while the engineer investigates the wrong region | Primer: always verify the region before starting any investigation |
| 2 | Restarting Instead of Diagnosing | Problem recurs 2 hours later with no evidence to determine root cause | Primer: capture logs, metrics, and state before any remediation action |
| 3 | Ignoring Service Limits | Auto-scaling hits the EC2 instance limit; new instances fail to launch silently | Primer: monitor service quotas and request increases proactively |
| 4 | Changing Multiple Things at Once | 503s stop but the team cannot determine which change fixed it; the real fix is unknown | Primer: change one thing at a time and observe the effect |

Damage Report

  • Downtime: 3-6 hours of degraded or unavailable cloud services
  • Data loss: Possible if storage or database resources were affected
  • Customer impact: API errors, failed transactions, or service unavailability for end users
  • Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
  • Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

  • Footgun #1: Had the engineer read the primer's section on checking the wrong region, they would have learned: always verify the region before starting any investigation.
  • Footgun #2: Had they read the section on restarting instead of diagnosing: capture logs, metrics, and state before any remediation action.
  • Footgun #3: Had they read the section on ignoring service limits: monitor service quotas and request increases proactively.
  • Footgun #4: Had they read the section on changing multiple things at once: change one thing at a time and observe the effect.

Cross-References