Anti-Primer: AWS CloudWatch¶
Everything that can go wrong, will — and in this story, it does.
The Setup¶
An SRE team is setting up CloudWatch monitoring for a new production workload the week before launch. They copy alarm configurations from a dev account and adjust thresholds on the fly, skipping documentation.
The Timeline¶
Hour 0: Wrong Metric Namespace¶
The engineer copies alarm definitions from dev but forgets that the metric namespace differs in prod. The deadline was looming, and this seemed like the fastest path forward. The result: alarms point at nonexistent metrics, and no alerts fire during the first real incident.
Footgun #1: Wrong Metric Namespace — copying alarm definitions from dev while forgetting that the metric namespace differs in prod, leaving alarms pointed at nonexistent metrics so that no alerts fire during the first real incident.
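A pre-launch audit could have caught this. A minimal sketch in Python (all names below are illustrative, not from the incident; in practice the alarm list would come from boto3's `describe_alarms` and the namespace set from `list_metrics` in the prod account):

```python
def find_orphaned_alarms(alarms, known_namespaces):
    """Return names of alarms whose metric namespace has no metrics
    in the target account — these alarms can never fire."""
    return [a["AlarmName"] for a in alarms
            if a["Namespace"] not in known_namespaces]

# Alarm definitions copied from dev still carry the dev namespace.
copied_alarms = [
    {"AlarmName": "api-high-errors", "Namespace": "DevApp/API"},
    {"AlarmName": "queue-depth", "Namespace": "Prod/Queue"},
]

# In practice, build this set from cloudwatch.list_metrics() in prod.
prod_namespaces = {"Prod/API", "Prod/Queue"}

print(find_orphaned_alarms(copied_alarms, prod_namespaces))
# → ['api-high-errors']
```

Running a check like this in CI on every alarm change turns a silent monitoring gap into a failing build.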
Nobody notices yet. The engineer moves on to the next task.
Hour 1: Missing Dimensions¶
The team creates alarms without specifying instance or function dimensions. Under time pressure, they chose speed over caution. The result: each alarm aggregates across all instances, masking per-instance failures.
Footgun #2: Missing Dimensions — creating alarms without specifying instance or function dimensions, so that alarms aggregate across all instances and mask per-instance failures.
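This one is also mechanically checkable. A sketch (the required-dimension mapping and alarm data are illustrative assumptions; a real check would pull alarm definitions via boto3 and would need a carve-out for intentionally account-wide alarms):

```python
def missing_dimension_alarms(alarms, required):
    """Flag alarms that omit the dimensions needed to scope them
    to a single resource (e.g. one instance or one function)."""
    flagged = []
    for alarm in alarms:
        have = {d["Name"] for d in alarm.get("Dimensions", [])}
        need = required.get(alarm["Namespace"], set())
        if need - have:  # some required dimension is absent
            flagged.append(alarm["AlarmName"])
    return flagged

alarms = [
    # No InstanceId: this aggregates CPU across every instance.
    {"AlarmName": "cpu-high", "Namespace": "AWS/EC2", "Dimensions": []},
    {"AlarmName": "fn-errors", "Namespace": "AWS/Lambda",
     "Dimensions": [{"Name": "FunctionName", "Value": "checkout"}]},
]
required = {"AWS/EC2": {"InstanceId"}, "AWS/Lambda": {"FunctionName"}}

print(missing_dimension_alarms(alarms, required))
# → ['cpu-high']
```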
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Log Group Retention Not Set¶
The team creates log groups without setting retention, so they default to never expire. Nobody pushed back because the shortcut looked harmless in the moment. The result: log storage costs grow 400% in three months, and the budget alert fires too late.
Footgun #3: Log Group Retention Not Set — creating log groups without a retention policy (the default is never expire), leading to log storage costs growing 400% in three months; the budget alert fires too late.
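CloudWatch Logs signals "never expire" by omitting `retentionInDays` from a log group's description, which makes unbounded groups easy to sweep for. A sketch (the group data is illustrative; in practice it would come from boto3's `logs.describe_log_groups()`, and the fix is a `logs.put_retention_policy(...)` call per offending group):

```python
def unbounded_log_groups(groups):
    """Return log groups with no retention policy. CloudWatch omits
    the retentionInDays key entirely when retention is 'never expire'."""
    return [g["logGroupName"] for g in groups
            if "retentionInDays" not in g]

# Shaped like entries from logs.describe_log_groups()["logGroups"].
groups = [
    {"logGroupName": "/aws/lambda/checkout", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/payments"},  # never expires
]

for name in unbounded_log_groups(groups):
    # Remediation sketch: logs.put_retention_policy(
    #     logGroupName=name, retentionInDays=30)
    print(f"no retention set: {name}")
```

Better still, set retention in the same infrastructure-as-code change that creates the group, so the sweep finds nothing.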
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Alarm Action Points to Wrong SNS Topic¶
An alarm triggers, but its SNS topic routes to a dev Slack channel rather than the on-call pager. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: a critical production alert sits unread in a dev channel for four hours.
Footgun #4: Alarm Action Points to Wrong SNS Topic — the alarm triggers, but the SNS topic routes to a dev Slack channel instead of the on-call pager, leaving a critical production alert unread in a dev channel for 4 hours.
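Because alarm actions are plain SNS topic ARNs, routing can be verified against an allowlist before launch. A sketch (ARNs and alarm names are made up; a real audit would read `AlarmActions` from boto3's `describe_alarms` and the allowlist from the team's infrastructure code):

```python
def misrouted_alarms(alarms, prod_topics):
    """Flag alarms whose actions are missing or point at any SNS
    topic outside the approved production allowlist."""
    flagged = []
    for alarm in alarms:
        actions = alarm.get("AlarmActions", [])
        if not actions or any(arn not in prod_topics for arn in actions):
            flagged.append(alarm["AlarmName"])
    return flagged

alarms = [
    {"AlarmName": "api-5xx",
     "AlarmActions": ["arn:aws:sns:us-east-1:111111111111:dev-slack"]},
    {"AlarmName": "db-cpu",
     "AlarmActions": ["arn:aws:sns:us-east-1:111111111111:prod-pager"]},
]
prod_topics = {"arn:aws:sns:us-east-1:111111111111:prod-pager"}

print(misrouted_alarms(alarms, prod_topics))
# → ['api-5xx']
```

Treating alarms with no actions at all as failures (as above) also catches the quieter variant of this footgun: an alarm that fires into the void.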
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem¶
Root Cause Chain¶
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Wrong Metric Namespace | Alarms point at nonexistent metrics; no alerts fire during the first real incident | Primer: Verify metric namespaces per account and region |
| 2 | Missing Dimensions | Alarms aggregate across all instances, masking per-instance failures | Primer: Always specify the exact dimensions for the resource being monitored |
| 3 | Log Group Retention Not Set | Log storage costs grow 400% in 3 months; budget alert fires too late | Primer: Set retention policy on every log group at creation time |
| 4 | Alarm Action Points to Wrong SNS Topic | Critical production alert sits unread in a dev channel for 4 hours | Primer: Validate alarm actions point to production notification channels |
Damage Report¶
- Downtime: 3-6 hours of degraded or unavailable cloud services
- Data loss: Possible if storage or database resources were affected
- Customer impact: API errors, failed transactions, or service unavailability for end users
- Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
- Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required
What the Primer Teaches¶
- Footgun #1: Had the engineer read the primer's section on wrong metric namespaces, they would have learned to verify metric namespaces per account and region.
- Footgun #2: Had the engineer read the primer's section on missing dimensions, they would have learned to always specify the exact dimensions for the resource being monitored.
- Footgun #3: Had the engineer read the primer's section on log group retention, they would have learned to set a retention policy on every log group at creation time.
- Footgun #4: Had the engineer read the primer's section on misrouted alarm actions, they would have learned to validate that alarm actions point to production notification channels.
Cross-References¶
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice