Anti-Primer: AWS CloudWatch

Everything that can go wrong, will — and in this story, it does.

The Setup

An SRE team is setting up CloudWatch monitoring for a new production workload the week before launch. They copy alarm configurations from a dev account and adjust thresholds on the fly, skipping documentation.

The Timeline

Hour 0: Wrong Metric Namespace

The engineer copies alarm definitions from dev but forgets that the metric namespace differs in prod. The deadline was looming, and this seemed like the fastest path forward. As a result, the alarms point at nonexistent metrics, and no alerts fire during the first real incident.

Footgun #1: Wrong Metric Namespace. Copying alarm definitions from dev without updating the metric namespace for prod leaves the alarms pointing at nonexistent metrics, so no alerts fire during the first real incident.
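One way to catch this before launch is to compare each alarm's namespace and metric name against the metrics that actually exist in the target account. The sketch below is a minimal illustration, not a definitive implementation. It assumes alarm records shaped like the `MetricAlarms` entries returned by boto3's `describe_alarms` and metric records shaped like the `Metrics` entries returned by `list_metrics`. The helper name `find_orphaned_alarms` and the sample namespaces are hypothetical.

```python
def find_orphaned_alarms(alarms: list[dict], metrics: list[dict]) -> list[str]:
    """Return names of alarms whose (Namespace, MetricName) pair matches no known metric."""
    known = {(m["Namespace"], m["MetricName"]) for m in metrics}
    return [
        a["AlarmName"]
        for a in alarms
        if (a.get("Namespace"), a.get("MetricName")) not in known
    ]


# Hypothetical sample data. In a real check these lists might come from
# boto3.client("cloudwatch").describe_alarms()["MetricAlarms"] and
# boto3.client("cloudwatch").list_metrics()["Metrics"], both of which paginate.
prod_metrics = [
    {"Namespace": "MyApp/Prod", "MetricName": "ErrorCount", "Dimensions": []},
]
alarms = [
    {"AlarmName": "errors-high", "Namespace": "MyApp/Dev", "MetricName": "ErrorCount"},
    {"AlarmName": "errors-high-prod", "Namespace": "MyApp/Prod", "MetricName": "ErrorCount"},
]

orphaned = find_orphaned_alarms(alarms, prod_metrics)
print(orphaned)
```

Alarms built on metric math carry a `Metrics` list instead of a top-level namespace, so this sketch would flag them too; they would need separate handling.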

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Missing Dimensions

The team creates alarms without specifying instance or function dimensions. Under time pressure, they chose speed over caution. As a result, each alarm aggregates across all instances, masking per-instance failures.

Footgun #2: Missing Dimensions. Creating alarms without specifying instance or function dimensions makes each alarm aggregate across all instances, which masks per-instance failures.
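As a rough illustration of scoping an alarm to one resource, the sketch below builds keyword arguments for boto3's `put_metric_alarm` with an explicit `InstanceId` dimension, and flags existing alarm records that carry no dimensions at all. The helper names, alarm naming pattern, threshold, period, instance ID, and topic ARN are all hypothetical placeholders, not values from the incident.

```python
def build_instance_alarm(instance_id: str, topic_arn: str) -> dict:
    """Build put_metric_alarm keyword arguments scoped to a single EC2 instance."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        # The explicit dimension is what ties the alarm to one instance.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds; assumed value
        "EvaluationPeriods": 2,    # assumed value
        "Threshold": 80.0,         # percent; assumed value
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def alarms_missing_dimensions(alarms: list[dict]) -> list[str]:
    """Return names of alarms with an absent or empty Dimensions list."""
    return [a["AlarmName"] for a in alarms if not a.get("Dimensions")]


# Hypothetical identifiers for illustration only.
PROD_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:prod-oncall"
params = build_instance_alarm("i-0123456789abcdef0", PROD_TOPIC_ARN)

existing = [
    {"AlarmName": "cpu-high-all", "Dimensions": []},
    {"AlarmName": params["AlarmName"], "Dimensions": params["Dimensions"]},
]
unscoped = alarms_missing_dimensions(existing)
print(unscoped)
```

With boto3 installed and credentials configured, `params` could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`, once per instance to be monitored.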

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Log Group Retention Not Set

The team creates log groups without setting retention, so they default to never expiring. Nobody pushed back because the shortcut looked harmless in the moment. As a result, log storage costs grow 400% in 3 months, and the budget alert fires too late.

Footgun #3: Log Group Retention Not Set. Creating log groups without setting retention leaves them on the default of never expiring, so log storage costs grow 400% in 3 months and the budget alert fires too late.
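A simple audit can surface log groups that were created without a retention policy. The sketch below assumes records shaped like the `logGroups` entries returned by boto3's `describe_log_groups`, where the `retentionInDays` key is absent when events never expire. The helper names, log group names, and the 30-day default are hypothetical choices for illustration.

```python
def groups_without_retention(log_groups: list[dict]) -> list[str]:
    """Return names of log groups that have no retentionInDays set."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def retention_requests(log_groups: list[dict], days: int = 30) -> list[dict]:
    """Build put_retention_policy keyword arguments for every unbounded log group."""
    return [
        {"logGroupName": name, "retentionInDays": days}
        for name in groups_without_retention(log_groups)
    ]


# Hypothetical sample data; a real run might read
# boto3.client("logs").describe_log_groups()["logGroups"] page by page.
log_groups = [
    {"logGroupName": "/aws/lambda/checkout", "storedBytes": 1024},
    {"logGroupName": "/aws/lambda/search", "retentionInDays": 14, "storedBytes": 2048},
]

pending = retention_requests(log_groups, days=30)
print(pending)
```

Each entry in `pending` could then be applied with `boto3.client("logs").put_retention_policy(**entry)`. CloudWatch Logs accepts only specific retention values, so the chosen number of days should be checked against the API documentation.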

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Alarm Action Points to Wrong SNS Topic

An alarm triggers, but its SNS topic routes to a dev Slack channel, not the on-call pager. The team had gotten away with similar shortcuts before, so nobody raised a flag. As a result, a critical production alert sits unread in a dev channel for 4 hours.

Footgun #4: Alarm Action Points to Wrong SNS Topic. The alarm triggers, but its SNS topic routes to a dev Slack channel instead of the on-call pager, so a critical production alert sits unread in a dev channel for 4 hours.
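Alarm actions can be validated against an allowlist of production notification targets before launch. The sketch below is one possible shape for that check, assuming alarm records shaped like boto3 `describe_alarms` output with an `AlarmActions` list of ARNs. The helper name, account IDs, and topic names are hypothetical.

```python
def alarms_with_unexpected_actions(
    alarms: list[dict], allowed_arns: set[str]
) -> dict[str, list[str]]:
    """Map each problem alarm to its disallowed action ARNs.

    An alarm with no actions at all is reported with an empty list,
    since it would notify nobody.
    """
    findings: dict[str, list[str]] = {}
    for alarm in alarms:
        actions = alarm.get("AlarmActions", [])
        unexpected = [arn for arn in actions if arn not in allowed_arns]
        if unexpected or not actions:
            findings[alarm["AlarmName"]] = unexpected
    return findings


# Hypothetical ARNs for illustration only.
ALLOWED = {"arn:aws:sns:us-east-1:222222222222:prod-oncall-pager"}
alarms = [
    {"AlarmName": "api-5xx",
     "AlarmActions": ["arn:aws:sns:us-east-1:111111111111:dev-slack"]},
    {"AlarmName": "db-cpu",
     "AlarmActions": ["arn:aws:sns:us-east-1:222222222222:prod-oncall-pager"]},
    {"AlarmName": "queue-depth", "AlarmActions": []},
]

findings = alarms_with_unexpected_actions(alarms, ALLOWED)
print(findings)
```

A check like this confirms only that the ARN is the expected one. Whether that topic's subscriptions actually reach the on-call pager still needs an end-to-end test notification.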

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Wrong Metric Namespace | Alarms point at nonexistent metrics; no alerts fire during the first real incident | Primer: Verify metric namespaces per account and region |
| 2 | Missing Dimensions | Alarm aggregates across all instances, masking per-instance failures | Primer: Always specify the exact dimensions for the resource being monitored |
| 3 | Log Group Retention Not Set | Log storage costs grow 400% in 3 months; budget alert fires too late | Primer: Set retention policy on every log group at creation time |
| 4 | Alarm Action Points to Wrong SNS Topic | Critical production alert sits unread in a dev channel for 4 hours | Primer: Validate alarm actions point to production notification channels |

Damage Report

  • Downtime: 3-6 hours of degraded or unavailable cloud services
  • Data loss: Possible if storage or database resources were affected
  • Customer impact: API errors, failed transactions, or service unavailability for end users
  • Engineering time to remediate: 12-24 engineer-hours across incident response, root cause analysis, and remediation
  • Reputation cost: Internal trust erosion; potential AWS billing surprises; customer-facing impact report required

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on the wrong metric namespace, they would have learned: Verify metric namespaces per account and region.
  • Footgun #2: If the engineer had read the primer's section on missing dimensions, they would have learned: Always specify the exact dimensions for the resource being monitored.
  • Footgun #3: If the engineer had read the primer's section on unset log group retention, they would have learned: Set retention policy on every log group at creation time.
  • Footgun #4: If the engineer had read the primer's section on alarm actions pointing to the wrong SNS topic, they would have learned: Validate alarm actions point to production notification channels.

Cross-References