Anti-Primer: Prometheus Deep Dive

Everything that can go wrong, will — and in this story, it does.

The Setup

The observability team is setting up Prometheus monitoring for a high-traffic e-commerce platform during the week before Black Friday. They need to instrument 40 microservices and build dashboards for the war room. Time pressure is extreme.

The Timeline

Hour 0: Unbounded Label Cardinality

The engineer adds user_id as a label on a request counter metric. The deadline was looming, and this seemed like the fastest path forward. The result: Prometheus OOMs under 2 million time series, and monitoring goes dark during peak traffic.

Footgun #1: Unbounded Label Cardinality — user_id as a label on a request counter metric; Prometheus OOMs with 2 million time series, and monitoring goes dark during peak traffic.
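The failure mode is easy to reproduce in miniature. A sketch using the Python prometheus_client (metric and label names are illustrative): every distinct label value materializes a separate time series, in the client and again in the Prometheus server.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()

# Anti-pattern: user_id as a label. Every distinct user creates a new series.
requests = Counter("http_requests_total", "HTTP requests", ["user_id"],
                   registry=registry)

for user_id in range(10_000):  # simulate 10k distinct users
    requests.labels(user_id=str(user_id)).inc()

# Count the live series the client now exports for this counter.
series = sum(1 for metric in registry.collect()
             for sample in metric.samples
             if sample.name == "http_requests_total")
print(series)  # → 10000: one series per distinct user_id
```

Swap 10,000 users for a real traffic profile and the 2-million-series blowup follows directly; a bounded label such as status code or endpoint keeps cardinality constant regardless of traffic.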

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Recording Rules

The team builds dashboards with complex PromQL queries that aggregate across all instances at query time. Under time pressure, they chose speed over caution. The result: dashboards take 45 seconds to load, and the Prometheus query engine saturates during incidents.

Footgun #2: No Recording Rules — dashboards run complex PromQL aggregations across all instances at query time; dashboard loads take 45 seconds, and the Prometheus query engine saturates during incidents.
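A recording rule precomputes the heavy aggregation on a schedule so dashboards query the cheap, pre-aggregated result by name. A minimal sketch of a Prometheus rule file (metric names, label names, and the 30s interval are illustrative assumptions):

```yaml
groups:
  - name: dashboard-aggregations
    interval: 30s
    rules:
      # Evaluated every 30s; dashboards query job:http_requests:rate5m
      # instead of re-aggregating raw per-instance series on every load.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```

The `level:metric:operations` naming convention is the Prometheus-documented style for recording rule names, which makes precomputed series easy to spot in dashboards.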

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Scrape Interval Mismatch

The team sets a 5-second scrape interval on all 40 services without calculating the resource impact. Nobody pushed back because the shortcut looked harmless in the moment. The result: Prometheus storage fills in 3 days instead of 30, and retention is silently truncated.

Footgun #3: Scrape Interval Mismatch — a 5-second scrape interval on 40 services with no capacity math; storage fills in 3 days instead of 30, and retention is silently truncated.
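The capacity math the team skipped fits in a few lines. A back-of-envelope sketch; every input below is an illustrative assumption, not a measurement from this incident:

```python
# Back-of-envelope disk sizing for a scrape-interval change.
targets = 40                 # scraped services
series_per_target = 5_000    # active series each target exposes (assumed)
scrape_interval_s = 5        # the new, aggressive interval
bytes_per_sample = 2         # Prometheus TSDB averages roughly 1-2 bytes/sample
retention_days = 30

samples_per_sec = targets * series_per_target / scrape_interval_s
bytes_per_day = samples_per_sec * bytes_per_sample * 86_400
needed_gib = bytes_per_day * retention_days / 2**30

print(f"{samples_per_sec:,.0f} samples/s -> {needed_gib:.0f} GiB "
      f"for {retention_days} days")
```

Under these assumptions, dropping the interval from 30s to 5s multiplies ingestion sixfold; if the disk was provisioned for the old rate, it fills in roughly a sixth of the intended retention window, which matches the 3-days-instead-of-30 outcome.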

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: No Alerting on Alerting

Nobody monitors Prometheus itself or Alertmanager. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: Alertmanager crashes silently, and a real incident produces no alerts for 2 hours.

Footgun #4: No Alerting on Alerting — neither Prometheus nor Alertmanager is monitored; Alertmanager crashes silently, and a real incident produces no alerts for 2 hours.
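Meta-monitoring inverts the usual logic: instead of alerting on failure, you alert constantly and page when the heartbeat stops. A sketch of a Prometheus rule file (group and alert names are illustrative; the external heartbeat receiver is assumed, not specified here):

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Heartbeat: fires constantly by design. An EXTERNAL dead man's switch
      # service pages when this alert STOPS arriving — catching a dead
      # Prometheus, a dead Alertmanager, or a broken delivery path.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Alerting-pipeline heartbeat; its absence means alerting is broken

      # Catch any scrape target that has gone dark — including Prometheus
      # and Alertmanager themselves, if they are configured as scrape targets.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```

The dead man's switch must live outside the monitored stack (a separate service or an external provider); a heartbeat received by the same Alertmanager that just crashed proves nothing.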

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Unbounded Label Cardinality | Prometheus OOMs with 2 million time series; monitoring goes dark during peak traffic | Primer: Never use high-cardinality values as labels |
| 2 | No Recording Rules | Dashboard load takes 45 seconds; Prometheus query engine is saturated during incidents | Primer: Recording rules for frequently used aggregations |
| 3 | Scrape Interval Mismatch | Prometheus storage fills in 3 days instead of 30; retention is silently truncated | Primer: Calculate storage requirements before changing scrape intervals |
| 4 | No Alerting on Alerting | Alertmanager crashes silently; a real incident produces no alerts for 2 hours | Primer: Meta-monitoring and dead man's switch alerts |

Damage Report

  • Downtime: Monitoring blind spot lasting 2-12 hours
  • Data loss: None directly, but undetected incidents may cause downstream data loss
  • Customer impact: If a real incident occurs during the blind spot, customer impact goes undetected
  • Engineering time to remediate: 8-16 engineer-hours to restore monitoring and backfill gaps
  • Reputation cost: On-call team loses confidence in alerting; incident response times degrade

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on unbounded label cardinality, they would have learned to never use high-cardinality values as labels.
  • Footgun #2: If the engineer had read the primer's section on missing recording rules, they would have learned to create recording rules for frequently used aggregations.
  • Footgun #3: If the engineer had read the primer's section on scrape interval mismatch, they would have learned to calculate storage requirements before changing scrape intervals.
  • Footgun #4: If the engineer had read the primer's section on alerting on alerting, they would have learned to set up meta-monitoring and dead man's switch alerts.
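The cardinality lesson is also checkable after the fact. One common diagnostic (illustrative; note this query is itself expensive on a large server, so run it sparingly, not on a dashboard) lists the metrics with the most series:

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```

A request counter carrying user_id would dominate this list long before the server OOMs.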

Cross-References