Anti-Primer: Prometheus Deep Dive

Everything that can go wrong, will — and in this story, it does.

The Setup

The observability team is setting up Prometheus monitoring for a high-traffic e-commerce platform during the week before Black Friday. They need to instrument 40 microservices and build dashboards for the war room. Time pressure is extreme.

The Timeline

Hour 0: Unbounded Label Cardinality

The engineer adds user_id as a label on a request counter metric. The deadline was looming, and this seemed like the fastest path forward. The result: Prometheus OOMs under 2 million time series, and monitoring goes dark during peak traffic.

Footgun #1: Unbounded Label Cardinality — user_id as a label on a request counter metric; Prometheus OOMs with 2 million time series, and monitoring goes dark during peak traffic.
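The failure mode is easy to reproduce in miniature. A sketch using the Python prometheus_client (metric and label names are illustrative): every distinct label value materializes a separate time series, in the client and again in the Prometheus server.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()

# Anti-pattern: user_id as a label. Every distinct user creates a new series.
requests = Counter("http_requests_total", "HTTP requests", ["user_id"],
                   registry=registry)

for user_id in range(10_000):  # simulate 10k distinct users
    requests.labels(user_id=str(user_id)).inc()

# Count the live series the client now exports for this counter.
series = sum(1 for metric in registry.collect()
             for sample in metric.samples
             if sample.name == "http_requests_total")
print(series)  # → 10000: one series per distinct user_id
```

Swap 10,000 users for a real traffic profile and the 2-million-series blowup follows directly; a bounded label such as status code or endpoint keeps cardinality constant regardless of traffic.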

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Recording Rules

The team builds dashboards with complex PromQL queries that aggregate across all instances at query time. Under time pressure, they chose speed over caution. The result: dashboards take 45 seconds to load, and the Prometheus query engine saturates during incidents.

Footgun #2: No Recording Rules — dashboards run complex PromQL aggregations across all instances at query time; dashboard loads take 45 seconds, and the Prometheus query engine saturates during incidents.
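A recording rule precomputes the heavy aggregation on a schedule so dashboards query the cheap, pre-aggregated result by name. A minimal sketch of a Prometheus rule file (metric names, label names, and the 30s interval are illustrative assumptions):

```yaml
groups:
  - name: dashboard-aggregations
    interval: 30s
    rules:
      # Evaluated every 30s; dashboards query job:http_requests:rate5m
      # instead of re-aggregating raw per-instance series on every load.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```

The `level:metric:operations` naming convention is the Prometheus-documented style for recording rule names, which makes precomputed series easy to spot in dashboards.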

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Scrape Interval Mismatch

The team sets a 5-second scrape interval on all 40 services without calculating the resource impact. Nobody pushed back because the shortcut looked harmless in the moment. The result: Prometheus storage fills in 3 days instead of 30, and retention is silently truncated.

Footgun #3: Scrape Interval Mismatch — a 5-second scrape interval on 40 services with no capacity math; storage fills in 3 days instead of 30, and retention is silently truncated.
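The capacity math the team skipped fits in a few lines. A back-of-envelope sketch; every input below is an illustrative assumption, not a measurement from this incident:

```python
# Back-of-envelope disk sizing for a scrape-interval change.
targets = 40                 # scraped services
series_per_target = 5_000    # active series each target exposes (assumed)
scrape_interval_s = 5        # the new, aggressive interval
bytes_per_sample = 2         # Prometheus TSDB averages roughly 1-2 bytes/sample
retention_days = 30

samples_per_sec = targets * series_per_target / scrape_interval_s
bytes_per_day = samples_per_sec * bytes_per_sample * 86_400
needed_gib = bytes_per_day * retention_days / 2**30

print(f"{samples_per_sec:,.0f} samples/s -> {needed_gib:.0f} GiB "
      f"for {retention_days} days")
```

Under these assumptions, dropping the interval from 30s to 5s multiplies ingestion sixfold; if the disk was provisioned for the old rate, it fills in roughly a sixth of the intended retention window, which matches the 3-days-instead-of-30 outcome.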

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: No Alerting on Alerting

Nobody monitors Prometheus itself or Alertmanager. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: Alertmanager crashes silently, and a real incident produces no alerts for 2 hours.

Footgun #4: No Alerting on Alerting — neither Prometheus nor Alertmanager is monitored; Alertmanager crashes silently, and a real incident produces no alerts for 2 hours.
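Meta-monitoring inverts the usual logic: instead of alerting on failure, you alert constantly and page when the heartbeat stops. A sketch of a Prometheus rule file (group and alert names are illustrative; the external heartbeat receiver is assumed, not specified here):

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Heartbeat: fires constantly by design. An EXTERNAL dead man's switch
      # service pages when this alert STOPS arriving — catching a dead
      # Prometheus, a dead Alertmanager, or a broken delivery path.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Alerting-pipeline heartbeat; its absence means alerting is broken

      # Catch any scrape target that has gone dark — including Prometheus
      # and Alertmanager themselves, if they are configured as scrape targets.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```

The dead man's switch must live outside the monitored stack (a separate service or an external provider); a heartbeat received by the same Alertmanager that just crashed proves nothing.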

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Unbounded Label Cardinality | Prometheus OOMs with 2 million time series; monitoring goes dark during peak traffic | Primer: Never use high-cardinality values as labels |
| 2 | No Recording Rules | Dashboard load takes 45 seconds; Prometheus query engine is saturated during incidents | Primer: Recording rules for frequently used aggregations |
| 3 | Scrape Interval Mismatch | Prometheus storage fills in 3 days instead of 30; retention is silently truncated | Primer: Calculate storage requirements before changing scrape intervals |
| 4 | No Alerting on Alerting | Alertmanager crashes silently; a real incident produces no alerts for 2 hours | Primer: Meta-monitoring and dead man's switch alerts |

Damage Report

  • Downtime: Monitoring blind spot lasting 2-12 hours
  • Data loss: None directly, but undetected incidents may cause downstream data loss
  • Customer impact: If a real incident occurs during the blind spot, customer impact goes undetected
  • Engineering time to remediate: 8-16 engineer-hours to restore monitoring and backfill gaps
  • Reputation cost: On-call team loses confidence in alerting; incident response times degrade

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on unbounded label cardinality, they would have learned to never use high-cardinality values as labels.
  • Footgun #2: If the engineer had read the primer's section on missing recording rules, they would have learned to create recording rules for frequently used aggregations.
  • Footgun #3: If the engineer had read the primer's section on scrape interval mismatch, they would have learned to calculate storage requirements before changing scrape intervals.
  • Footgun #4: If the engineer had read the primer's section on alerting on alerting, they would have learned to set up meta-monitoring and dead man's switch alerts.
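The cardinality lesson is also checkable after the fact. One common diagnostic (illustrative; note this query is itself expensive on a large server, so run it sparingly, not on a dashboard) lists the metrics with the most series:

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```

A request counter carrying user_id would dominate this list long before the server OOMs.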

Cross-References