Postmortem: Prometheus Cardinality Explosion from Debug Labels

ID: PM-018
Date: 2025-07-22
Severity: SEV-3
Duration: 1h 30m (deployment to resolution)
Time to Detect: 20m
Time to Mitigate: 1h 10m
Customer Impact: None — external services functioned normally; monitoring went dark, but no customer-facing functionality depended exclusively on the affected Prometheus instance
Revenue Impact: None
Teams Involved: Backend Engineering (Checkout Squad), Observability Platform, SRE On-Call
Postmortem Author: Fumiko Hayashi
Postmortem Date: 2025-07-25

Executive Summary

On 2025-07-22 at 11:03 UTC, a new version of the checkout-service was deployed to production containing a Prometheus histogram label, request_id, that was intended as a temporary debugging aid. Because request_id is a UUID-per-request value, each HTTP request created a new unique time series, causing the cluster's Prometheus instance to grow from approximately 50,000 active time series to over 5 million within 20 minutes. Prometheus exhausted its 32 GiB memory limit and was OOM-killed by the container runtime. The cluster's entire monitoring stack went dark: no metrics, no alert evaluation, and Alertmanager firing only on stale state. The root cause was identified by correlating the monitoring outage with the checkout-service deployment, the offending label was removed, Prometheus was restarted with a clean TSDB head block, and monitoring was restored by 12:33 UTC.

Timeline (All times UTC)

Time Event
11:03 checkout-service v2.14.1 deployed to production via Argo CD; deployment completes successfully, pods pass health checks
11:04 checkout-service begins emitting http_request_duration_seconds histogram with 4 labels: method, status_code, route, request_id — the last label is new in this version
11:07 Prometheus time series count (metric: prometheus_tsdb_head_series) crosses 200K; still within normal operating range
11:15 Prometheus time series count crosses 2M; Prometheus memory usage rises from 8 GiB to 22 GiB
11:18 Grafana dashboards for checkout-service begin rendering slowly; dashboard load times exceed 15 seconds
11:23 Prometheus OOM-killed by kubelet (container memory limit: 32 GiB exceeded); Prometheus pod enters CrashLoopBackOff
11:24 All Grafana dashboards go blank; panels show "No data" — cached last 5 minutes of data still visible in browser sessions that were open at 11:23
11:25 Alertmanager continues firing alerts based on its last received state, but no new alert evaluations occur
11:27 SRE on-call Dmitri Volkov notices Grafana is blank while reviewing an unrelated dashboard; checks Prometheus pod status
11:28 Dmitri: kubectl get pods -n monitoring — sees prometheus-0 in CrashLoopBackOff; checks pod logs before OOM
11:30 Dmitri pages Observability Platform team; Fumiko Hayashi joins
11:31 Fumiko checks prometheus_tsdb_head_series in the last cached Grafana data point (11:23): value is 5.2M; baseline was 50K
11:33 Fumiko runs kubectl top pods -n monitoring — Prometheus was consuming 31.8 GiB before OOM
11:37 Fumiko queries TSDB meta: kubectl exec prometheus-0 -- promtool tsdb analyze /prometheus — output shows checkout-service http_request_duration_seconds bucket has 4.8M series
11:40 Fumiko checks Argo CD deploy history; checkout-service v2.14.1 deployed at 11:03 is the only change in the window
11:42 Backend Engineering lead Anastasia Reeves is paged; Anastasia pulls v2.14.1 diff — confirms request_id label addition
11:45 Anastasia pushes hotfix v2.14.2: removes request_id label from histogram; retains label on log lines only
11:55 v2.14.2 deployed; checkout-service stops emitting request_id-labeled series
12:01 Prometheus restarted with --storage.tsdb.retention.time=0s flag and clean head block to force eviction of cardinality-bloated data
12:15 Prometheus starts successfully; begins scraping all targets; time series count at 52K (baseline)
12:22 Grafana dashboards populate with fresh data; SRE team verifies alert evaluation is functioning
12:33 Monitoring confirmed fully restored; incident declared resolved
12:45 Postmortem scheduled for 2025-07-25

Impact

Customer Impact

None. The checkout-service itself continued processing requests normally throughout the incident — the cardinality explosion affected Prometheus's ability to store and query metrics, not the application's ability to serve traffic. No customer-facing feature depended exclusively on this Prometheus instance for availability decisions (circuit breakers, canary traffic routing) that would have been impaired by the monitoring outage.

Internal Impact

  • 1 hour 10 minutes of complete monitoring blindness for the production cluster (11:23–12:33 UTC): no metrics queryable, no new alert evaluations
  • During the monitoring blackout, a separate unrelated latency spike in inventory-service went undetected for approximately 40 minutes (it resolved on its own, but was not observed in real time)
  • Fumiko Hayashi (Observability Platform): ~3 hours of incident response and investigation
  • Dmitri Volkov (SRE On-Call): ~2.5 hours
  • Anastasia Reeves (Backend Engineering): ~1.5 hours for hotfix authoring, review, and deploy
  • Estimated 6 hours of aggregate engineering time across teams
  • Observability Platform team's planned cardinality governance tooling sprint was delayed by 1 week as the team used sprint time for incident response and action item scoping

Data Impact

Prometheus TSDB data for the outage window was partially corrupted because the OOM occurred during a write operation, and the affected data was not recoverable from the local TSDB. Prometheus remote_write to the long-term storage backend (Thanos) had a 3-minute lag at the time of the OOM, so data up to 11:20 UTC was persisted in Thanos. Data from 11:20 to 12:15 UTC (55 minutes) is unavailable in long-term storage for all metrics in this cluster.

Root Cause

What Happened (Technical)

Prometheus stores each unique combination of metric name and label set as its own time series. A histogram metric like http_request_duration_seconds with labels {method, status_code, route} might produce a few thousand unique series (e.g., 3 methods × 10 routes × 14 status codes × 14 histogram buckets = 5,880 series). This is high but manageable.

When engineer Callum Nguyen added request_id as a fourth label during a debugging session, he intended it as temporary observability for tracing a specific customer issue. He committed the change without removing it, and it passed code review without the reviewer recognizing the cardinality implication. Each HTTP request to checkout-service generates a unique UUID as its request_id. At 800 requests per second in production, checkout-service was creating 800 new unique time series every second — 48,000 per minute — each with 14 histogram bucket series, for a total of approximately 672,000 new unique series per minute.

Prometheus stores active (head) series in memory. Each series requires approximately 4–8 KB of in-memory state for the head block, so at 5 million series the memory requirement exceeded the 32 GiB limit. Prometheus attempted to garbage-collect old series but could not keep pace with the ingestion rate of novel series. The TSDB head compaction logic is designed for series churn over hours, not hundreds of thousands of new series per minute.
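
The cardinality and memory arithmetic above can be sanity-checked in a few lines. Figures are taken from this report; the per-series memory range is the 4–8 KB approximation used above:

```python
# Sanity-check the cardinality and memory arithmetic from this report.

REQUESTS_PER_SECOND = 800          # checkout-service production traffic
BUCKETS_PER_HISTOGRAM = 14         # bucket series minted per label set
BYTES_PER_SERIES = (4_096, 8_192)  # approx. head-block memory per series (4-8 KB)
MEMORY_LIMIT_GIB = 32

# Every request carries a fresh UUID request_id, so every request mints a
# new label set, and each label set expands into one series per bucket.
new_series_per_minute = REQUESTS_PER_SECOND * 60 * BUCKETS_PER_HISTOGRAM
print(new_series_per_minute)  # 672000

# Head-block memory at 5 million active series.
series = 5_000_000
low, high = (series * b / 2**30 for b in BYTES_PER_SERIES)
print(f"{low:.0f}-{high:.0f} GiB")  # 19-38 GiB, straddling the 32 GiB limit
```

The range straddling the container limit is why the pod sat at 31.8 GiB immediately before the kill.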

When Prometheus was OOM-killed, it was midway through a write to the head block's WAL (write-ahead log). On restart, Prometheus attempted WAL replay, but the head block was partially written, causing the initial restart to fail as well. The clean restart required forcibly discarding the head block (accepting data loss for the in-memory window) to break the crash loop.

Contributing Factors

  1. No metric cardinality alerting or enforcement: Prometheus has no built-in alert for cardinality growth rate or total series count threshold. An alert on prometheus_tsdb_head_series > 500000 or deriv(prometheus_tsdb_head_series[5m]) > 10000 (deriv(), not rate(), since this metric is a gauge) would have fired within minutes of 11:04 UTC, giving the team time to intervene before the OOM.

  2. Code review did not flag the high-cardinality label: The PR adding request_id to the histogram was reviewed by Anastasia Reeves, who approved it. Neither Callum nor Anastasia recognized that adding a per-request unique value as a Prometheus label would create unbounded cardinality. There is no Prometheus label cardinality guidance in the Backend Engineering code review checklist or the team's observability standards document.

  3. No pre-deploy metric validation or cardinality estimation: The team has no tooling that analyzes metric definitions in a PR and estimates expected cardinality based on label value distributions. Such tooling (e.g., mimirtool analyze or a custom cardinality estimator) could have flagged request_id as high-risk before the code reached production.
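
As a sketch of what contributing factor 1 describes, alerting rules on the head series count could look like the following. Rule names, thresholds, and durations here are illustrative, not the team's actual configuration:

```yaml
groups:
  - name: prometheus-meta
    rules:
      - alert: PrometheusHighSeriesCount
        expr: prometheus_tsdb_head_series > 500000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Head series above 500k; possible cardinality explosion"
      - alert: PrometheusSeriesGrowthRate
        expr: deriv(prometheus_tsdb_head_series[5m]) > 10000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Head series growing >10k/s; check recent deploys"
```

Note that these rules still evaluate inside the Prometheus that can die, which is why the action items also call for an external watchdog.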

What We Got Lucky About

  1. Grafana's browser-side data cache retained the last 5 minutes of dashboard data in open browser sessions. This meant that Dmitri, who had a Grafana tab open, could see the final cached value of prometheus_tsdb_head_series at 11:23 (5.2M), which was the single most important data point for quickly identifying the root cause. Without that cached value, the team would have had to reconstruct the cardinality explosion from Thanos data and Prometheus logs, adding 30–60 minutes to the investigation.

  2. The Prometheus pod's OOM logs (captured by the kubelet before the container was killed) included a partial dump of the top series by name, which confirmed http_request_duration_seconds from checkout-service as the dominant contributor. This log artifact was available via kubectl logs prometheus-0 --previous and proved decisive.

  3. The Argo CD deploy history provided a precise timestamp and diff for the only change in the blast window. If checkout-service had been deployed via a manual kubectl command without an audit trail, correlating the monitoring outage with the deployment would have taken significantly longer.

Detection

How We Detected

Detection was accidental. Dmitri noticed Grafana was blank while checking an unrelated dashboard for a routine weekly review. There were no automated alerts for the monitoring outage because the monitoring system itself was the thing that had failed — a self-referential blind spot. Alertmanager continued sending stale alerts, which created a false sense of health. Dmitri's manual observation at 11:27 was the first detection signal.

Why We Didn't Detect Sooner

The monitoring stack had no external health check or watchdog. A simple synthetic check that queries Prometheus's /-/healthy endpoint every 60 seconds from an external system (or a separate lightweight Prometheus instance used only for meta-monitoring) would have detected the OOM at 11:23. Instead, because Alertmanager continued firing stale alerts, there was no silence or anomaly in the pager queue that would have prompted earlier investigation. The system appeared to be alerting normally.
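
A minimal external watchdog of the kind described above might look like the following sketch. The endpoint URL and the paging hook are placeholders; /-/healthy is Prometheus's real health endpoint:

```python
# Minimal meta-monitoring probe: watch the watcher from outside the cluster.
# The paging print() is a placeholder; wire in your real alerting path.
import time
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def watchdog(url: str, interval_s: int = 60, failures_to_page: int = 2) -> None:
    """Page after N consecutive failed probes; runs forever."""
    consecutive_failures = 0
    while True:
        if probe(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_to_page:
                print(f"PAGE: {url} failed {consecutive_failures} probes")

if __name__ == "__main__":
    # Placeholder service address for illustration.
    watchdog("http://prometheus.monitoring.svc:9090/-/healthy")
```

Requiring two consecutive failures before paging filters out transient network blips while still detecting an OOM like this one within roughly two minutes.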

Response

What Went Well

  1. Fumiko's use of promtool tsdb analyze to inspect the TSDB directly (bypassing the broken query layer) was the right call and produced the root cause in under 5 minutes.
  2. The Argo CD deploy history made it straightforward to identify checkout-service v2.14.1 as the candidate change; the hypothesis was confirmed within 3 minutes of identifying the deployment.
  3. Anastasia's hotfix turnaround was fast: from notification to merged and deployed in 13 minutes, which is excellent for a production hotfix that required code change, review, and CI.

What Went Poorly

  1. Detection depended entirely on an engineer happening to look at Grafana. The Prometheus OOM at 11:23 went unnoticed for 4 minutes and was caught only by accident at 11:27. The monitoring system had no watchdog.
  2. Code review did not catch the cardinality-destroying label. This is a knowledge gap that will recur without systemic remediation (checklist, linting, or automated analysis).
  3. The WAL corruption on restart extended the outage by approximately 14 minutes. The team was not familiar with the head-block discard procedure (or flags such as --storage.tsdb.no-lockfile), and Fumiko had to look it up during the incident.

Action Items

AI-018-01 — Add Prometheus meta-monitoring: external synthetic check on /-/healthy every 60s with PagerDuty integration; also alert if prometheus_tsdb_head_series > 1000000. Priority: Critical. Owner: Fumiko Hayashi. Status: Open. Due: 2025-08-01.
AI-018-02 — Add cardinality check to CI: run mimirtool analyze (or equivalent) on any PR modifying metric definitions; fail the PR if any label has an estimated cardinality above 10,000 unique values. Priority: High. Owner: Fumiko Hayashi. Status: Open. Due: 2025-08-08.
AI-018-03 — Add Prometheus metric cardinality guidelines to the Backend Engineering code review checklist: "Labels must not use per-request unique values (request IDs, user IDs, trace IDs)." Priority: High. Owner: Anastasia Reeves. Status: Open. Due: 2025-07-31.
AI-018-04 — Document the Prometheus head-block discard recovery procedure in the SRE runbook, including the specific kubectl commands needed. Priority: Medium. Owner: Dmitri Volkov. Status: Open. Due: 2025-08-05.
AI-018-05 — Evaluate Prometheus ingestion limits (e.g., per-scrape sample_limit and label_limit settings) as a safety net against future cardinality explosions. Priority: Medium. Owner: Fumiko Hayashi. Status: Open. Due: 2025-08-15.
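
A first cut at the CI check in AI-018-02/03 need not be sophisticated: even a denylist of per-request label names catches this incident's exact failure mode. A hypothetical sketch — the denylist, regex, and helper name are assumptions for illustration, not existing tooling:

```python
# Hypothetical CI lint: reject metric label names that imply unbounded
# cardinality. The denylist is an assumption, not an existing standard.
import re

HIGH_CARDINALITY_LABELS = {
    "request_id", "trace_id", "span_id", "user_id", "session_id", "uuid",
}

# Matches quoted label names in prometheus-client style declarations,
# e.g. labelnames=["method", "status_code", "route", "request_id"]
LABEL_PATTERN = re.compile(r'"([a-zA-Z_][a-zA-Z0-9_]*)"')

def risky_labels(source: str) -> list[str]:
    """Return declared label names that appear on the denylist."""
    return [name for name in LABEL_PATTERN.findall(source)
            if name in HIGH_CARDINALITY_LABELS]

diff = 'labelnames=["method", "status_code", "route", "request_id"]'
print(risky_labels(diff))  # ['request_id']
```

A denylist only covers known-bad names; the full action item (estimating cardinality from label value distributions) would catch novel unbounded labels as well.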

Lessons Learned

  1. Prometheus labels are not log fields: Developers accustomed to structured logging (where adding a request_id field to a log line is free) often do not realize that Prometheus labels create entirely new time series per unique value. Every team that writes application metrics instrumentation needs to understand the cardinality model before they touch label sets in production.

  2. The monitoring system needs a monitor: A single Prometheus instance is a single point of observability failure. Meta-monitoring — a lightweight external health check that watches the watcher — is not optional in production environments. When the monitoring system is unhealthy, the pager queue going quiet is not a good sign; it is a blind spot.

  3. Fast hotfix deployment is a critical incident response capability: Anastasia's 13-minute turnaround from page to production deploy was the fastest path to resolution. Teams that cannot ship a one-line code change to production in under 20 minutes under incident conditions will always have longer outages for software-caused incidents. Investing in fast, safe hotfix deployment paths (blue/green, feature flags, canary) directly reduces MTTR.

Cross-References

  • Failure Pattern: Cardinality explosion / unbounded label values; monitoring blind spot (self-referential failure)
  • Topic Packs: Prometheus internals, metric cardinality, observability best practices, TSDB memory management
  • Runbook: runbooks/observability/prometheus-oom-recovery.md
  • Decision Tree: Triage → Grafana blank / "No data" → check kubectl get pods -n monitoring → if Prometheus CrashLoopBackOff, check kubectl logs prometheus-0 --previous → check TSDB head series count from Thanos → identify cardinality source via promtool tsdb analyze → revert offending metric → restart Prometheus with clean head block