Postmortem: Prometheus Cardinality Explosion from Debug Labels¶
| Field | Value |
|---|---|
| ID | PM-018 |
| Date | 2025-07-22 |
| Severity | SEV-3 |
| Duration | 1h 30m (deploy to resolution) |
| Time to Detect | 20m |
| Time to Mitigate | 1h 10m |
| Customer Impact | None. External services functioned normally; monitoring went dark, but no customer-facing functionality depended exclusively on the affected Prometheus instance |
| Revenue Impact | None |
| Teams Involved | Backend Engineering (Checkout Squad), Observability Platform, SRE On-Call |
| Postmortem Author | Fumiko Hayashi |
| Postmortem Date | 2025-07-25 |
Executive Summary¶
On 2025-07-22 at 11:03 UTC, a new version of the checkout-service was deployed to production containing a Prometheus histogram label, request_id, that was intended as a temporary debugging aid. Because request_id is a UUID-per-request value, each HTTP request created a new unique time series, causing the cluster's Prometheus instance to grow from approximately 50,000 active time series to over 5 million within 20 minutes. Prometheus exhausted its 32 GiB memory limit and was OOM-killed by the container runtime. The cluster's entire monitoring stack went dark: no metrics, no alert evaluation, and Alertmanager firing only on stale state. The root cause was identified by correlating the monitoring outage with the checkout-service deployment, the offending label was removed, Prometheus was restarted with a clean TSDB head block, and monitoring was restored by 12:33 UTC.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 11:03 | checkout-service v2.14.1 deployed to production via Argo CD; deployment completes successfully, pods pass health checks |
| 11:04 | checkout-service begins emitting http_request_duration_seconds histogram with 4 labels: method, status_code, route, request_id — the last label is new in this version |
| 11:07 | Prometheus time series count (metric: prometheus_tsdb_head_series) crosses 200K; still within normal operating range |
| 11:15 | Prometheus time series count crosses 2M; Prometheus memory usage rises from 8 GiB to 22 GiB |
| 11:18 | Grafana dashboards for checkout-service begin rendering slowly; dashboard load times exceed 15 seconds |
| 11:23 | Prometheus OOM-killed by kubelet (container memory limit: 32 GiB exceeded); Prometheus pod enters CrashLoopBackOff |
| 11:24 | All Grafana dashboards go blank; panels show "No data" — cached last 5 minutes of data still visible in browser sessions that were open at 11:23 |
| 11:25 | Alertmanager continues firing alerts based on its last received state, but no new alert evaluations occur |
| 11:27 | SRE on-call Dmitri Volkov notices Grafana is blank while reviewing an unrelated dashboard; checks Prometheus pod status |
| 11:28 | Dmitri: kubectl get pods -n monitoring — sees prometheus-0 in CrashLoopBackOff; checks pod logs before OOM |
| 11:30 | Dmitri pages Observability Platform team; Fumiko Hayashi joins |
| 11:31 | Fumiko checks prometheus_tsdb_head_series in the last cached Grafana data point (11:23): value is 5.2M; baseline was 50K |
| 11:33 | Fumiko runs kubectl top pods -n monitoring — Prometheus was consuming 31.8 GiB before OOM |
| 11:37 | Fumiko queries TSDB meta: kubectl exec prometheus-0 -- promtool tsdb analyze /prometheus — output shows checkout-service http_request_duration_seconds bucket has 4.8M series |
| 11:40 | Fumiko checks Argo CD deploy history; checkout-service v2.14.1 deployed at 11:03 is the only change in the window |
| 11:42 | Backend Engineering lead Anastasia Reeves is paged; Anastasia pulls v2.14.1 diff — confirms request_id label addition |
| 11:45 | Anastasia pushes hotfix v2.14.2: removes request_id label from histogram; retains label on log lines only |
| 11:55 | v2.14.2 deployed; checkout-service stops emitting request_id-labeled series |
| 12:01 | Prometheus restarted with --storage.tsdb.retention.time=0s flag and clean head block to force eviction of cardinality-bloated data |
| 12:15 | Prometheus starts successfully; begins scraping all targets; time series count at 52K (baseline) |
| 12:22 | Grafana dashboards populate with fresh data; SRE team verifies alert evaluation is functioning |
| 12:33 | Monitoring confirmed fully restored; incident declared resolved |
| 12:45 | Postmortem scheduled for 2025-07-25 |
Impact¶
Customer Impact¶
None. The checkout-service itself continued processing requests normally throughout the incident — the cardinality explosion affected Prometheus's ability to store and query metrics, not the application's ability to serve traffic. No customer-facing feature depended exclusively on this Prometheus instance for availability decisions (circuit breakers, canary traffic routing) that would have been impaired by the monitoring outage.
Internal Impact¶
- 1 hour 9 minutes of complete monitoring blindness for the production cluster: no metrics queryable, no new alert evaluations
- During the monitoring blackout, a separate unrelated latency spike in `inventory-service` went undetected for approximately 40 minutes (it resolved on its own, but was not observed in real time)
- Fumiko Hayashi (Observability Platform): ~3 hours of incident response and investigation
- Dmitri Volkov (SRE On-Call): ~2.5 hours
- Anastasia Reeves (Backend Engineering): ~1.5 hours for hotfix authoring, review, and deploy
- Estimated 7 hours of aggregate engineering time across teams
- Observability Platform team's planned cardinality governance tooling sprint was delayed by 1 week as the team used sprint time for incident response and action item scoping
Data Impact¶
Prometheus TSDB data for the 1-hour 9-minute window was partially corrupted due to the OOM during a write operation. The affected window's data was not recoverable from the local TSDB. Prometheus remote_write to the long-term storage backend (Thanos) had a 3-minute lag at the time of OOM, so data up to 11:20 UTC was persisted in Thanos. Data from 11:20 to 12:15 UTC is unavailable in long-term storage for all metrics in this cluster.
Root Cause¶
What Happened (Technical)¶
Prometheus stores time series as unique combinations of metric name and label set. A histogram metric like `http_request_duration_seconds` with labels `{method, status_code, route}` might produce a few thousand unique series (e.g., 3 methods × 10 routes × 14 status codes × 14 histogram buckets ≈ 5,880 series). This is high but manageable.
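The bounded-label arithmetic above can be sketched in a few lines (the counts are this postmortem's illustrative figures, not measured values):

```python
from math import prod

def estimate_series(*label_value_counts: int) -> int:
    """Unique time series = product of the value counts of every label
    dimension; the histogram's buckets act as one more multiplier."""
    return prod(label_value_counts)

# 3 methods x 10 routes x 14 status codes x 14 histogram buckets
print(estimate_series(3, 10, 14, 14))  # 5880
```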
When engineer Callum Nguyen added request_id as a fourth label during a debugging session, he intended it as temporary observability for tracing a specific customer issue. He committed the change without removing it, and it passed code review without the reviewer recognizing the cardinality implication. Each HTTP request to checkout-service generates a unique UUID as its request_id. At 800 requests per second in production, checkout-service was creating 800 new unique time series every second — 48,000 per minute — each with 14 histogram bucket series, for a total of approximately 672,000 new unique series per minute.
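The growth rate of an unbounded label follows directly from the same model; a minimal sketch using the figures above:

```python
def new_series_per_minute(requests_per_second: int, buckets: int) -> int:
    """When a label takes a fresh value on every request, each request
    mints a brand-new series per histogram bucket."""
    return requests_per_second * 60 * buckets

# 800 req/s, 14 buckets, as in this incident
print(new_series_per_minute(800, 14))  # 672000 new unique series per minute
```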
Prometheus stores active (head) series in memory. Each series requires approximately 4–8 KiB of in-memory state for the head block, so at 5 million series the memory requirement grew past the 32 GiB limit. Prometheus attempted to garbage-collect old series but could not keep pace with the ingestion rate of novel series. The TSDB head compaction logic is designed for series churn over hours, not the roughly 11,000 new series per second (800 requests × 14 buckets) seen here.
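A back-of-envelope memory check, assuming the 4–8 KiB-per-series figure above (a working approximation, not a Prometheus guarantee):

```python
def head_memory_gib(active_series: int, bytes_per_series: int) -> float:
    """Rough head-block RSS estimate: series count times per-series state."""
    return active_series * bytes_per_series / 2**30

low = head_memory_gib(5_000_000, 4 * 1024)   # optimistic end
high = head_memory_gib(5_000_000, 8 * 1024)  # pessimistic end
print(f"{low:.1f}-{high:.1f} GiB")  # 19.1-38.1 GiB against a 32 GiB limit
```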
When Prometheus was OOM-killed, it was midway through a write to the head block's WAL (write-ahead log). On restart, Prometheus attempted WAL replay, but the head block was partially written, causing the initial restart to fail as well. The clean restart required forcibly discarding the head block (accepting data loss for the in-memory window) to break the crash loop.
Contributing Factors¶
- **No metric cardinality alerting or enforcement:** Prometheus has no built-in alert for cardinality growth rate or total series count threshold. An alert on `prometheus_tsdb_head_series > 500000` or `rate(prometheus_tsdb_head_series[5m]) > 10000` would have fired within minutes of 11:04 UTC, giving the team time to intervene before the OOM.
- **Code review did not flag the high-cardinality label:** The PR adding `request_id` to the histogram was reviewed by Anastasia Reeves, who approved it. Neither Callum nor Anastasia recognized that adding a per-request unique value as a Prometheus label would create unbounded cardinality. There is no Prometheus label cardinality guidance in the Backend Engineering code review checklist or the team's observability standards document.
- **No pre-deploy metric validation or cardinality estimation:** The team has no tooling that analyzes metric definitions in a PR and estimates expected cardinality based on label value distributions. Such tooling (e.g., `mimirtool analyze` or a custom cardinality estimator) could have flagged `request_id` as high-risk before the code reached production.
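One shape such a pre-merge check could take is sketched below. The banned-label list and the `labelnames=` pattern it scans for are illustrative assumptions, not the tooling the team will necessarily build:

```python
import re

# Label names that imply per-request/per-entity uniqueness.
# This list is an assumption for illustration, not an official rule set.
BANNED_LABELS = {"request_id", "trace_id", "span_id", "user_id", "session_id"}

# Hypothetical pattern: matches labelnames=[...] in metric definitions.
LABEL_RE = re.compile(r'labelnames\s*=\s*\[([^\]]*)\]')

def high_cardinality_labels(source: str) -> set:
    """Return any banned label names referenced in metric definitions."""
    found = set()
    for match in LABEL_RE.finditer(source):
        labels = {s.strip().strip("'\"") for s in match.group(1).split(",")}
        found |= labels & BANNED_LABELS
    return found

snippet = 'labelnames=["method", "status_code", "route", "request_id"]'
print(high_cardinality_labels(snippet))  # {'request_id'} -> fail the PR
```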
What We Got Lucky About¶
- Grafana's browser-side data cache retained the last 5 minutes of dashboard data in open browser sessions. This meant that Dmitri, who had a Grafana tab open, could see the final cached value of `prometheus_tsdb_head_series` at 11:23 (5.2M), which was the single most important data point for quickly identifying the root cause. Without that cached value, the team would have had to reconstruct the cardinality explosion from Thanos data and Prometheus logs, adding 30–60 minutes to the investigation.
- The Prometheus pod's OOM logs (captured by the kubelet before the container was killed) included a partial dump of the top series by name, which confirmed `http_request_duration_seconds` from `checkout-service` as the dominant contributor. This log artifact was available via `kubectl logs prometheus-0 --previous` and proved decisive.
- The Argo CD deploy history provided a precise timestamp and diff for the only change in the blast window. If `checkout-service` had been deployed via a manual `kubectl` command without an audit trail, correlating the monitoring outage with the deployment would have taken significantly longer.
Detection¶
How We Detected¶
Detection was accidental. Dmitri noticed Grafana was blank while checking an unrelated dashboard for a routine weekly review. There were no automated alerts for the monitoring outage because the monitoring system itself was the thing that had failed — a self-referential blind spot. Alertmanager continued sending stale alerts, which created a false sense of health. Dmitri's manual observation at 11:27 was the first detection signal.
Why We Didn't Detect Sooner¶
The monitoring stack had no external health check or watchdog. A simple synthetic check that queries Prometheus's /-/healthy endpoint every 60 seconds from an external system (or a separate lightweight Prometheus instance used only for meta-monitoring) would have detected the OOM at 11:23. Instead, because Alertmanager continued firing stale alerts, there was no silence or anomaly in the pager queue that would have prompted earlier investigation. The system appeared to be alerting normally.
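A minimal sketch of such a watchdog probe, using only the standard library; the probe interval, failure threshold, and paging hook are deliberately left out, so treat this as a shape rather than the implementation:

```python
import urllib.request
import urllib.error

def prometheus_is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Probe Prometheus's /-/healthy endpoint. Any connection error,
    timeout, or non-200 response counts as unhealthy."""
    try:
        url = f"{base_url}/-/healthy"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Run from OUTSIDE the monitored cluster. A real watchdog would loop
# every 60s and page (e.g. via PagerDuty) after N consecutive failures.
```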
Response¶
What Went Well¶
- Fumiko's use of `promtool tsdb analyze` to inspect the TSDB directly (bypassing the broken query layer) was the right call and produced the root cause in under 5 minutes.
- The Argo CD deploy history made it straightforward to identify `checkout-service` v2.14.1 as the candidate change; the hypothesis was confirmed within 3 minutes of identifying the deployment.
- Anastasia's hotfix turnaround was fast: from notification to merged and deployed in 13 minutes, which is excellent for a production hotfix that required code change, review, and CI.
What Went Poorly¶
- Detection depended entirely on an engineer happening to look at Grafana. The Prometheus OOM at 11:23 went unnoticed until 11:27, and even then was caught only by accident. The monitoring system had no watchdog.
- Code review did not catch the cardinality-destroying label. This is a knowledge gap that will recur without systemic remediation (checklist, linting, or automated analysis).
- The WAL corruption on restart extended the outage by approximately 14 minutes. The team was not familiar with the `--storage.tsdb.no-lockfile` flag or the head-block discard procedure, requiring Fumiko to look them up during the incident.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-018-01 | Add Prometheus meta-monitoring: external synthetic check on `/-/healthy` every 60s with PagerDuty integration; also alert if `prometheus_tsdb_head_series > 1000000` | Critical | Fumiko Hayashi | Open | 2025-08-01 |
| AI-018-02 | Add cardinality check to CI: run `mimirtool analyze` (or equivalent) on any PR modifying metric definitions; fail the PR if any label has estimated cardinality > 10,000 unique values | High | Fumiko Hayashi | Open | 2025-08-08 |
| AI-018-03 | Add Prometheus metric cardinality guidelines to the Backend Engineering code review checklist: "Labels must not use per-request unique values (request IDs, user IDs, trace IDs)" | High | Anastasia Reeves | Open | 2025-07-31 |
| AI-018-04 | Document the Prometheus head-block discard recovery procedure in the SRE runbook; include the specific `kubectl` commands needed | Medium | Dmitri Volkov | Open | 2025-08-05 |
| AI-018-05 | Evaluate Prometheus ingestion limits (e.g., per-scrape `sample_limit` and `label_limit` settings) as a safety net against future cardinality explosions | Medium | Fumiko Hayashi | Open | 2025-08-15 |
Lessons Learned¶
- **Prometheus labels are not log fields:** Developers accustomed to structured logging (where adding a `request_id` field to a log line is essentially free) often do not realize that Prometheus labels create an entirely new time series per unique value. Every team that writes application metrics instrumentation needs to understand the cardinality model before they touch label sets in production.
- **The monitoring system needs a monitor:** A single Prometheus instance is a single point of observability failure. Meta-monitoring, a lightweight external health check that watches the watcher, is not optional in production environments. When the monitoring system is unhealthy, the pager queue going quiet is not a good sign; it is a blind spot.
- **Fast hotfix deployment is a critical incident response capability:** Anastasia's 13-minute turnaround from page to production deploy was the fastest path to resolution. Teams that cannot ship a one-line code change to production in under 20 minutes under incident conditions will always have longer outages for software-caused incidents. Investing in fast, safe hotfix deployment paths (blue/green, feature flags, canary) directly reduces MTTR.
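The first lesson can be demonstrated without Prometheus itself: a toy model of series identity (metric name plus sorted label set) shows how a per-request label multiplies series, while bounded labels reuse one series:

```python
import uuid

# One series per unique (metric name, label set) combination.
series: set = set()

def observe(labels: dict) -> None:
    series.add(("http_request_duration_seconds",
                tuple(sorted(labels.items()))))

# BAD: request_id label -> every request mints a brand-new series.
for _ in range(1000):
    observe({"method": "POST", "route": "/checkout",
             "status_code": "200", "request_id": str(uuid.uuid4())})
print(len(series))  # 1000

# GOOD: bounded labels -> the same series is reused on every request.
series.clear()
for _ in range(1000):
    observe({"method": "POST", "route": "/checkout", "status_code": "200"})
print(len(series))  # 1
```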
Cross-References¶
- Failure Pattern: Cardinality explosion / unbounded label values; monitoring blind spot (self-referential failure)
- Topic Packs: Prometheus internals, metric cardinality, observability best practices, TSDB memory management
- Runbook: `runbooks/observability/prometheus-oom-recovery.md`
- Decision Tree: Triage → Grafana blank / "No data" → check `kubectl get pods -n monitoring` → if Prometheus is in CrashLoopBackOff, check `kubectl logs prometheus-0 --previous` → check TSDB head series count from Thanos → identify cardinality source via `promtool tsdb analyze` → revert offending metric → restart Prometheus with a clean head block