# Pattern: Metric Cardinality Explosion

ID: FP-040 · Family: Observability Gap · Frequency: Common · Blast Radius: Single Service to Multi-Service · Detection Difficulty: Subtle

## The Shape
Prometheus (and similar time-series databases) stores one time series per unique combination of metric name and label values. When a high-cardinality value (user ID, request ID, URL path with IDs embedded) is used as a label, the number of time series multiplies by the cardinality of that value: 1 million users means 1 million time series for a single metric. The TSDB runs out of memory; Prometheus crashes or slows to a crawl. Paradoxically, adding more detailed observability can destroy the monitoring system.
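The blowup is just multiplication, which a toy model makes concrete. A minimal sketch (plain Python, not the actual TSDB; the metric name, label names, and cardinalities are illustrative):

```python
# Toy model: a TSDB keeps one time series per unique
# (metric name, label values) combination, so the series count
# for one metric is the product of its label cardinalities.
def series_for(metric, label_cardinalities):
    """Series created by one metric = product of its label cardinalities."""
    total = 1
    for card in label_cardinalities.values():
        total *= card
    return total

# Well-bounded labels: method x status x endpoint class.
ok = series_for("http_requests_total",
                {"method": 5, "status": 6, "endpoint": 40})
print(ok)  # -> 1200, fine

# Add one high-cardinality label and every existing series multiplies.
bad = series_for("http_requests_total",
                 {"method": 5, "status": 6, "endpoint": 40,
                  "user_id": 1_000_000})
print(bad)  # -> 1200000000, Prometheus is dead
```

Note the key property: the new label does not *add* series, it *multiplies* every existing combination.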
## How You'll See It

### In Kubernetes
If `user_id` has 1 million distinct values, a single metric labeled with it creates 1 million time series.
Prometheus memory grows without bound. Scrapes start timing out. Dashboards load slowly.
Eventually Prometheus is OOMKilled, or becomes so slow it misses scrape intervals.
Prometheus's own metrics reveal the issue: `prometheus_tsdb_head_series` grows rapidly after the label was introduced.
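The self-monitoring queries are worth keeping at hand; for example (PromQL; `prometheus_tsdb_head_series` is a metric Prometheus exports about itself):

```promql
# Current number of in-memory (head block) series -- watch the slope.
prometheus_tsdb_head_series

# Approximate series added over the last hour.
delta(prometheus_tsdb_head_series[1h])
```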
### In Linux/Infrastructure
A StatsD counter tagged with a UUID per request: every request creates a new metric name. Graphite's Whisper database fills with millions of tiny files (one per metric); `df -i` shows inode exhaustion (FP-001) from metric files.
### In CI/CD
CI metrics with `build_id` or `commit_sha` as a label. Every commit creates new time series. Over 6 months of CI history, 50,000 builds × N metrics per build = millions of time series. Prometheus can't serve dashboards in under 10 seconds.
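The CI arithmetic above can be made concrete. A sketch assuming 40 metrics per build (N is not specified above; 40 is an illustrative figure):

```python
builds = 50_000         # ~6 months of CI history
metrics_per_build = 40  # assumption: counters, gauges, histogram series per build

total_series = builds * metrics_per_build
print(total_series)  # -> 2000000, millions of series from build_id alone
```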
## The Tell

- `prometheus_tsdb_head_series` grows rapidly after a code/config change.
- Prometheus memory grows without bound; it is OOMKilled or becomes increasingly slow.
- A specific metric introduced recently has a label with many distinct values.
- `topk(10, count by(__name__)({__name__=~".+"}))` shows a specific metric with millions of series.

## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Prometheus hardware insufficient | Cardinality explosion | Prometheus was fine before a specific code change; series count grew rapidly after |
| Too many metrics (count) | Too many label values (cardinality) | Metric count is fine; `count by(__name__)` shows one metric with millions of series |
| Network or storage issue | TSDB overload | Prometheus metrics themselves show the memory/series explosion |
## The Fix (Generic)

- Immediate: Drop the high-cardinality label via a Prometheus `metric_relabel_configs` drop rule; this reduces series count immediately.
- Short-term: Remove the high-cardinality label from the metric; use only low-cardinality labels (status code, HTTP method, endpoint class; not specific path or user ID).
- Long-term: Establish a cardinality budget per metric (e.g., at most 1,000 unique label combinations per metric); graph `prometheus_tsdb_head_series` on a dashboard; alert on series count growth rate.
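The immediate fix can be sketched as a scrape-config fragment. The job name and label are illustrative, but `metric_relabel_configs` with `action: labeldrop` is standard Prometheus configuration:

```yaml
scrape_configs:
  - job_name: auth-service        # illustrative job name
    metric_relabel_configs:
      # Drop the offending label before ingestion; samples that now share
      # an identical label set collapse into far fewer series.
      - action: labeldrop
        regex: request_id
```

One caveat: after `labeldrop`, samples that differed only in the dropped label become identical and can trigger duplicate-sample errors at ingestion. Acceptable during an incident; the durable fix is still removing the label at the source.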
## Real-World Examples

- Example 1: An engineer added `user_id` to auth metrics. 2M active users = 2M time series for that metric alone. Prometheus was OOMKilled within 24 hours of the deploy.
- Example 2: A URL path used as a label: `/api/v1/users/123/orders` → each user ID creates a new path series. 500k users × 5 endpoints × 3 metrics = 7.5M series. Prometheus query timeout at 30s for any dashboard.
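The path-label failure in Example 2 is usually fixed by normalizing paths to route templates before they become label values. A minimal sketch (the regex and `:id` placeholder are illustrative, not from any particular framework):

```python
import re

def normalize_path(path: str) -> str:
    """Collapse numeric path segments so every user hits the same label value."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

print(normalize_path("/api/v1/users/123/orders"))
# -> /api/v1/users/:id/orders  ("v1" survives: the segment is not all digits)
print(normalize_path("/api/v1/users/123/orders/456"))
# -> /api/v1/users/:id/orders/:id
```

With this in place, 500k users collapse into one `path` label value per endpoint class.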
## War Story
Monitoring went down on a Thursday. Page at 11pm: "Prometheus is down."
`kubectl describe pod prometheus`: OOMKilled, exit 137. We increased the memory limit (512Mi → 1Gi). Fine for 2 hours, then OOMKilled again. `prometheus_tsdb_head_series` had been growing at 10,000 series/minute. We traced it to a PR merged that afternoon: someone had added `request_id` (a UUID per request) as a label on HTTP duration histograms. Each request created 14 new time series (histogram buckets plus sum and count). At 700 req/s: nearly 10,000 new series/second. We immediately deployed a `metric_relabel_configs` rule to drop the `request_id` label. Series count stabilized, and Prometheus memory dropped from 4GB to 200MB.
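The 14-series figure comes from how a Prometheus histogram is exported: one series per bucket plus a `_sum` and a `_count` series. A sketch of the back-of-envelope math (the 12-bucket layout is an assumption consistent with the story's figure of 14):

```python
buckets = 12                 # assumed: default-style bucket layout, incl. +Inf
series_per_label_set = buckets + 2   # each label set also gets _sum and _count
requests_per_second = 700    # every request carried a unique request_id label

new_series_per_second = requests_per_second * series_per_label_set
print(series_per_label_set)   # -> 14
print(new_series_per_second)  # -> 9800, i.e. nearly 10,000/s
```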
## Cross-References
- Topic Packs: observability-deep-dive, k8s-ops
- Footguns: observability-deep-dive/footguns.md — "Cardinality explosion"
- Related Patterns: FP-001 (inode exhaustion — same pattern in Graphite/filesystem-based metrics)