
Pattern: Metric Cardinality Explosion

ID: FP-040 · Family: Observability Gap · Frequency: Common · Blast Radius: Single Service to Multi-Service · Detection Difficulty: Subtle

The Shape

Prometheus (and similar time-series databases) stores one time series per unique combination of metric name and label values. When a high-cardinality value (user ID, request ID, URL path with embedded IDs) is used as a label, the number of time series multiplies by the cardinality of that value: 1 million distinct users means 1 million time series for a single metric. The TSDB runs out of memory, and Prometheus crashes or slows to a crawl. Paradoxically, adding more detailed observability can destroy the monitoring system itself.
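
The multiplication is easy to underestimate, because the worst case is the product of all label cardinalities. A back-of-the-envelope sketch (the label names and cardinalities are illustrative, not from any real service):

```python
# Worst-case series count for one metric is the product of its label
# cardinalities, assuming every label combination actually occurs.
def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    total = 1
    for cardinality in label_cardinalities.values():
        total *= cardinality
    return total

# Safe: a handful of bounded labels.
print(worst_case_series({"method": 4, "status": 5, "endpoint": 30}))        # 600

# Dangerous: one unbounded label dominates everything else.
print(worst_case_series({"method": 4, "status": 5, "user_id": 1_000_000}))  # 20000000
```

Note that the bounded labels barely matter once a single unbounded one is present; the product is dominated by the largest factor.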

How You'll See It

In Kubernetes

http_requests_total{user_id="12345", path="/api/v1/users/12345/orders"} ...
If user_id has 1 million distinct values, this single metric creates 1 million time series. Prometheus memory grows without bound. Scrapes start timing out. Dashboards load slowly. Eventually Prometheus is OOMKilled or becomes so slow it misses scrape intervals.

Prometheus's own metrics reveal the issue: prometheus_tsdb_head_series grows rapidly after the label was introduced.

In Linux/Infrastructure

A statsd counter tagged with a per-request UUID: every request creates a new metric name. Graphite's whisper database fills with millions of tiny files (one per metric), and df -i shows inode exhaustion (FP-001) from the metric files.
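
The one-file-per-metric blowup is easy to reproduce and measure. A self-contained simulation (the scratch directory stands in for Graphite's whisper root, e.g. /opt/graphite/storage/whisper; the real path varies by install):

```shell
# Simulate one-tiny-file-per-unique-metric growth in a scratch directory,
# then count files the way you would against the real whisper root.
dir=$(mktemp -d)
for i in $(seq 1 1000); do : > "$dir/request-$i.wsp"; done
find "$dir" -type f -name '*.wsp' | wc -l    # one file per unique metric name
df -i "$dir"                                 # inode usage on that filesystem
rm -rf "$dir"
```

Run the find/wc pair against the real whisper directory periodically; a count that grows in lockstep with request volume is the tell.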

In CI/CD

CI metrics with build_id or commit_sha as a label. Every commit creates new time series. Over 6 months of CI history, 50,000 builds × N metrics per build = millions of time series. Prometheus can't serve dashboards in under 10 seconds.

The Tell

prometheus_tsdb_head_series grows rapidly after a code/config change. Prometheus memory grows without bound; OOMKilled or increasingly slow. A specific metric introduced recently has a label with many distinct values. topk(10, count by(__name__)({__name__=~".+"})) shows a specific metric with millions of series.
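
A few PromQL queries that surface the offender quickly, run against the Prometheus instance itself (the user_id label and the alert threshold are illustrative):

```promql
# Total head series, and the ten metric names with the most series:
prometheus_tsdb_head_series
topk(10, count by (__name__)({__name__=~".+"}))

# For a suspect metric, how many distinct values one label contributes:
count(count by (user_id)(http_requests_total))

# Alert candidate: head series grew by more than 100k in the last hour:
delta(prometheus_tsdb_head_series[1h]) > 100000
```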

Common Misdiagnosis

Looks Like | But Actually | How to Tell the Difference
Prometheus hardware insufficient | Cardinality explosion | Prometheus was fine before a specific code change; series count grew rapidly after it
Too many metrics (count) | Too many label values (cardinality) | Metric count is unchanged; count by(__name__) shows one metric with millions of series
Network or storage issue | TSDB overload | Prometheus's own metrics show the memory/series explosion

The Fix (Generic)

  1. Immediate: Drop the high-cardinality label via a Prometheus metric_relabel_configs drop rule; this reduces series count immediately.
  2. Short-term: Remove the high-cardinality label from the metric; use low-cardinality labels only (status code, HTTP method, endpoint class — not specific path or user ID).
  3. Long-term: Establish a cardinality budget per metric (for example, a maximum of 1,000 unique label combinations); chart prometheus_tsdb_head_series on a dashboard; alert on series-count growth rate.
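
Step 1 can be expressed as a scrape-config fragment like the following (the job name, target, and user_id label are illustrative; metric_relabel_configs runs after the scrape, so the label never reaches the TSDB):

```yaml
scrape_configs:
  - job_name: api                      # hypothetical job name
    static_configs:
      - targets: ["api:9090"]          # hypothetical target
    metric_relabel_configs:
      # Emergency stop: drop the high-cardinality label before ingestion.
      # Caveat: if removing the label makes otherwise-identical series collide,
      # drop the whole metric instead (action: drop, source_labels: [__name__]).
      - action: labeldrop
        regex: user_id
```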

Real-World Examples

  • Example 1: Engineer added user_id to auth metrics. 2M active users = 2M time series for that metric alone. Prometheus OOMKilled within 24 hours of the deploy.
  • Example 2: URL path used as a label: /api/v1/users/123/orders → each user ID creates a new path series. 500k users × 5 endpoints × 3 metrics = 7.5M series. Prometheus query timeout at 30s for any dashboard.
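
Example 2 is typically fixed by normalizing paths to an endpoint template before using them as a label. A sketch under the assumption that numeric IDs and UUIDs are the only variable path segments (the regexes are not from any specific framework):

```python
import re

# Collapse variable path segments to a placeholder so the label set stays
# bounded by the number of routes, not the number of users.
# UUIDs must be rewritten first, or the numeric rule would mangle them.
_UUID = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_ID = re.compile(r"/\d+")

def endpoint_class(path: str) -> str:
    path = _UUID.sub("/{id}", path)
    return _ID.sub("/{id}", path)

print(endpoint_class("/api/v1/users/123/orders"))  # /api/v1/users/{id}/orders
print(endpoint_class("/api/v1/users/456/orders"))  # same label value -> same series
```

With this in place, 500k users across 5 endpoints collapse to 5 label values per metric instead of 2.5M.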

War Story

Monitoring went down on a Thursday. Page at 11pm: "Prometheus is down." kubectl describe pod prometheus: OOMKilled, exit code 137. We increased the memory limit (512Mi → 1Gi). It held for 2 hours, then OOMKilled again. prometheus_tsdb_head_series had been growing at 10,000 series/minute. We traced it to a PR merged that afternoon: someone had added request_id (a UUID per request) as a label on the HTTP duration histograms. Each request created 14 new time series (one per histogram bucket), so at 700 req/s the application was minting nearly 10,000 new series per second, far more than Prometheus managed to ingest before scrapes started failing. We immediately deployed a metric_relabel_configs rule to drop the request_id label. The series count stabilized, and Prometheus memory dropped from 4GB to 200MB.

Cross-References