# Thinking Out Loud: Prometheus Deep Dive
A senior SRE's internal monologue while working through a real Prometheus issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
Prometheus is consuming 18GB of RAM and climbing. The OOM-kill threshold is 24GB. The alert dashboard shows high cardinality warnings. I need to find the cardinality bomb before Prometheus runs out of memory and takes our monitoring with it.
## The Monologue
Prometheus memory growing — cardinality issue. Every unique combination of metric name + label values creates a time series. Too many time series = too much memory. Let me find what's exploding.
First stop, the TSDB status endpoint (`/api/v1/status/tsdb`): numSeries is 4,247,891. Over 4 million active time series. For a cluster our size (8 nodes, ~200 services), that's way too high. A healthy ratio is roughly 1,000-5,000 series per service. At 200 services, we should be around 200K-1M. 4.2M means something is generating series at a ridiculous rate.
Let me find the top metric names by series count.
curl -s http://prometheus:9090/api/v1/status/tsdb | jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"' | sort -rn | head -20
Top offender: http_request_duration_seconds_bucket with 2.1 million series. That's a histogram, so the bucket count matters. But 2.1M series for one histogram? Let me understand the cardinality.
### Mental Model: Cardinality = Product of Label Values
Time series cardinality is the PRODUCT of all label value counts. A histogram with 10 buckets, across 50 endpoints, 200 pods, 5 HTTP methods, and 20 status codes = 10 * 50 * 200 * 5 * 20 = 10 million series. Each dimension you add multiplies the total. The "last label added" looks innocent but doubles or triples the total.
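The multiplication above is worth internalizing, so here it is as a throwaway sketch. The per-label counts are the hypothetical ones from the example, not measured values:

```python
# Series count is the product of per-label unique-value counts.
# The counts below are the hypothetical ones from the example above.
label_cardinalities = {
    "le": 10,        # histogram buckets
    "endpoint": 50,
    "pod": 200,
    "method": 5,
    "status": 20,
}

series = 1
for count in label_cardinalities.values():
    series *= count

print(f"{series:,}")  # 10,000,000
```

Remove the `status` label and the total drops to 500,000; that is what "the last label multiplies the total" means in practice.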
Let me check what labels this metric has and how many unique values each has.
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' > /dev/null # warm up
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket) by (endpoint)' | jq '.data.result | length'
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket) by (pod)' | jq '.data.result | length'
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket) by (le)' | jq '.data.result | length'
Endpoints: 847. Pods: 156. Buckets (le): 15. But 847 endpoints is crazy. Most services should expose 5-15 endpoints for monitoring. Let me see what these endpoints look like.
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket) by (endpoint)' | jq -r '.data.result[] | .metric.endpoint' | sort | head -30
/api/v1/users/12345, /api/v1/users/67890, /api/v1/users/11111... They're including the USER ID in the endpoint label. That's the cardinality bomb. Every unique user ID creates a new set of time series. With 15 histogram buckets per endpoint, per pod, that's unbounded growth.
This is the number one Prometheus cardinality mistake. Someone instrumented their HTTP middleware to use the raw URL path as the endpoint label instead of the route template. /api/v1/users/:id should be one endpoint, not thousands.
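The instrumentation-side fix is to label by route template rather than resolved path. A minimal standalone sketch of that normalization — a real fix would read the template from the framework's router, and the `:id` placeholder is an illustrative choice:

```python
import re

# Collapse path segments that are purely numeric IDs into a placeholder
# so the endpoint label stays bounded. A real fix would use the
# framework's route table; this sketch just pattern-matches.
ID_SEGMENT = re.compile(r"[0-9]+")

def route_template(path: str) -> str:
    return "/".join(
        ":id" if ID_SEGMENT.fullmatch(part) else part
        for part in path.split("/")
    )

print(route_template("/api/v1/users/12345"))  # /api/v1/users/:id
print(route_template("/api/v1/users/67890"))  # /api/v1/users/:id
```

Both paths now map to one endpoint label value, so user growth no longer multiplies series count.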
### Mental Model: Labels Must Be Low-Cardinality
Prometheus labels should have a bounded, small set of values. Good labels: HTTP method (5 values), status code class (5 values), service name (bounded). Bad labels: user ID (unbounded), request ID (unbounded), timestamp (infinite), raw URL path (unbounded). If a label can have more than ~100 unique values, it probably shouldn't be a label.
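That ~100-value rule of thumb can even be enforced at instrumentation time. A hypothetical guard (not part of any real client library; names are made up):

```python
import warnings

# Hypothetical guard: track the unique values seen per label and warn
# once a label blows past a bound (~100 values per the rule of thumb).
# Real code would hook this into the metrics client wrapper.
class LabelGuard:
    def __init__(self, max_values: int = 100):
        self.max_values = max_values
        self.seen: dict[str, set[str]] = {}

    def observe(self, label: str, value: str) -> None:
        values = self.seen.setdefault(label, set())
        values.add(value)
        if len(values) == self.max_values + 1:
            warnings.warn(f"label {label!r} exceeded {self.max_values} unique values")

guard = LabelGuard(max_values=100)
for user_id in range(500):
    guard.observe("endpoint", f"/api/v1/users/{user_id}")  # warns at value 101
```

Catching this in CI or staging is far cheaper than catching it at 18GB of RSS in production.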
Let me figure out which service is doing this so we can fix the instrumentation.
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket{endpoint=~"/api/v1/users/.*"}) by (job)' | jq '.data.result[] | {job: .metric.job, count: .value[1]}'
Job: user-service. 1.8 million series from one service. That's our culprit. Now I need to stop the bleeding before fixing the instrumentation. I can either:

1. Add a metric_relabel_config in Prometheus to drop the high-cardinality metric from this target
2. Fix the instrumentation in the user-service
3. Both
Option 3. The relabeling is a 2-minute fix that stops the memory growth immediately. The instrumentation fix needs a code deploy.
# Check current Prometheus scrape config for user-service
kubectl get configmap prometheus-config -n monitoring -o yaml | grep -A 20 "user-service"
Let me add a metric_relabel_config to drop the high-cardinality series.
I'll add this under the user-service scrape config:
metric_relabel_configs:
  - source_labels: [__name__, endpoint]
    regex: 'http_request_duration_seconds_(bucket|count|sum);/api/v1/users/[0-9]+'
    action: drop
This drops any histogram series where the endpoint label matches a user-ID path pattern. After Prometheus reloads, those series will stop being ingested.
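How that rule matches: Prometheus joins the `source_labels` values with `;` (the default separator) and tests the regex fully anchored against the result. A quick simulation of that matching logic — the label values here are made-up examples:

```python
import re

# Prometheus joins source_labels with ";" and matches the relabel regex
# anchored at both ends; re.fullmatch mimics that here. (Prometheus uses
# RE2, but this pattern behaves the same in both engines.)
RULE = re.compile(r"http_request_duration_seconds_(bucket|count|sum);/api/v1/users/[0-9]+")

def dropped(name: str, endpoint: str) -> bool:
    return RULE.fullmatch(f"{name};{endpoint}") is not None

print(dropped("http_request_duration_seconds_bucket", "/api/v1/users/12345"))  # True
print(dropped("http_request_duration_seconds_bucket", "/api/v1/users/:id"))    # False
print(dropped("http_requests_total", "/api/v1/users/12345"))                   # False
```

Note the anchoring: once the service is fixed to emit `/api/v1/users/:id`, the healthy series pass through untouched.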
Let me watch the memory.
curl -s 'http://prometheus:9090/api/v1/query?query=process_resident_memory_bytes' | jq -r '.data.result[0].value[1]' | awk '{print $1/1024/1024/1024 " GB"}'
18.2GB. The dropped series won't free memory immediately — Prometheus keeps the head block in memory until the next compaction. But no new high-cardinality series will be ingested, so memory growth will stop.
Let me verify the drop is working.
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=count(http_request_duration_seconds_bucket{job="user-service"})' | jq -r '.data.result[0].value[1]'
Series count for user-service is already dropping as the scrape intervals pass and dropped series aren't re-ingested. Good.
Now I need to tell the user-service team to fix their instrumentation. The HTTP middleware should use the route template, not the resolved path. In most frameworks, this is a one-line change.
I also want to set up an alerting rule that fires on cardinality growth, so this doesn't sneak up on us again.
groups:
  - name: cardinality
    rules:
      - alert: HighCardinalityMetric
        expr: count by (__name__) ({__name__=~".+"}) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has {{ $value }} series"
That alerts if any single metric name has more than 50K series. On our cluster, no single metric should have more than 50K series if labels are properly bounded.
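What that expression computes, sketched over a toy in-memory series set (a threshold of 3 stands in for the real 50000):

```python
from collections import Counter

# Toy active-series set: (metric name, label pairs). Five user-ID bucket
# series and one "up" series.
series = [
    ("http_request_duration_seconds_bucket",
     (("endpoint", f"/api/v1/users/{i}"), ("le", "0.5")))
    for i in range(5)
] + [("up", (("job", "user-service"),))]

# count by (__name__): active series per metric name.
per_metric = Counter(name for name, _ in series)

threshold = 3  # stands in for the 50000 in the real rule
firing = {name for name, n in per_metric.items() if n > threshold}
print(firing)  # {'http_request_duration_seconds_bucket'}
```

One caveat from the real rule: `{__name__=~".+"}` touches every active series, so it's an expensive query; evaluating it on a 10m alerting interval keeps the cost tolerable.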
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Not know how to find high-cardinality metrics | Use the TSDB status API to find series counts by metric name and label | The TSDB API is Prometheus's built-in diagnostic for cardinality |
| See "too many time series" and increase memory | Find and eliminate the cardinality bomb | More memory is a bandaid — unbounded cardinality will eventually exceed any limit |
| Fix the instrumentation code and wait for a deploy | Add metric_relabel_configs immediately to stop the bleeding, then fix the code | Relabeling is a 2-minute fix; code changes take hours or days to deploy |
| Not set up preventive alerting | Add a cardinality alert to catch future explosions early | Cardinality bombs can come from any team deploying any service — you need automated detection |
## Key Heuristics Used
- Cardinality is Multiplicative: Labels multiply. A single high-cardinality label (user IDs, request IDs) in a metric with 15 histogram buckets creates unbounded series.
- Drop Before Fix: Use metric_relabel_configs to stop ingestion immediately, then fix the root cause (instrumentation). Don't wait for a code deploy while Prometheus approaches OOM.
- Labels Must Be Bounded: If a label value set can grow indefinitely, it should not be a Prometheus label. Use logs or traces for high-cardinality dimensions.
## Cross-References
- Primer — Prometheus data model, time series, and how labels create cardinality
- Street Ops — TSDB status API, cardinality debugging, and metric_relabel_configs
- Footguns — Using raw URL paths as labels, not monitoring Prometheus's own resource usage, and histogram bucket explosion