Prometheus Deep Dive Footguns¶
Mistakes that cause alerting blind spots, query errors, storage blowouts, or false confidence in monitoring.
1. High cardinality labels explode your TSDB¶
You add a user_id label to your HTTP metrics. There are 200,000 active users. Each unique combination of method, handler, status, and user_id creates a separate time series. Your series count goes from 500 to 50 million overnight. Prometheus OOMs, queries time out, and your monitoring is down.
Fix: Never use unbounded values as labels. User IDs, request IDs, UUIDs, email addresses, IP addresses — none of these belong as Prometheus labels. If you need per-user metrics, use a different system (logs, analytics DB). Keep label value sets small and enumerable (HTTP methods, status codes, predefined handler paths).
Debug clue: Run
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]' to find which metrics have the most series. If a single metric name accounts for millions of series, check its labels. The Prometheus /tsdb-status page also shows this in the UI under "Head Stats."
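One practical mitigation, sketched in Python: normalize unbounded path segments before using them as a handler label. The normalize_handler helper below is hypothetical, not part of prometheus_client.

```python
import re

def normalize_handler(path: str) -> str:
    """Collapse unbounded path segments so the handler label stays enumerable."""
    # UUIDs: /orders/550e8400-e29b-41d4-a716-446655440000 -> /orders/{uuid}
    path = re.sub(
        r'/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
        r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)',
        '/{uuid}', path)
    # Numeric IDs: /users/12345 -> /users/{id}
    path = re.sub(r'/\d+(?=/|$)', '/{id}', path)
    return path

print(normalize_handler('/users/12345/orders'))  # /users/{id}/orders
```

The label value set is now bounded by your route table, not by your user base.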
2. rate() over too-short a range¶
You write rate(http_requests_total[30s]) with a 15-second scrape interval. rate() needs at least two samples in the range to compute a rate. With a 30-second window and 15-second scrape interval, you get exactly two samples — but if one scrape is delayed or missing, you get zero or one sample and the query returns nothing. Dashboards show gaps. Alerts fail to evaluate.
Fix: Use a range that covers at least 4x your scrape interval. For a 15-second scrape interval, use [1m] at minimum and [5m] for stable alerting. The rule of thumb: rate(metric[5m]) is the safe default for 15-second scrape intervals.
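Side by side, assuming a 15-second scrape interval:

```
# Fragile: a 30s window holds at most two samples;
# one delayed or missed scrape and rate() returns nothing
rate(http_requests_total[30s])

# Minimum: window covers >= 4 scrape intervals
rate(http_requests_total[1m])

# Safe default for alerting
rate(http_requests_total[5m])
```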
3. Counter resets causing rate spikes¶
A service restarts. The counter drops from 150,000 to 0. rate() handles this correctly — it detects the reset and compensates. But increase() over a very short window right at the reset boundary can produce slightly inaccurate results due to extrapolation. More commonly, you see a brief spike in irate(): it uses only the last two samples, and its reset correction over such a short interval can overshoot.
Fix: Use rate() for alerting, not irate(). Use rate() with a range of at least [5m] to smooth over restarts. If you see anomalous spikes, check if the service restarted during that window. irate() is for dashboards where you want responsiveness; rate() is for alerts where you want stability.
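The division of labor in PromQL terms:

```
# Alerts: smooth across restarts and scrape jitter
rate(http_requests_total[5m])

# Dashboards only: responsive, but spiky around counter resets
irate(http_requests_total[5m])
```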
4. Histogram bucket boundaries miss your SLO¶
You use the default histogram buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]. Your SLO is "99% of requests under 200ms." There is no bucket boundary at 200ms. The closest are 100ms and 250ms. histogram_quantile interpolates linearly between buckets, so your p99 calculation is an approximation that can be significantly off, especially if the distribution is not uniform between 100ms and 250ms.
Fix: Define custom bucket boundaries that include your SLO thresholds:
# Python prometheus_client
from prometheus_client import Histogram

request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'handler'],
    buckets=[.005, .01, .025, .05, .1, .15, .2, .25, .3, .5, 1, 2.5, 5, 10],
)
Add boundaries at 150ms, 200ms, and 250ms if your SLO is 200ms. The more buckets around your SLO threshold, the more accurate the percentile calculation.
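With a boundary exactly at the SLO threshold, the SLO ratio needs no interpolation at all — you can read the under-200ms fraction straight from the cumulative bucket:

```
# Fraction of requests completing under 200ms (exact, no interpolation)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))
```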
5. Alerting on rate of a gauge¶
You have node_load1 (a gauge) and you write rate(node_load1[5m]). rate() is designed for counters — it assumes values only go up and handles resets. On a gauge, rate() produces misleading results because gauge values naturally go down. When load drops from 8 to 2, rate() may interpret this as a counter reset.
Fix: For gauges, use deriv() to calculate the rate of change, or use avg_over_time(), min_over_time(), max_over_time() for trend analysis. For alerting on gauge values, threshold directly: node_load1 > 10 or avg_over_time(node_load1[5m]) > 8.
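For example:

```
# Trend of a gauge: per-second slope via linear regression over the range
deriv(node_load1[10m])

# Alerting on a gauge: threshold the smoothed value directly
avg_over_time(node_load1[5m]) > 8
```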
6. Pushgateway stale metrics¶
You use the Pushgateway for a nightly batch job. The job pushes batch_job_success 1 when it succeeds. It runs at 2 AM. At 3 PM the next day, Prometheus still scrapes batch_job_success 1 from the Pushgateway because pushed metrics persist until explicitly deleted. The batch job fails the next night but never pushes a failure metric. The Pushgateway still reports the stale success from yesterday. Your monitoring shows green.
Fix: Push a timestamp metric (batch_job_last_success_timestamp) and alert when the timestamp is too old: time() - batch_job_last_success_timestamp > 90000 (25 hours). Or delete the metric group from the Pushgateway before each run and push fresh. Never use Pushgateway for long-running services — it is only for short-lived batch jobs.
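A sketch of the timestamp pattern with prometheus_client; the gateway address and job name are placeholders:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    'batch_job_last_success_timestamp_seconds',
    'Unix time of the last successful batch run',
    registry=registry,
)

def report_success(gateway: str = 'pushgateway.example:9091') -> None:
    """Call at the end of a successful run."""
    last_success.set_to_current_time()
    # push_to_gateway replaces the entire metric group for this job,
    # so nothing stale from a previous run can linger.
    push_to_gateway(gateway, job='nightly-batch', registry=registry)
```

The alert then thresholds time() minus this metric, exactly as above; if the job stops succeeding, the timestamp stops advancing and the alert fires.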
7. Storage retention too short for SLO windows¶
Your SLO is measured over a 30-day window. Prometheus retention is set to 15 days. Your 30-day error budget query returns partial data, underreporting errors from the first 15 days. Your SLO dashboard shows 99.99% when the reality is 99.5%.
Fix: Set retention to at least 2x your longest SLO window. For 30-day SLOs, retain at least 60 days locally. Or use remote storage (Thanos, Mimir) for long-term retention and query the remote backend for SLO calculations.
# Prometheus command-line flags (retention is not configurable in prometheus.yml)
# --storage.tsdb.retention.time=60d
# --storage.tsdb.retention.size=100GB   (whichever limit is hit first)
8. Federation scraping too many metrics¶
You set up federation to bring metrics from 20 cluster Prometheus instances into a global Prometheus. You use match[]={__name__=~".+"} to federate everything. Each cluster has 500,000 series. The global Prometheus now ingests 10 million series, queries are slow, and the global instance OOMs.
Fix: Federate only recording rules and aggregated metrics, not raw per-pod metrics. Use specific match[] patterns:
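A sketch of a selective federation job; the target names and metric-name prefixes are placeholders, assuming recording rules named with job: and cluster: prefixes:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'       # recording-rule aggregates only
        - '{__name__=~"cluster:.*"}'
    static_configs:
      - targets:
          - 'cluster-a-prometheus:9090'
          - 'cluster-b-prometheus:9090'
```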
For full-fidelity cross-cluster queries, use Thanos or Mimir instead of federation.
9. Summary vs histogram: summaries cannot aggregate¶
You deploy 10 instances of your service. Each exposes rpc_duration_seconds{quantile="0.99"}. You try to compute the global p99 across all instances: avg(rpc_duration_seconds{quantile="0.99"}). This is mathematically wrong. The average of p99 values is not the p99 of the combined distribution. Two instances with p99 of 100ms and p99 of 900ms do not average to a meaningful global p99.
Fix: Use histograms instead of summaries when you need cross-instance aggregation. With histograms, you aggregate the bucket counts and then compute the quantile:
# Correct: aggregate histogram buckets, then compute quantile
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Wrong: average pre-computed quantiles
avg(rpc_duration_seconds{quantile="0.99"})
10. absent() not working as expected¶
You write an alert: absent(up{job="api-server"}). This fires when NO series with job="api-server" exists at all. But if you have 10 instances and 9 are down, up{job="api-server"} still returns 1 series (the surviving instance), so absent() does not fire. You lose 90% of your fleet and the absent-alert does not trigger.
Fix: absent() detects total disappearance, not partial loss. For "some targets are down," use count(up{job="api-server"} == 0) > 0 or up{job="api-server"} == 0 directly. For "the expected number of targets changed," use count(up{job="api-server"}) < 10. Use absent() only when you want to know if a metric has completely vanished.
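Sketched as alerting rules; the names and the expected instance count are placeholders:

```yaml
groups:
  - name: api-server-availability
    rules:
      - alert: ApiServerInstanceDown    # partial loss: any instance down
        expr: up{job="api-server"} == 0
        for: 5m
      - alert: ApiServerFleetShrunk     # expected target count changed
        expr: count(up{job="api-server"}) < 10
        for: 5m
      - alert: ApiServerMetricsVanished # total disappearance
        expr: absent(up{job="api-server"})
        for: 5m
```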
11. Alertmanager routing matches all routes unexpectedly¶
Your routing config has continue: true on a route. This means the alert continues matching subsequent routes after the first match. If your catch-all route is receiver: default-slack, every alert also hits the default receiver in addition to its intended receiver. You get double notifications.
# Problematic config
routes:
  - match:
      severity: critical
    receiver: pagerduty
    continue: true  # <-- alert continues to the next route
  - match:
      severity: critical
    receiver: slack-critical
Fix: Only use continue: true when you intentionally want an alert to hit multiple receivers (e.g., both PagerDuty and Slack for critical alerts). For mutually exclusive routing, omit continue (defaults to false). Test your routing with amtool config routes test.
12. Stale markers on target disappearance¶
A pod gets terminated. At the next scrape cycle Prometheus writes stale markers for all of that target's series (and any series without a marker simply ages out of instant queries after the 5-minute lookback window). Any alert expression evaluating those series now returns no data instead of the last value. If your alert relies on the metric existing (e.g., up == 0), it stops firing once the series goes stale — not zero, gone.
Fix: Use absent() or absent_over_time() for "target completely disappeared" scenarios. For "target was recently up and is now gone," combine: absent(up{instance="x"}) or up{instance="x"} == 0. Understand that Prometheus intentionally removes stale series to prevent stale data from persisting in queries.
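absent_over_time() is the bridge here: it looks back over a range, so it still fires after staleness has removed the series from instant queries:

```
# Fires once up{instance="x"} has produced no samples in the last 10m
absent_over_time(up{instance="x"}[10m])
```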
Under the hood: Stale markers were introduced in Prometheus 2.0 to solve the "stale data" problem from 1.x, where metrics from dead targets persisted with their last value indefinitely. The 5-minute staleness window is hardcoded in Prometheus (the lookbackDelta default). You can increase it with --query.lookback-delta, but this trades correctness for delayed cleanup.