
Pattern: rate() Over Too-Short Window

ID: FP-044 · Family: Observability Gap · Frequency: Common · Blast Radius: Monitoring system · Detection Difficulty: Subtle

The Shape

Prometheus's rate() function needs at least two data points inside the range window to calculate a rate. With a 15-second scrape interval and a 30-second range window (rate(metric[30s])), only two data points fit in the window. A single missed scrape leaves one point, and rate() returns no result at all. Whether the alert fires depends on scrape timing, not on actual metric behavior. The result is noisy, unreliable alerting that erodes trust in the monitoring system.
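The two-sample requirement can be sketched with a simplified model of rate() (this is not the real Prometheus implementation, which also extrapolates to the window boundaries; the function and sample values here are illustrative):

```python
def naive_rate(samples, window_start, window_end):
    """Simplified sketch of Prometheus rate(): needs >= 2 samples in the window.

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    Returns a per-second rate, or None when too few samples are present
    (the real rate() returns an empty result in that case).
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # one missed scrape and the rate vanishes
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)

# 15s scrape interval, 30s window: only two samples fit.
print(naive_rate([(0, 100), (15, 130)], 0, 30))  # -> 2.0 (both scrapes landed)
print(naive_rate([(0, 100)], 0, 30))             # -> None (scrape at t=15 missed)
```

With both scrapes present the rate is well defined; losing a single scrape makes the expression disappear, which is exactly the flakiness described above.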

How You'll See It

In Kubernetes

# Unreliable: only 2 data points in window for 15s scrape interval
rate(http_requests_total[30s])

# Better: 8 data points minimum
rate(http_requests_total[2m])
An alert based on [30s] fires sporadically whenever a single scrape is missed: it fires during normal operation (false positive) and misses real issues (false negative). Engineers learn to distrust it.
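Counting samples per window makes the fragility concrete. A minimal sketch (the scrape schedule and the dropped sample are invented for illustration):

```python
def samples_in_window(scrape_times, window_seconds, now):
    """Count samples that land inside the range window ending at `now`."""
    return sum(1 for t in scrape_times if now - window_seconds < t <= now)

# Scrapes every 15s for 5 minutes, with the single scrape at t=285 missed.
scrapes = [t for t in range(0, 301, 15) if t != 285]

print(samples_in_window(scrapes, 30, 300))   # [30s] window: 1 sample, rate() is empty
print(samples_in_window(scrapes, 120, 300))  # [2m] window: 7 samples, rate() still works
```

One missed scrape empties the 30-second window entirely, while the 2-minute window barely notices it.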

In Linux/Infrastructure

node_cpu_seconds_total scraped every 10s. The alert uses rate(node_cpu_seconds_total[20s]): only two data points per window. A 1-second CPU burst that lands in one scrape interval dominates the two-point calculation, so rate() appears extremely high; a missed scrape leaves one point and rate() returns nothing, so a real problem can pass unseen. The alert is both over-sensitive and under-sensitive.
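The over-sensitivity is easy to reproduce: with only two samples per window a single burst dominates the computed rate, while a longer window averages it out. A minimal simulation (the traffic numbers are invented):

```python
# Counter grows by 100 per 10s scrape interval (10/s steady), plus one
# burst of 200 extra increments inside a single interval.
increments = [100] * 30
increments[9] += 200  # burst during one 10s interval
values = [sum(increments[:i]) for i in range(len(increments) + 1)]  # cumulative counter

# Two-point rate over the burst interval (what a [20s]-style window sees):
two_point = (values[10] - values[9]) / 10
# Rate over the full 5-minute span (what a [5m]-style window sees):
five_min = (values[30] - values[0]) / 300

print(two_point)  # 30.0 -- 3x the true steady rate
print(five_min)   # ~10.7 -- close to the real traffic
```

The same burst that triples the two-point rate barely moves the 5-minute rate, which is why the longer window gives a stable alerting signal.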

In CI/CD

CI build metric scraped every 60s. The alert uses rate(build_failures[2m]): 2 data points at most. A single missed scrape (the CI system briefly overloaded) drops the window to one point and the expression returns no data. The dashboard intermittently shows "no data" for a critical build metric.

The Tell

  • The alert fires intermittently without any real issue occurring.
  • The alert uses a range shorter than 4× the scrape interval.
  • rate() returns no data even when the target is up and healthy.
  • The alert evaluates inconsistently from one evaluation cycle to the next.

Common Misdiagnosis

Looks Like | But Actually | How to Tell the Difference
--- | --- | ---
Intermittent service issue | Noisy rate calculation | The "issue" disappears quickly; service metrics show no corresponding problem
Prometheus scrape issue | Too-short window | up is 1 and the target is healthy; the rate calculation itself is the issue
Real alert (intermittent) | False positive from a short window | Increasing the window to 5m stabilizes the alert; the intermittent firings stop

The Fix (Generic)

  1. Immediate: Increase the rate window to at least 4× the scrape interval: rate(metric[2m]) for a 15s scrape interval (8 data points minimum).
  2. Short-term: Audit all rate() expressions in alerting rules; apply the 4× rule universally.
  3. Long-term: Use irate() only for instant-rate (high-resolution spikes); use rate() with [5m] or [10m] for alerting (more stable signal); document the rationale in the alert annotations.
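The audit in step 2 can be partly automated. A rough sketch that scans PromQL expressions for rate() range windows shorter than 4× the scrape interval (the regex and defaults are illustrative, not a complete PromQL parser):

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}
RANGE_RE = re.compile(r"\brate\(\s*[\w:]+\s*\[(\d+)([smh])\]\s*\)")

def short_windows(promql, scrape_interval_s=15, factor=4):
    """Return the rate() range windows in `promql` below factor * scrape interval."""
    flagged = []
    for amount, unit in RANGE_RE.findall(promql):
        window_s = int(amount) * UNITS[unit]
        if window_s < factor * scrape_interval_s:
            flagged.append(f"{amount}{unit}")
    return flagged

print(short_windows("rate(http_requests_total[30s]) > 0.1"))  # ['30s']
print(short_windows("rate(http_requests_total[2m])"))         # []
```

Run over your alerting rule files, anything flagged is a candidate for the 4× fix from step 1.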

Real-World Examples

  • Example 1: Alert: rate(errors[30s]) > 0.1. Scrape interval: 15s. Alert fired 12 times in a week with no real error increase. Engineers muted it. A real error spike was missed because the alert was already muted.
  • Example 2: Dashboard using rate(requests[30s]) showed wildly fluctuating request rates — appeared to spike to 10× then drop to 0 in the next data point. The actual request rate was stable. Changing to [5m] showed a flat, expected line.

War Story

Our error rate alert had been muted for 2 weeks because "it keeps firing randomly." During that 2 weeks, a real error spike occurred (5% error rate for 20 minutes). Nobody was paged. We found out from a user report. Investigated the alert: rate(errors[30s]) with a 15s scrape interval. Two data points. If one scrape landed on a normal sample and one on a spike, the rate appeared massive. If a spike occurred exactly between two scrapes, it was invisible. We changed to rate(errors[5m]) > 0.05, un-muted the alert, and it's been firing only for real errors since.

Cross-References