
Pattern: rate() Over Too-Short Window

ID: FP-044 · Family: Observability Gap · Frequency: Common · Blast Radius: Monitoring system · Detection Difficulty: Subtle

The Shape

Prometheus's rate() function needs at least two data points inside the range window to calculate a rate. With a 15-second scrape interval and a 30-second range window (rate(metric[30s])), only two data points fit in the window. A single missed scrape leaves one point, and rate() returns no result at all. Whether the alert fires depends on scrape timing, not on actual metric behavior. The result is noisy, unreliable alerting that erodes trust in the monitoring system.
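The two-sample requirement can be sketched with a simplified model of rate() (this is not the real Prometheus implementation, which also extrapolates to the window boundaries; the function and sample values here are illustrative):

```python
def naive_rate(samples, window_start, window_end):
    """Simplified sketch of Prometheus rate(): needs >= 2 samples in the window.

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    Returns a per-second rate, or None when too few samples are present
    (the real rate() returns an empty result in that case).
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # one missed scrape and the rate vanishes
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)

# 15s scrape interval, 30s window: only two samples fit.
print(naive_rate([(0, 100), (15, 130)], 0, 30))  # -> 2.0 (both scrapes landed)
print(naive_rate([(0, 100)], 0, 30))             # -> None (scrape at t=15 missed)
```

With both scrapes present the rate is well defined; losing a single scrape makes the expression disappear, which is exactly the flakiness described above.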

How You'll See It

In Kubernetes

# Unreliable: only 2 data points in window for 15s scrape interval
rate(http_requests_total[30s])

# Better: 8 data points minimum
rate(http_requests_total[2m])
An alert based on [30s] fires sporadically whenever a single scrape is missed: it fires during normal operation (false positive) and misses real issues (false negative). Engineers learn to distrust it.
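Counting samples per window makes the fragility concrete. A minimal sketch (the scrape schedule and the dropped sample are invented for illustration):

```python
def samples_in_window(scrape_times, window_seconds, now):
    """Count samples that land inside the range window ending at `now`."""
    return sum(1 for t in scrape_times if now - window_seconds < t <= now)

# Scrapes every 15s for 5 minutes, with the single scrape at t=285 missed.
scrapes = [t for t in range(0, 301, 15) if t != 285]

print(samples_in_window(scrapes, 30, 300))   # [30s] window: 1 sample, rate() is empty
print(samples_in_window(scrapes, 120, 300))  # [2m] window: 7 samples, rate() still works
```

One missed scrape empties the 30-second window entirely, while the 2-minute window barely notices it.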

In Linux/Infrastructure

node_cpu_seconds_total scraped every 10s. The alert uses rate(node_cpu_seconds_total[20s]): only two data points per window. A 1-second CPU burst that lands in one scrape interval dominates the two-point calculation, so rate() appears extremely high; a missed scrape leaves one point and rate() returns nothing, so a real problem can pass unseen. The alert is both over-sensitive and under-sensitive.
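The over-sensitivity is easy to reproduce: with only two samples per window a single burst dominates the computed rate, while a longer window averages it out. A minimal simulation (the traffic numbers are invented):

```python
# Counter grows by 100 per 10s scrape interval (10/s steady), plus one
# burst of 200 extra increments inside a single interval.
increments = [100] * 30
increments[9] += 200  # burst during one 10s interval
values = [sum(increments[:i]) for i in range(len(increments) + 1)]  # cumulative counter

# Two-point rate over the burst interval (what a [20s]-style window sees):
two_point = (values[10] - values[9]) / 10
# Rate over the full 5-minute span (what a [5m]-style window sees):
five_min = (values[30] - values[0]) / 300

print(two_point)  # 30.0 -- 3x the true steady rate
print(five_min)   # ~10.7 -- close to the real traffic
```

The same burst that triples the two-point rate barely moves the 5-minute rate, which is why the longer window gives a stable alerting signal.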

In CI/CD

CI build metric scraped every 60s. The alert uses rate(build_failures[2m]): 2 data points at most. A single missed scrape (the CI system briefly overloaded) drops the window to one point and the expression returns no data. The dashboard intermittently shows "no data" for a critical build metric.

The Tell

  • The alert fires intermittently without any real issue occurring.
  • The alert uses a range shorter than 4× the scrape interval.
  • rate() returns no data even when the target is up and healthy.
  • The alert evaluates inconsistently from one evaluation cycle to the next.

Common Misdiagnosis

Looks Like | But Actually | How to Tell the Difference
--- | --- | ---
Intermittent service issue | Noisy rate calculation | The "issue" disappears quickly; service metrics show no corresponding problem
Prometheus scrape issue | Too-short window | up is 1 and the target is healthy; the rate calculation itself is the issue
Real alert (intermittent) | False positive from a short window | Increasing the window to 5m stabilizes the alert; the intermittent firings stop

The Fix (Generic)

  1. Immediate: Increase the rate window to at least 4× the scrape interval: rate(metric[2m]) for a 15s scrape interval (8 data points minimum).
  2. Short-term: Audit all rate() expressions in alerting rules; apply the 4× rule universally.
  3. Long-term: Use irate() only for instant-rate (high-resolution spikes); use rate() with [5m] or [10m] for alerting (more stable signal); document the rationale in the alert annotations.
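The audit in step 2 can be partly automated. A rough sketch that scans PromQL expressions for rate() range windows shorter than 4× the scrape interval (the regex and defaults are illustrative, not a complete PromQL parser):

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}
RANGE_RE = re.compile(r"\brate\(\s*[\w:]+\s*\[(\d+)([smh])\]\s*\)")

def short_windows(promql, scrape_interval_s=15, factor=4):
    """Return the rate() range windows in `promql` below factor * scrape interval."""
    flagged = []
    for amount, unit in RANGE_RE.findall(promql):
        window_s = int(amount) * UNITS[unit]
        if window_s < factor * scrape_interval_s:
            flagged.append(f"{amount}{unit}")
    return flagged

print(short_windows("rate(http_requests_total[30s]) > 0.1"))  # ['30s']
print(short_windows("rate(http_requests_total[2m])"))         # []
```

Run over your alerting rule files, anything flagged is a candidate for the 4× fix from step 1.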

Real-World Examples

  • Example 1: Alert: rate(errors[30s]) > 0.1. Scrape interval: 15s. Alert fired 12 times in a week with no real error increase. Engineers muted it. A real error spike was missed because the alert was already muted.
  • Example 2: Dashboard using rate(requests[30s]) showed wildly fluctuating request rates — appeared to spike to 10× then drop to 0 in the next data point. The actual request rate was stable. Changing to [5m] showed a flat, expected line.

War Story

Our error rate alert had been muted for 2 weeks because "it keeps firing randomly." During that 2 weeks, a real error spike occurred (5% error rate for 20 minutes). Nobody was paged. We found out from a user report. Investigated the alert: rate(errors[30s]) with a 15s scrape interval. Two data points. If one scrape landed on a normal sample and one on a spike, the rate appeared massive. If a spike occurred exactly between two scrapes, it was invisible. We changed to rate(errors[5m]) > 0.05, un-muted the alert, and it's been firing only for real errors since.

Cross-References