Pattern: rate() Over Too-Short Window¶
ID: FP-044 Family: Observability Gap Frequency: Common Blast Radius: Monitoring system Detection Difficulty: Subtle
The Shape¶
The Prometheus rate() function requires at least 2 data points in the range window to
calculate a rate. With a 15-second scrape interval and a 30-second range window
(rate(metric[30s])), at most 2 data points are available. A single missed scrape or a
brief spike leaves only 1 data point, and rate() returns no data (rendered as gaps or
NaN on dashboards). The alert then fires or stays silent based on scrape timing, not
actual metric behavior. The result is noisy, unreliable alerting that erodes trust in
the monitoring system.
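A toy simulation makes the failure concrete. The `simple_rate` helper below is a deliberately simplified model of `rate()` (no extrapolation to the window boundaries, no counter-reset handling); it exists only to show the two-sample requirement and how one missed scrape empties a short window:

```python
# A deliberately simplified model of rate(): no extrapolation to the
# window edges and no counter-reset handling, just the rule that at
# least 2 samples must fall inside the (left-open) range window.

def simple_rate(samples, window_start, window_end):
    """samples: list of (timestamp, counter_value) pairs, sorted by time.
    Returns a per-second rate, or None when fewer than 2 samples fall
    inside (window_start, window_end]."""
    inside = [(t, v) for t, v in samples if window_start < t <= window_end]
    if len(inside) < 2:
        return None  # shows up as "no data" on a dashboard
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    return (v1 - v0) / (t1 - t0)

# A counter increasing 1 unit/s, scraped every 15s for 2 minutes.
samples = [(t, t) for t in range(0, 121, 15)]
print(simple_rate(samples, 90, 120))   # 30s window, 2 samples: 1.0

# Drop one scrape (t=105 missed) and the 30s window goes blind:
degraded = [(t, v) for t, v in samples if t != 105]
print(simple_rate(degraded, 90, 120))  # None: only 1 sample left
print(simple_rate(degraded, 0, 120))   # a 2m window still works: 1.0
```

The same missed scrape that blanks the 30s window barely moves the 2m result, which is the whole argument for the wider window.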
How You'll See It¶
In Kubernetes¶
# Unreliable: only 2 data points in window for 15s scrape interval
rate(http_requests_total[30s])
# Better: 8 data points minimum
rate(http_requests_total[2m])
The [30s] version fires sporadically whenever a single scrape is missed. The
alert fires during normal operation (false positive) and misses real issues (false
negative). Engineers learn to distrust the alert.
In Linux/Infrastructure¶
node_cpu_seconds_total is scraped every 10s. The alert uses rate(node_cpu_seconds_total[20s]).
Only 2 data points per window. A 1-second CPU spike within the window causes rate()
to appear extremely high; a spike between scrapes is completely invisible. The alert
is both over-sensitive and under-sensitive.
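The two-sample arithmetic can be checked directly. The numbers below are illustrative (a counter growing 1 unit/s, scraped every 10s, with a burst of +50 landing between the t=50s and t=60s scrapes) and show why the 20s window screams in one evaluation, reads as perfectly normal in the next, and only a wider window reports a sensible average:

```python
# Illustrative numbers only: counter growing 1 unit/s, scraped every
# 10s; a burst of +50 lands between the t=50s and t=60s scrapes.

def two_sample_rate(v_old, v_new, seconds=10):
    # A 20s window at a 10s scrape interval holds exactly 2 samples,
    # so rate() reduces to this single difference quotient.
    return (v_new - v_old) / seconds

print(two_sample_rate(50, 110))   # burst window: 6.0, a 6x "spike"
print(two_sample_rate(120, 130))  # very next window: 1.0, baseline
# Over a 5m window (counter 0 at t=0, 350 at t=300s) the same burst
# averages out to roughly 1.17:
print((350 - 0) / 300)
```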
In CI/CD¶
CI build metric scraped every 60s. Alert uses rate(build_failures[2m]). 2 data points
at most. A single scrape miss (CI system briefly overloaded) causes the alert to show
NaN. Dashboard shows "no data" intermittently for a critical build metric.
The Tell¶
- Alert fires intermittently without any real issue occurring.
- Alert uses a range shorter than 4× the scrape interval.
- `rate()` returns NaN or no data even when the target is healthy.
- Prometheus evaluates the alert inconsistently across evaluations.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Intermittent service issue | Noisy rate calculation | Issue disappears quickly; service metrics show no corresponding problem |
| Prometheus scrape issue | Too-short window | Prometheus up is 1; target is healthy; rate calculation is the issue |
| Real alert (intermittent) | False positive from short window | Increasing window to 5m stabilizes the alert; intermittent firings stop |
The Fix (Generic)¶
- Immediate: Increase the rate window to at least 4× the scrape interval: `rate(metric[2m])` for a 15s scrape interval (8 data points minimum).
- Short-term: Audit all `rate()` expressions in alerting rules; apply the 4× rule universally.
- Long-term: Use `irate()` only for instant-rate views (high-resolution spikes); use `rate()` with `[5m]` or `[10m]` for alerting (a more stable signal); document the rationale in the alert annotations.
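The short-term audit can be roughed out in a few lines. This regex-based sketch is my own simplification, not an established tool: a real audit should load rule files as YAML and parse windows with Prometheus's full duration grammar (e.g. `1h30m`), which this single-unit parser does not handle:

```python
import re

# Hypothetical helper names; deliberately naive regex-based audit.
UNITS = {"s": 1, "m": 60, "h": 3600}

def window_seconds(window):
    """Convert a single-unit duration like '30s', '2m', or '1h' to seconds."""
    m = re.fullmatch(r"(\d+)([smh])", window)
    if not m:
        raise ValueError(f"unsupported duration: {window!r}")
    return int(m.group(1)) * UNITS[m.group(2)]

def short_windows(expr, scrape_interval_s=15):
    """Return range windows in expr shorter than 4x the scrape interval."""
    windows = re.findall(r"(?:rate|increase)\([^\[\]]*\[(\w+)\]\)", expr)
    return [w for w in windows if window_seconds(w) < 4 * scrape_interval_s]

print(short_windows("rate(errors_total[30s]) > 0.1"))  # ['30s']
print(short_windows("rate(errors_total[5m]) > 0.05"))  # []
```

Run over a dump of all rule expressions, this flags every window that violates the 4× rule for a given scrape interval.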
Real-World Examples¶
- Example 1: Alert: `rate(errors[30s]) > 0.1`. Scrape interval: 15s. The alert fired 12 times in a week with no real error increase, so engineers muted it. A real error spike was then missed because the alert was already muted.
- Example 2: A dashboard using `rate(requests[30s])` showed wildly fluctuating request rates: apparent spikes to 10× followed by drops to 0 at the next data point. The actual request rate was stable. Changing to `[5m]` showed a flat, expected line.
War Story¶
Our error-rate alert had been muted for 2 weeks because "it keeps firing randomly." During those 2 weeks, a real error spike occurred (a 5% error rate for 20 minutes). Nobody was paged; we found out from a user report. We investigated the alert: `rate(errors[30s])` with a 15s scrape interval. Two data points. If one scrape landed on a normal sample and one on a spike, the rate appeared massive. If a spike occurred exactly between two scrapes, it was invisible. We changed to `rate(errors[5m]) > 0.05`, un-muted the alert, and it has been firing only for real errors since.
Cross-References¶
- Topic Packs: observability-deep-dive, alerting-rules
- Footguns: observability-deep-dive/footguns.md — "`rate()` over too-short window"
- Related Patterns: FP-041 (alerting on restart — another alert reliability issue), FP-042 (missing absent alert — complementary monitoring gap)