Pattern: Missing absent() Alert¶
ID: FP-042 Family: Observability Gap Frequency: Common Blast Radius: Single Service (monitoring blind spot) Detection Difficulty: Actively Misleading
The Shape¶
Most Prometheus alerts are threshold-based: "alert if metric > N." These alerts require
the metric to exist. If the service crashes and stops emitting metrics, the metric
disappears from Prometheus entirely. Threshold-based alerts return "no data" (not
"firing"). The service is completely down; all alerts are silent. Without absent()
checks that detect metric disappearance, a complete service failure can go undetected
until users report it.
How You'll See It¶
In Kubernetes¶
Service crashes (OOMKill, image pull error, node failure). Prometheus scrapes the pod's
/metrics endpoint: connection refused. Prometheus records no data points for that
target (as opposed to a "0" value). Alert rule error_rate > 0.01 evaluates to "no
data" because there is no error rate to evaluate. Alert doesn't fire. Service is down;
monitoring shows nothing.
# Alert that doesn't fire when metrics disappear:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
# Alert that fires when metrics disappear:
- alert: ServiceDown
expr: absent(up{job="myapp"} == 1)
for: 2m
In Linux/Infrastructure¶
A Prometheus scrape target's up metric goes to 0 (scrape failed). No alerts check
up == 0. Prometheus's own up metric change is visible in dashboards but not in
alert rules. The service goes down; dashboards are red; pages are silent.
In CI/CD¶
A metrics exporter in a CI step crashes. No metrics are emitted. CI pipeline dashboard (Grafana) shows blank panels. Engineers interpret blank panels as "no data to display" (a Grafana display option) rather than "no metrics available." The exporter failure is invisible.
The Tell¶
Service is down. Grafana dashboards show blank panels (no data). Prometheus
up{job="myapp"}is 0 or the metric is absent. No alerts have fired despite the service being completely unavailable. Absence of metrics is the signal; threshold-based alerts cannot detect absence.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Alert system working correctly | Absent metric not monitored | Check Prometheus up metric; check if alerts fire when the target is removed |
| Service degraded but not down | Service completely down; metrics absent | No data is different from low-data; blank Grafana panel vs zero-valued panel |
| Prometheus issue | Target issue | Other targets scraping fine; only this target's up is 0 |
The Fix (Generic)¶
- Immediate: Add an alert on
absent(up{job="myapp"})orup{job="myapp"} == 0for all critical services. - Short-term: For every critical metric, add a companion
absent()alert; setfor: 2mto avoid alert flap on brief scrape failures. - Long-term: Establish a monitoring coverage checklist: every service must have at minimum an
upalert, a latency alert, and an error-rate alert — withabsent()guarding the scrape.
Real-World Examples¶
- Example 1: Payment service OOMKilled at 2am. Metrics disappeared. All threshold alerts: "no data." Service was down for 47 minutes before a user called support.
absent()alert would have fired within 2 minutes. - Example 2: Prometheus scrape configuration had a typo after a refactor: job name changed from
myapptomy-app. All existing alerts usedjob="myapp". Metrics still emitted but no alerts could fire (label mismatch). Discovered during an outage that should have paged.
War Story¶
We had 15 carefully crafted Prometheus alerts. Error rate alert. Latency alert. Saturation alert. The whole SRE book. Then our service died at 3am. We woke up to 200 user emails. Zero alerts had fired. The pod had OOMKilled; metrics were gone; all our alerts evaluated to "no data." We'd spent weeks tuning thresholds and zero time on "is the service even emitting metrics?" Added
absent()alerts for all critical jobs that same morning. Within a month, we'd caught 3 more undetected outages via absent alerts.
Cross-References¶
- Topic Packs: observability-deep-dive, alerting-rules
- Footguns: observability-deep-dive/footguns.md — "Missing absent() alert"
- Related Patterns: FP-041 (alerting on restart — complementary alert quality issue), FP-024 (health check lying — health check passes but service broken)