# Pattern: Alerting on Restart (Not Root Cause)
- ID: FP-041
- Family: Observability Gap
- Frequency: Very Common
- Blast Radius: Single Service (alert fatigue)
- Detection Difficulty: Subtle
## The Shape
A Kubernetes alert fires every time a pod restarts. Kubernetes is designed to restart pods automatically; a restart is Kubernetes doing its job. If the alert fires for every restart, it fires for routine OOMKills, minor memory spikes, and expected rolling restarts. On-call teams learn to ignore restart alerts. When a real problem causes excessive restarts, the alert has already been trained away by false positives. Alert fatigue causes the genuine signal to be missed.
## How You'll See It
### In Kubernetes
```yaml
# The alert that causes alert fatigue:
- alert: PodRestarted
  expr: kube_pod_container_status_restarts_total > 0
  for: 0m
```
### In Observability
The alert is calibrated for "something happened" not "something bad is happening now." A good restart alert would fire when restarts are occurring faster than expected or when a pod is in CrashLoopBackOff (a sign of a persistent problem, not a transient one).
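A minimal sketch of what a better-calibrated rule might look like. The alert name, the 5-minute `for` duration, and the labels/annotations here are illustrative assumptions, not from any specific runbook; only the metric and expression come from the text above:

```yaml
# Hypothetical sketch: fire only when Kubernetes itself has judged the
# pod to be in a persistent crash loop, and only if it stays there.
- alert: PodCrashLooping
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
  for: 5m  # ride out a single transient backoff before paging
  labels:
    severity: page
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
```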
## The Tell
The alert fires multiple times per week (or day). Most firings are not actionable (expected behavior). Engineers habitually "acknowledge" the alert without investigating. When a real CrashLoopBackOff occurs, the alert is already firing and already being ignored.
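One way to quantify the tell: Prometheus exposes currently firing alerts as the synthetic `ALERTS` series, so you can count how noisy an alert has been. A sketch, assuming the `PodRestarted` alert name from the example above:

```promql
# Number of samples over the last 30 days in which the alert was
# firing, per pod -- a rough proxy for how noisy the alert is.
count_over_time(ALERTS{alertname="PodRestarted", alertstate="firing"}[30d])
```

Compare that count with the number of firings that actually led to action to estimate the alert-to-action ratio.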
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Normal alert | Alert fatigue from low-quality signal | Check: how often does this alert fire? How often does it require action? |
| CrashLoopBackOff alert working | Restart alert that also catches CrashLoopBackOff | The alert fires for both expected and unexpected restarts; no discrimination |
## The Fix (Generic)
- Immediate: Raise the threshold. Alert on `increase(kube_pod_container_status_restarts_total[1h]) > 3` (more than 3 restarts in an hour) rather than on any restart.
- Short-term: Alert on the CrashLoopBackOff state directly: `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1`. This fires only when Kubernetes has determined the pod is in a crash loop, not for individual restarts.
- Long-term: Calibrate all alerts by measuring the alert-to-action ratio. If more than 50% of firings require no action, the alert threshold is too low. Review all alerts quarterly.
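Put together, the immediate and short-term fixes might look like the following Prometheus rule group. The group name, alert names, `for` durations, and severities are illustrative assumptions; the expressions are the ones given above:

```yaml
groups:
  - name: pod-restart-alerts  # hypothetical group name
    rules:
      # Immediate fix: alert on restart *rate*, not on any restart.
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
      # Short-term fix: alert on the CrashLoopBackOff state itself.
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m  # ignore a single transient backoff
        labels:
          severity: critical
```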
## Real-World Examples
- Example 1: Team with 50-pod service. Restart alert fired 8 times/day on average (routine memory spikes, rolling restarts). After 3 months, team had universally muted restart notifications. A real CrashLoopBackOff (misconfigured init container) went undetected for 45 minutes because "it's just the restart alert."
- Example 2: On-call rotations where "silence the restart alert" was the first action of every shift change. The institutional knowledge about what the alert meant was lost. New on-call member paged at 3am for a routine restart; spent 1 hour investigating a non-issue.
## War Story
Our restart alert fired 12 times between Friday 5pm and Monday 9am. All routine. Monday morning, a real CrashLoopBackOff: a production pod restarting every 20 seconds. The alert had fired 3 more times over the weekend for this pod. Nobody checked; restart alerts were background noise. We discovered the CrashLoopBackOff manually while investigating a latency issue. We changed the alert to CrashLoopBackOff state detection. It fired zero false positives over the next 3 months, and when it did fire, we investigated every single occurrence.
## Cross-References
- Topic Packs: observability-deep-dive, alerting-rules
- Footguns: observability-deep-dive/footguns.md — "Alerting on pod restart"
- Related Patterns: FP-042 (missing absent alert — related alert quality issue), FP-044 (rate over short window — another alert calibration issue)