
Pattern: Alerting on Restart (Not Root Cause)

ID: FP-041
Family: Observability Gap
Frequency: Very Common
Blast Radius: Single Service (alert fatigue)
Detection Difficulty: Subtle

The Shape

A Kubernetes alert fires every time a pod restarts. Kubernetes is designed to restart pods automatically; a restart is Kubernetes doing its job. If the alert fires for every restart, it fires for routine OOMKills, minor memory spikes, and expected rolling restarts. On-call teams learn to ignore restart alerts. When a real problem causes excessive restarts, the alert has already been trained away by false positives. Alert fatigue causes the genuine signal to be missed.

How You'll See It

In Kubernetes

# The alert that causes alert fatigue.
# Note: restarts_total is a cumulative counter, so "> 0" matches any pod
# that has EVER restarted -- the condition stays true indefinitely.
- alert: PodRestarted
  expr: kube_pod_container_status_restarts_total > 0
  for: 0m

This fires for every restart, and because the counter never drops back below the threshold, the condition stays active afterward. A pod that restarts twice a day due to a minor config issue fires this alert 730 times a year. On-call stops reading restart alerts. A pod that enters CrashLoopBackOff (restarting every 30 seconds) generates the same alert as a pod that restarted once for a legitimate OOMKill.

In Observability

The alert is calibrated for "something happened," not "something bad is happening now." A good restart alert would fire when restarts are occurring faster than expected, or when a pod is in CrashLoopBackOff (a sign of a persistent problem, not a transient one).
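A rate-based version of the rule can be sketched as follows, assuming Prometheus scraping kube-state-metrics; the alert name, threshold, and annotation text are illustrative, not prescriptive:

```yaml
# Fire only when restarts accumulate faster than expected,
# not on every individual restart.
- alert: PodRestartingTooOften
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour"
```

The `increase()` over a window distinguishes a pod restarting repeatedly right now from one that restarted once hours ago, which the cumulative-counter comparison cannot do.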

The Tell

The alert fires multiple times per week (or day). Most firings are not actionable (expected behavior). Engineers have a habit of "acknowledging" the alert without investigating. When a real CrashLoopBackOff occurs, the alert was already present and ignored.

Common Misdiagnosis

Looks Like                     | But Actually                                     | How to Tell the Difference
Normal alert                   | Alert fatigue from low-quality signal            | Check: how often does this alert fire? How often does it require action?
CrashLoopBackOff alert working | Restart alert that also catches CrashLoopBackOff | The alert fires for both expected and unexpected restarts; no discrimination

The Fix (Generic)

  1. Immediate: Increase the threshold: alert on increase(kube_pod_container_status_restarts_total[1h]) > 3 (more than 3 restarts in an hour) rather than any restart.
  2. Short-term: Alert on CrashLoopBackOff state directly: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1 — this fires only when Kubernetes has determined the pod is in a crash loop, not for individual restarts.
  3. Long-term: Calibrate all alerts by measuring: alert-to-action ratio. If >50% of firings require no action, the alert threshold is too low. Review all alerts quarterly.
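The short-term fix can be written as a complete Prometheus rule. A minimal sketch, assuming kube-state-metrics is scraped; the group name, severity label, `for` duration, and annotations are illustrative choices:

```yaml
groups:
  - name: pod-health
    rules:
      # Fires only when Kubernetes itself has decided the pod is
      # crash-looping, not for individual restarts.
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m   # require the state to persist, filtering brief flaps
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
```

The `for: 5m` clause is the second half of the fix: even the CrashLoopBackOff state can appear briefly during a bad rollout that self-heals, and requiring it to persist keeps the alert-to-action ratio high.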

Real-World Examples

  • Example 1: Team with 50-pod service. Restart alert fired 8 times/day on average (routine memory spikes, rolling restarts). After 3 months, team had universally muted restart notifications. A real CrashLoopBackOff (misconfigured init container) went undetected for 45 minutes because "it's just the restart alert."
  • Example 2: On-call rotations where "silence the restart alert" was the first action of every shift change. The institutional knowledge about what the alert meant was lost. New on-call member paged at 3am for a routine restart; spent 1 hour investigating a non-issue.

War Story

Our restart alert fired 12 times between Friday 5pm and Monday 9am. All routine. Monday morning, a real CrashLoopBackOff: a production pod restarting every 20 seconds. The alert had fired 3 more times over the weekend for this pod. Nobody checked; restart alerts were background noise. We discovered the CrashLoopBackOff manually while investigating a latency issue. We changed the alert to CrashLoopBackOff state detection. It fired zero times over the next 3 months, and when it finally did fire, we investigated every single one.

Cross-References