Symptom: Alert Storm | Cause: Flapping Health Checks | Fix: Probe Tuning

Domains: observability | networking | kubernetes_ops
Level: L2 | Estimated time: 30-45 min

Initial Alert

Alertmanager fires 47 alerts in 3 minutes starting at 09:30 UTC:

CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-g5h6i (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
... (repeating)
WARNING: AlertManagerAlertsFiring — 47 active alerts in prod
CRITICAL: SLO — catalog-service availability dropped to 94.2% (target: 99.9%)

Observable Symptoms

  • 47 alerts firing and resolving in rapid succession for catalog-service pods.
  • All 3 replicas are cycling between Ready and NotReady states every 30-60 seconds.
  • kubectl get pods -w shows containers being restarted: RESTARTS count climbing by 2-3 per minute.
  • Application logs show no errors — the catalog service is processing requests successfully between restarts.
  • End users report intermittent "502 Bad Gateway" errors (the service is briefly unavailable during each restart).
  • No code or configuration changes to catalog-service in the last week.
  • Node resource usage is normal (CPU at 45%, memory at 62%).
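Symptoms like these — pods flipping Ready/NotReady on a regular cadence with restart counts climbing while the application itself logs no errors — are consistent with probe thresholds tighter than the health endpoint's worst-case latency. A hypothetical probe configuration that would produce exactly this flapping (the endpoint path, port, and timings are illustrative, not taken from the incident):

```yaml
# Illustrative catalog-service probe config that could cause the flapping.
# With timeoutSeconds: 1 and failureThreshold: 1, a single health-check
# response slower than one second marks the pod NotReady (readiness)
# and restarts the container (liveness) — matching the observed
# 30-60 second Ready/NotReady cycle and 2-3 restarts per minute.
livenessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1       # one slow response counts as a failure
  failureThreshold: 1     # one failure immediately restarts the container
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 1     # one failure immediately removes the pod from endpoints
```

If this is the mechanism, kubectl describe pod on an affected pod would show "Readiness probe failed" / "Liveness probe failed" events with timeout errors, with no corresponding application errors in the logs.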

The Misleading Signal

An alert storm with pods cycling between Ready and NotReady looks like an observability configuration issue — perhaps alert thresholds are too sensitive, Alertmanager grouping is wrong, or alert routing is misconfigured. The sheer volume of alerts makes it feel like a monitoring problem rather than an application problem. Engineers start tuning Alertmanager routing and inhibition rules to suppress the noise instead of investigating why the pods are actually flapping.
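The title names the real fix: probe tuning, not alert suppression. A sketch of more forgiving settings, assuming the health endpoint occasionally responds slowly under load (the values are illustrative starting points, not prescriptive, and the endpoint path and port are hypothetical):

```yaml
# Sketch of tuned probes: tolerate occasional slow health-check
# responses instead of treating every one as a failure.
livenessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5       # allow slow responses rather than failing them
  failureThreshold: 3     # require 3 consecutive failures (~30s) before restart
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3     # brief slowness no longer ejects the pod from endpoints
  successThreshold: 1
```

The design intent is that the liveness probe stays strictly more tolerant than readiness: readiness failures only remove a pod from load balancing (recoverable), while liveness failures restart the container (disruptive), so the latter should need sustained evidence. Kubernetes defaults (failureThreshold: 3, timeoutSeconds: 1) already embody part of this; a failureThreshold of 1, as implied by the flapping, removes all tolerance.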