# Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
Domains: observability | networking | kubernetes_ops · Level: L2 · Estimated time: 30-45 min
## Initial Alert
Alertmanager fires 47 alerts in 3 minutes starting at 09:30 UTC:
```
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-g5h6i (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
... (repeating)
WARNING: AlertManagerAlertsFiring — 47 active alerts in prod
CRITICAL: SLO — catalog-service availability dropped to 94.2% (target: 99.9%)
```
## Observable Symptoms
- 47 alerts firing and resolving in rapid succession for `catalog-service` pods.
- All 3 replicas are cycling between Ready and NotReady states every 30-60 seconds.
- `kubectl get pods -w` shows containers being restarted: the `RESTARTS` count climbs by 2-3 per minute.
- Application logs show no errors; the catalog service is processing requests successfully between restarts.
- End users report intermittent "502 Bad Gateway" errors (the service is briefly unavailable during each restart).
- No code or configuration changes to catalog-service in the last week.
- Node resource usage is normal (CPU 45%, memory 62%).
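The availability hit in the SLO alert can be sanity-checked with back-of-the-envelope math: each flap cycle leaves a replica NotReady for some fraction of the cycle, and the service only returns 502s when every replica is NotReady at once. The numbers below (15 s NotReady per 45 s cycle) are illustrative assumptions, not measurements from this incident:

```python
# Back-of-the-envelope impact of flapping replicas on availability.
# All numbers are illustrative assumptions, not taken from the incident.
not_ready_seconds = 15   # assumed NotReady window per flap cycle
cycle_seconds = 45       # assumed flap cycle length (symptoms say 30-60 s)
replicas = 3

# Fraction of time a single replica is out of the endpoint pool.
p_not_ready = not_ready_seconds / cycle_seconds

# If flaps were independent, all replicas are down simultaneously
# with probability p^3; availability is the complement.
p_all_down = p_not_ready ** replicas
availability = 1 - p_all_down

print(f"per-replica NotReady fraction: {p_not_ready:.1%}")
print(f"availability if flaps are independent: {availability:.2%}")
```

Even under the generous independence assumption this lands well below 99.9%; correlated flaps (e.g. all probes timing out against the same slow dependency) drive availability down much faster, which is consistent with the 94.2% in the alert.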
The Misleading Signal¶
An alert storm with pods cycling between Ready/NotReady looks like an observability configuration issue: alert thresholds too sensitive, Alertmanager grouping wrong, or alert routing misconfigured. The sheer volume of alerts makes it feel like a monitoring problem rather than an application problem, so engineers start tuning Alertmanager routing and inhibition rules to suppress the noise instead of investigating why the pods are actually flapping.
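The fix the title points at is probe tuning, not alert suppression. A minimal sketch of what that tuning might look like, assuming the probes hit an HTTP health endpoint at `/healthz` on port 8080 (both hypothetical, as are all values):

```yaml
# Hypothetical fragment of the catalog-service Deployment pod spec.
containers:
  - name: catalog-service
    readinessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080           # assumed container port
      periodSeconds: 10
      timeoutSeconds: 5      # default is 1s; slow dependency calls then look "down"
      failureThreshold: 3    # require consecutive failures before marking NotReady
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 20
      timeoutSeconds: 5
      failureThreshold: 6    # restart only after ~2 minutes of sustained failure
```

The key levers are `timeoutSeconds` and `failureThreshold`: with Kubernetes' default 1-second probe timeout, a single slow health-check response is enough to start a flap, while the values above require sustained failure before the pod leaves the endpoint pool or gets restarted.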