Symptom: Alert Storm | Cause: Flapping Health Checks | Fix: Probe Tuning

Domains: observability | networking | kubernetes_ops
Level: L2 | Estimated time: 30-45 min

Initial Alert

Alertmanager fires 47 alerts in 3 minutes starting at 09:30 UTC:

CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-g5h6i (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-a1b2c (prod)
CRITICAL: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
RESOLVED: KubePodNotReady — catalog-service-7c6d5e4f3-d3e4f (prod)
... (repeating)
WARNING: AlertManagerAlertsFiring — 47 active alerts in prod
CRITICAL: SLO — catalog-service availability dropped to 94.2% (target: 99.9%)

Observable Symptoms

  • 47 alerts firing and resolving in rapid succession for catalog-service pods.
  • All 3 replicas are cycling between Ready and NotReady states every 30-60 seconds.
  • kubectl get pods -w shows containers being restarted: RESTARTS count climbing by 2-3 per minute.
  • Application logs show no errors — the catalog service is processing requests successfully between restarts.
  • End users report intermittent "502 Bad Gateway" errors (the service is briefly unavailable during each restart).
  • No code or configuration changes to catalog-service in the last week.
  • Node resource usage is normal (CPU at 45%, memory at 62%).
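Symptoms like these — pods flipping Ready/NotReady on a regular cadence with restart counts climbing while the application itself logs no errors — are consistent with probe thresholds tighter than the health endpoint's worst-case latency. A hypothetical probe configuration that would produce exactly this flapping (the endpoint path, port, and timings are illustrative, not taken from the incident):

```yaml
# Illustrative catalog-service probe config that could cause the flapping.
# With timeoutSeconds: 1 and failureThreshold: 1, a single health-check
# response slower than one second marks the pod NotReady (readiness)
# and restarts the container (liveness) — matching the observed
# 30-60 second Ready/NotReady cycle and 2-3 restarts per minute.
livenessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1       # one slow response counts as a failure
  failureThreshold: 1     # one failure immediately restarts the container
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 1     # one failure immediately removes the pod from endpoints
```

If this is the mechanism, kubectl describe pod on an affected pod would show "Readiness probe failed" / "Liveness probe failed" events with timeout errors, with no corresponding application errors in the logs.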

The Misleading Signal

An alert storm with pods cycling between Ready and NotReady looks like an observability configuration issue — perhaps alert thresholds are too sensitive, Alertmanager grouping is wrong, or alert routing is misconfigured. The sheer volume of alerts makes it feel like a monitoring problem rather than an application problem. Engineers start tuning Alertmanager routing and inhibition rules to suppress the noise instead of investigating why the pods are actually flapping.
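The title names the real fix: probe tuning, not alert suppression. A sketch of more forgiving settings, assuming the health endpoint occasionally responds slowly under load (the values are illustrative starting points, not prescriptive, and the endpoint path and port are hypothetical):

```yaml
# Sketch of tuned probes: tolerate occasional slow health-check
# responses instead of treating every one as a failure.
livenessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5       # allow slow responses rather than failing them
  failureThreshold: 3     # require 3 consecutive failures (~30s) before restart
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3     # brief slowness no longer ejects the pod from endpoints
  successThreshold: 1
```

The design intent is that the liveness probe stays strictly more tolerant than readiness: readiness failures only remove a pod from load balancing (recoverable), while liveness failures restart the container (disruptive), so the latter should need sustained evidence. Kubernetes defaults (failureThreshold: 3, timeoutSeconds: 1) already embody part of this; a failureThreshold of 1, as implied by the flapping, removes all tolerance.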