Remediation: Alert Storm Caused by Flapping Health Checks (Fix: Probe Tuning)¶
Immediate Fix (Kubernetes Ops — Domain C)¶
The fix is in the Kubernetes probe configuration and DNS settings.
Step 1: Increase the probe timeout (immediate relief)¶
$ kubectl patch deployment catalog-service -n prod --type=json \
-p='[
{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":5},
{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":5}
]'
deployment.apps/catalog-service patched
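As a sanity check on the numbers: the worst-case window between the first failed probe and kubelet action is simply periodSeconds × failureThreshold. A rough sketch (the helper is hypothetical and the values are illustrative defaults, not read from the live deployment):

```python
# Hypothetical helper: worst-case time from the first failed probe to
# kubelet action (restart for liveness, removal from endpoints for readiness).
def seconds_until_action(period_seconds: int, failure_threshold: int) -> int:
    return period_seconds * failure_threshold

# With a 10s period and threshold of 3, a pod is killed after ~30s of
# consecutive liveness failures; raising timeoutSeconds to 5 gives a slow
# dependency more headroom within each individual probe attempt.
print(seconds_until_action(10, 3))  # 30
```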
Step 2: Fix the DNS configuration¶
# Remove the extra search domains and reduce ndots
$ kubectl patch deployment catalog-service -n prod --type=json \
-p='[
{"op":"replace","path":"/spec/template/spec/dnsConfig","value":{
"options":[{"name":"ndots","value":"2"}]
}}
]'
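To see why ndots matters here, a simplified sketch of glibc-style search-list expansion: a name with fewer dots than ndots is tried against each search domain before the absolute name. The search domains below are assumed typical cluster defaults, and lookups() is an illustrative model, not real resolver code:

```python
# Assumed search list for a pod in the prod namespace (typical defaults).
SEARCH = ["prod.svc.cluster.local", "svc.cluster.local", "cluster.local"]

def lookups(name: str, ndots: int) -> list[str]:
    """Simplified model of glibc search-list expansion (ignores resets/retries)."""
    if name.endswith("."):            # trailing dot: absolute name, no expansion
        return [name.rstrip(".")]
    if name.count(".") >= ndots:      # "dotty" enough: try the absolute name first
        return [name] + [f"{name}.{d}" for d in SEARCH]
    return [f"{name}.{d}" for d in SEARCH] + [name]

# "catalog-search.prod" has one dot: with the Kubernetes default ndots=5 it
# walks every search domain; a trailing dot bypasses expansion entirely.
print(len(lookups("catalog-search.prod", 5)))  # 4
print(lookups("catalog-search.prod.", 2))      # ['catalog-search.prod']
```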
Step 3: Fix the health check to not depend on external services¶
The liveness probe should check the process health, not dependency health. Move dependency checks to the readiness probe only:
# Correct probe design
livenessProbe:
httpGet:
path: /healthz # Simple "am I alive" check
port: 8080
timeoutSeconds: 3
periodSeconds: 15
failureThreshold: 5
readinessProbe:
httpGet:
path: /ready # Dependency check (database, cache, etc.)
port: 8080
timeoutSeconds: 5
periodSeconds: 10
failureThreshold: 3
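The split above can be sketched as a tiny routing rule. The route() helper and the deps_ok flag are hypothetical stand-ins; a real service would ping its database and cache inside the readiness path:

```python
def route(path: str, deps_ok: bool) -> int:
    """Return the HTTP status a health server would answer with."""
    if path == "/healthz":
        return 200                      # alive: process is up, nothing more
    if path == "/ready":
        return 200 if deps_ok else 503  # ready only when dependencies answer
    return 404

# A pod with a down database stays alive (no restart) but is pulled from
# Service endpoints until the dependency recovers.
print(route("/healthz", deps_ok=False))  # 200
print(route("/ready", deps_ok=False))    # 503
```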
Verification¶
Domain A (Observability) — Alert storm resolved¶
# Check active alerts
$ curl -s http://alertmanager.monitoring:9093/api/v2/alerts | jq '[.[] | select(.labels.alertname=="KubePodNotReady")] | length'
0
# Check pod restart count (should stabilize)
$ kubectl get pods -n prod -l app=catalog-service
NAME READY STATUS RESTARTS AGE
catalog-service-8d7e6f5a4-j9k8l 1/1 Running 0 5m
catalog-service-8d7e6f5a4-m2n1o 1/1 Running 0 5m
catalog-service-8d7e6f5a4-p4q3r 1/1 Running 0 5m
Domain B (Networking) — DNS latency reduced¶
$ kubectl exec catalog-service-8d7e6f5a4-j9k8l -n prod -- \
curl -s -w "dns: %{time_namelookup}, total: %{time_total}\n" \
-o /dev/null http://catalog-search.prod:9200/_cluster/health
dns: 0.004, total: 0.052
DNS resolution is back to 4 ms.
Domain C (Kubernetes) — Probes configured correctly¶
$ kubectl get deployment catalog-service -n prod -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | jq .
{
"httpGet": {"path": "/healthz", "port": 8080},
"timeoutSeconds": 3,
"periodSeconds": 15,
"failureThreshold": 5
}
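The runbook rules can also be enforced mechanically. A hypothetical guardrail that checks a probe spec (shaped like the JSON returned above) against them:

```python
def probe_ok(probe: dict, kind: str) -> bool:
    """Hypothetical check of a probe spec against the runbook rules."""
    if probe.get("timeoutSeconds", 1) < 3:       # runbook: timeout >= 3s
        return False
    if kind == "liveness" and probe["httpGet"]["path"] != "/healthz":
        return False                             # liveness stays dependency-free
    return True

live = {"httpGet": {"path": "/healthz", "port": 8080},
        "timeoutSeconds": 3, "periodSeconds": 15, "failureThreshold": 5}
print(probe_ok(live, "liveness"))  # True
```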
Prevention¶
- Monitoring: Add a probe-failure-rate alert that fires before a pod reaches its failure threshold. Alert on the rate of increase of kube_pod_container_status_restarts_total:
- alert: PodRestartRate
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
- Runbook: Liveness probes must never check external dependencies; they should only verify that the process is alive. Dependency health belongs in readiness probes. All probes must have a timeout of at least 3 seconds.
- Architecture: DNS configuration changes must be tested against all services' probe response times before deployment. Use FQDNs with trailing dots in health checks to bypass search-domain expansion.