Investigation: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning¶
Phase 1: Observability Investigation (Dead End)¶
Check the alert configuration:
$ kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodNotReady"
- alert: KubePodNotReady
  expr: |
    min_over_time(kube_pod_status_ready{condition="true"}[5m]) == 0
  for: 5m
  labels:
    severity: critical
The alert fires when a pod has been not ready for 5 consecutive minutes. But the pods are flapping: they oscillate between ready and not ready rather than staying down. min_over_time over a 5-minute window catches any moment of unreadiness, so the alert is behaving correctly; the pods genuinely are not ready periodically.
Check Alertmanager routing:
$ kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yml}' | base64 -d
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: ['alertname', 'namespace', 'pod']
Alertmanager config looks reasonable. The problem is not alert noise — the pods are genuinely restarting. Check why:
$ kubectl describe pod catalog-service-7c6d5e4f3-a1b2c -n prod | grep -A20 "Liveness\|Readiness"
Liveness: http-get http://:8080/health delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
...
Last State: Terminated
Reason: Completed
Exit Code: 0
Restart Count: 14
The liveness probe is killing the pod. timeout=1s with failure=3 — if the /health endpoint takes more than 1 second to respond 3 times in a row (within 30 seconds), the container is killed.
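That kill window is just arithmetic on the probe values. A quick sketch (values copied from the pod description above, not output from the cluster):

```shell
#!/bin/sh
# Probe settings from `kubectl describe pod`.
periodSeconds=10
failureThreshold=3
timeoutSeconds=1

# Each probe that takes longer than timeoutSeconds counts as a failure;
# after failureThreshold consecutive failures, kubelet kills the container.
kill_window=$((periodSeconds * failureThreshold))
echo "endpoint slower than ${timeoutSeconds}s for ${failureThreshold} probes in a row -> killed within ~${kill_window}s"
```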
Check the health endpoint response time:
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- \
curl -s -w "\n%{time_total}" http://localhost:8080/health
{"status": "healthy"}
0.842
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- \
curl -s -w "\n%{time_total}" http://localhost:8080/health
{"status": "healthy"}
1.204
The health endpoint is responding in 0.8-1.2 seconds. It is borderline on the 1-second timeout — sometimes passing, sometimes failing. But the service itself is working. Why is the health check slow?
The Pivot¶
Check what the health endpoint does:
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- \
curl -s http://localhost:8080/health | jq .
{
  "status": "healthy",
  "dependencies": {
    "database": "ok",
    "cache": "ok",
    "catalog-search": "ok"
  }
}
The health endpoint checks three dependencies, including catalog-search. Check its latency:
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- \
curl -s -w "\n%{time_total}" http://catalog-search.prod:9200/_cluster/health
{"cluster_name":"catalog","status":"green"...}
0.782
The Elasticsearch catalog-search cluster health call takes ~780ms, which accounts for most of the health check's response time. A call like this should take tens of milliseconds. To determine whether Elasticsearch itself is slow or something in the path is, check from another namespace:
$ kubectl run debug --rm -it --image=curlimages/curl -n monitoring -- \
curl -s -w "\n%{time_total}" http://catalog-search.prod:9200/_cluster/health
{"cluster_name":"catalog","status":"green"...}
0.045
45ms from the monitoring namespace. 780ms from the prod namespace. The latency is network-specific — something is adding latency to cross-service calls within the prod namespace.
Phase 2: Networking Investigation (Root Cause)¶
Check if there is a network delay between services in the prod namespace:
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- \
curl -s -w "dns: %{time_namelookup}, connect: %{time_connect}, total: %{time_total}\n" \
-o /dev/null http://catalog-search.prod:9200/_cluster/health
dns: 0.723, connect: 0.724, total: 0.782
DNS resolution is taking 723ms. That is the source of the latency. Check the pod's DNS config:
$ kubectl exec catalog-service-7c6d5e4f3-a1b2c -n prod -- cat /etc/resolv.conf
nameserver 10.96.0.10
search prod.svc.cluster.local svc.cluster.local cluster.local internal.company.com company.com
options ndots:5
Wait — internal.company.com and company.com are extra search domains. These were added by a recent DNS policy change:
$ kubectl get deployment catalog-service -n prod -o jsonpath='{.spec.template.spec.dnsConfig}' | jq .
{
  "searches": ["internal.company.com", "company.com"],
  "options": [{"name": "ndots", "value": "5"}]
}
These extra search domains were added by a platform team change to enable resolution of internal corporate hostnames. With ndots:5, a lookup of catalog-search.prod (only 1 dot, below the threshold) makes the resolver walk the entire search list before trying the name as given. With the 2 extra corporate domains, the list is 5 entries long, so each lookup tries 6 candidate names and issues up to 12 DNS queries (A + AAAA for each candidate). Under load, this fan-out saturates the CoreDNS pods and adds 700ms+ of DNS latency.
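The fan-out can be reproduced without touching the cluster. This sketch emulates the resolver's search-list expansion for the resolv.conf shown above (it models glibc's behavior for a name with fewer dots than ndots):

```shell
#!/bin/sh
# Emulate search-list expansion for the resolv.conf shown above.
name="catalog-search.prod"
search="prod.svc.cluster.local svc.cluster.local cluster.local internal.company.com company.com"
ndots=5

# Count the dots in the name; below the ndots threshold, the
# search list is walked before the name is tried as given.
dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))

candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for s in $search; do
    candidates="$candidates $name.$s"   # tried (and NXDOMAINed) first
  done
fi
candidates="$candidates $name"          # the literal name, tried last

n=0
for c in $candidates; do
  n=$((n + 1))
  echo "attempt $n: $c"
done
echo "names tried: $n; DNS queries (A + AAAA each): $((n * 2))"
```

Six candidate names, twelve queries, for what should be a single cached lookup.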
Domain Bridge: Why This Crossed Domains¶
Key insight: The symptom was an alert storm from pod readiness flapping (observability), the root cause was DNS latency from extra search domains slowing the health check (networking), and the fix requires tuning Kubernetes probe configuration and DNS settings (kubernetes_ops). This pattern is common because health check probes have tight timeouts: any network latency in the health check path, including DNS resolution, can cause probes to fail intermittently. DNS configuration changes affect all service-to-service calls, but they become most visible when they push probe response times past the timeout threshold.
Root Cause¶
A platform team DNS configuration change added two corporate search domains to pods' /etc/resolv.conf. With ndots:5, a short service name like catalog-search.prod is expanded through all 5 search domains before being tried as given, multiplying every lookup into up to 12 DNS queries. This added ~700ms of DNS latency to the health check endpoint, which performs a dependency check on Elasticsearch. With a 1-second probe timeout, the health check intermittently fails, causing kubelet to kill and restart the pods in a rapid cycle.
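A minimal sketch of the remediation the title points at (the numbers here are illustrative assumptions, not the change that shipped): widen the probe timeout so a slow dependency check cannot kill an otherwise healthy pod, and lower ndots so a dotted service name is tried as an absolute name first instead of walking the corporate search list.

```yaml
# Probe tuning (container level in the deployment spec). Assumed values.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 3        # was 1s; headroom over the ~1.2s worst case observed
  periodSeconds: 10
  failureThreshold: 3

# DNS tuning (pod level). With ndots:1, "catalog-search.prod" (1 dot) is
# tried as an absolute name first; if that fails the search list is still
# walked, so using the FQDN "catalog-search.prod.svc.cluster.local." in the
# application avoids the search list entirely.
dnsConfig:
  options:
  - name: ndots
    value: "1"
```

A complementary fix, outside probe tuning, is to keep slow dependency checks out of the liveness endpoint so that a slow downstream degrades readiness rather than triggering restarts.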