
Remediation: Alert Storm Caused by Flapping Health Checks (Fix: Probe Tuning)

Immediate Fix (Kubernetes Ops — Domain C)

The fix has two parts: tune the Kubernetes probe configuration and correct the pod DNS settings.

Step 1: Increase the probe timeout (immediate relief)

$ kubectl patch deployment catalog-service -n prod --type=json \
    -p='[
      {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":5},
      {"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":5}
    ]'
deployment.apps/catalog-service patched
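
The same change can be expressed as a strategic-merge patch if you prefer a reviewable file over inline JSON Patch. The container name below is an assumption (strategic merge matches list entries by name):

```yaml
# probe-timeout-patch.yaml — apply with:
#   kubectl patch deployment catalog-service -n prod --patch-file probe-timeout-patch.yaml
spec:
  template:
    spec:
      containers:
        - name: catalog-service   # assumed container name; the merge keys on it
          livenessProbe:
            timeoutSeconds: 5
          readinessProbe:
            timeoutSeconds: 5
```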

Step 2: Fix the DNS configuration

# Replace dnsConfig wholesale: dropping the "searches" field removes the extra
# search domains, and ndots is reduced from 5 to 2
$ kubectl patch deployment catalog-service -n prod --type=json \
    -p='[
      {"op":"replace","path":"/spec/template/spec/dnsConfig","value":{
        "options":[{"name":"ndots","value":"2"}]
      }}
    ]'
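
After the patch rolls out, the pod's /etc/resolv.conf should look roughly like this (check with `kubectl exec <pod> -n prod -- cat /etc/resolv.conf`; the nameserver IP and the kubelet-injected search list below are placeholders for your cluster's values, and the dnsConfig ndots option takes precedence over the kubelet default):

```
nameserver 10.96.0.10
search prod.svc.cluster.local svc.cluster.local cluster.local
options ndots:2
```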

Step 3: Fix the health check to not depend on external services

The liveness probe should check the process health, not dependency health. Move dependency checks to the readiness probe only:

# Correct probe design
livenessProbe:
  httpGet:
    path: /healthz     # Simple "am I alive" check
    port: 8080
  timeoutSeconds: 3
  periodSeconds: 15
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /ready       # Dependency check (database, cache, etc.)
    port: 8080
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
# Roll out through Helm rather than applying the raw template file (files under
# templates/ contain un-rendered Go templating and will not parse as manifests);
# release name assumed to match the chart
$ helm upgrade catalog-service devops/helm/grokdevops -n prod
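
The split above can be sketched locally as a minimal simulation of the two endpoints' semantics; `dep_up` stands in for a real database/cache check and is an assumption, not part of the service:

```shell
#!/bin/sh
# Liveness ignores dependencies; readiness is gated on them.
dep_up=false

healthz() { echo ok; }      # "am I alive" only: always ok while the process runs
ready() { [ "$dep_up" = true ] && echo ok || echo unavailable; }

healthz    # prints: ok          -> pod is NOT restarted
ready      # prints: unavailable -> pod is pulled from Service endpoints instead
```

When the dependency comes back (`dep_up=true`), `ready` flips to ok and traffic resumes, with no restart ever having occurred.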

Verification

Domain A (Observability) — Alert storm resolved

# Check active alerts
$ curl -s http://alertmanager.monitoring:9093/api/v2/alerts | jq '[.[] | select(.labels.alertname=="KubePodNotReady")] | length'
0
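
The jq filter can be sanity-checked offline against a canned Alertmanager payload (the sample JSON below is made up for illustration):

```shell
# Count only the KubePodNotReady alerts in a sample /api/v2/alerts response
alerts='[{"labels":{"alertname":"KubePodNotReady"}},{"labels":{"alertname":"HighDNSLatency"}}]'
echo "$alerts" | jq '[.[] | select(.labels.alertname=="KubePodNotReady")] | length'
# prints: 1
```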

# Check pod restart count (should stabilize)
$ kubectl get pods -n prod -l app=catalog-service
NAME                               READY   STATUS    RESTARTS   AGE
catalog-service-8d7e6f5a4-j9k8l   1/1     Running   0          5m
catalog-service-8d7e6f5a4-m2n1o   1/1     Running   0          5m
catalog-service-8d7e6f5a4-p4q3r   1/1     Running   0          5m

Domain B (Networking) — DNS latency reduced

$ kubectl exec catalog-service-8d7e6f5a4-j9k8l -n prod -- \
    curl -s -w "dns: %{time_namelookup}, total: %{time_total}\n" \
    -o /dev/null http://catalog-search.prod:9200/_cluster/health
dns: 0.004, total: 0.052

DNS resolution is back to 4 ms.

Domain C (Kubernetes) — Probes configured correctly

$ kubectl get deployment catalog-service -n prod -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}' | jq .
{
  "httpGet": {"path": "/healthz", "port": 8080},
  "timeoutSeconds": 3,
  "periodSeconds": 15,
  "failureThreshold": 5
}

Prevention

  • Monitoring: Add a pod restart rate alert that fires while restarts are still accumulating, before they cascade into an alert storm. Alert on an increase in the rate of kube_pod_container_status_restarts_total:

- alert: PodRestartRate
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  • Runbook: Liveness probes must never check external dependencies — they should only verify the process is alive. Dependency health belongs in readiness probes. All probes must have a timeout of at least 3 seconds.

  • Architecture: DNS configuration changes must be tested against all services' probe response times before deployment. Use FQDN with trailing dots in health checks to bypass search domain expansion.
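
Why the trailing dot matters can be shown with a small local simulation of resolver search-list expansion (the real logic lives in glibc/musl; the search list below is the assumed pod default). A name with fewer dots than ndots is tried against every search domain before being tried as-is, multiplying queries; a trailing dot marks the name absolute and skips expansion entirely:

```shell
#!/bin/sh
# Simulate which DNS queries the resolver would issue for a name.
expand() {
  name=$1
  ndots=$2
  case $name in
    *.) echo "${name%.}"; return ;;      # absolute FQDN: exactly one query
  esac
  dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')
  if [ "$dots" -lt "$ndots" ]; then
    for d in prod.svc.cluster.local svc.cluster.local cluster.local; do
      echo "$name.$d"                    # tried (and failed) before the bare name
    done
  fi
  echo "$name"
}

expand api.stripe.com 5                            # 3 wasted queries, then the real one
expand api.stripe.com 2                            # 2 dots >= ndots=2: one query
expand catalog-search.prod.svc.cluster.local. 2    # trailing dot: one query
```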