Solution: Lab Runtime 01 -- Readiness Probe Failure¶
SPOILER WARNING: Try to solve it yourself first. Use hints progressively.
Hint Ladder¶
Hint 1: The problem is with how Kubernetes determines if a pod is ready to receive traffic. What mechanism does K8s use for that?
Hint 2: Check the readiness probe configuration on the deployment. What HTTP path is it hitting? Does that path exist?
Hint 3: The readiness probe was changed to hit /nonexistent. K8s gets a non-200 response, so it marks the pod as not ready. Check with: kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
Hint 4: Fix the probe path back to /health. You can either patch the deployment directly or use ./fix.sh.
Minimal Solution¶
kubectl patch deployment grokdevops -n grokdevops --type=json \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/health"}]'
kubectl rollout status deployment/grokdevops -n grokdevops --timeout=120s
Explain¶
Symptom: New pods show STATUS Running but READY 0/1. The rollout stalls, and the old pods continue serving traffic.
Evidence: kubectl describe pod shows Readiness probe failed: HTTP probe failed with statuscode: 404. The readiness probe is configured to hit /nonexistent which returns 404.
Root cause: The readiness probe HTTP path was changed from /health to /nonexistent. K8s readiness probes determine whether a pod should receive traffic via the Service. When the probe fails, K8s removes the pod from the Service endpoints, so no traffic is routed to it. During a rolling update, this means new pods never become ready, and the rollout controller won't terminate old pods.
Key insight: Readiness probe failures don't restart pods (that's liveness probes). They only remove pods from service endpoints. This is why the pod shows Running but 0/1 Ready.
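For reference, a healthy probe block on the container might look like the following. This is a sketch, not the lab's actual manifest — the port and timing values are assumptions:

```yaml
# Sketch of a correct probe config; port and timing values are assumed.
readinessProbe:
  httpGet:
    path: /health       # must be a route the app actually serves
    port: 8080
  periodSeconds: 5
  failureThreshold: 3   # after 3 failures the pod is removed from Service endpoints
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # unlike readiness, failures here restart the container
```

The contrast between the two blocks is the key insight above: the same failing endpoint removes the pod from endpoints under readinessProbe, but restarts the container under livenessProbe.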
Prevent¶
- Pin readiness probe config in Helm values, not hardcoded in templates
- Add a CI check that validates probe endpoints exist in the app
- Use `helm upgrade --dry-run` before applying to catch bad probe configs
- Set `progressDeadlineSeconds` to a reasonable value so stuck rollouts are detected quickly
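The CI check suggested above could be sketched as a small shell gate over rendered manifests. Everything here is illustrative, not taken from the lab: the route list, file path, and YAML layout are assumptions, and in a real pipeline the manifest would come from `helm template` rather than a heredoc.

```shell
#!/usr/bin/env sh
# Hypothetical CI gate: fail the build if a readiness probe path in the
# rendered manifest is not a route the app actually serves.

# For illustration, write a sample rendered manifest (normally the output
# of `helm template`):
cat > /tmp/rendered.yaml <<'EOF'
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
EOF

KNOWN_ROUTES=" /health /ready /metrics "   # assumed app routes

# Pull every "path:" that appears shortly after a "readinessProbe:" key.
paths=$(grep -A3 'readinessProbe:' /tmp/rendered.yaml | awk '/path:/ {print $2}')

for p in $paths; do
  case "$KNOWN_ROUTES" in
    *" $p "*) echo "OK: readiness probe path $p is served" ;;
    *) echo "FAIL: readiness probe path $p not served by app" >&2; exit 1 ;;
  esac
done
```

Running this against a manifest whose probe hits /nonexistent would exit non-zero and fail the build, catching this lab's bug before it reaches the cluster.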