Portal | Level: L1: Foundations | Topics: Probes (Liveness/Readiness), Kubernetes Core | Domain: Kubernetes
Runbook: Readiness Probe Failed¶
Symptoms¶
- Pod is Running but not Ready (0/1)
- Service returns no endpoints
- kubectl describe pod shows "Readiness probe failed"
- Rollout appears stuck
Fast Triage¶
```
kubectl get pods -n grokdevops -o wide
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A10 "Conditions\|Readiness"
kubectl get endpoints grokdevops -n grokdevops
kubectl logs -n grokdevops deploy/grokdevops
```
Likely Causes (ranked)¶
- Wrong probe path — endpoint doesn't exist or returns non-200
- App not listening on expected port — check containerPort vs probe port
- Slow startup — app needs more initialDelaySeconds
- Dependency not ready — app waiting for database/external service
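All four causes map to fields in the pod spec. A minimal sketch of the relevant container spec (the path, port, and timing values below are illustrative, not taken from this chart):

```yaml
containers:
  - name: grokdevops
    ports:
      - containerPort: 8080   # the port the app actually listens on
    readinessProbe:
      httpGet:
        path: /healthz        # must exist and return 2xx
        port: 8080            # must match containerPort (or a named port)
      initialDelaySeconds: 10 # raise this if the app starts slowly
      periodSeconds: 5
      failureThreshold: 3
```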
Evidence Interpretation¶
What bad looks like:
- 0/1 Ready — the container is running but the readiness probe is failing, so Kubernetes removes it from Service endpoints.
- kubectl get endpoints grokdevops -n grokdevops will show an empty or missing address list.
- During a rolling update, old pods keep serving traffic because new pods never become Ready; the rollout appears stuck.
- kubectl describe pod will show repeated "Readiness probe failed: …" events with the HTTP status or connection error.
Fix Steps¶
- Check what the probe is configured to hit:
- Test the endpoint manually:
- Fix the probe path in Helm values or patch:
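The three steps above can be sketched as follows. This is a hedged example: the container index, port, and /healthz path are assumptions to adapt to your chart, and step 3 shows a live patch (the equivalent Helm value is the durable fix):

```shell
# 1. Check what the probe is configured to hit (first container assumed)
kubectl get deploy grokdevops -n grokdevops \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'

# 2. Test the endpoint manually from inside the pod
#    (use wget or curl, whichever the image ships; path/port illustrative)
kubectl exec -n grokdevops deploy/grokdevops -- \
  wget -qO- http://localhost:8080/healthz

# 3. Patch the probe path in place as a stopgap
kubectl patch deploy grokdevops -n grokdevops --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/healthz"}]'
```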
Verification¶
```
kubectl get pods -n grokdevops                    # expect 1/1 Ready
kubectl get endpoints grokdevops -n grokdevops    # expect IP addresses listed
```
Cleanup¶
Redeploy from Helm to ensure persistent fix:
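A sketch of the redeploy, assuming a local chart path and a values key for the probe path (both are hypothetical; check your chart's values.yaml for the actual key):

```shell
helm upgrade grokdevops ./charts/grokdevops -n grokdevops \
  --set readinessProbe.httpGet.path=/healthz \
  --wait
kubectl rollout status deploy/grokdevops -n grokdevops
```

Using --wait makes helm block until the new pods are Ready, so a still-broken probe fails the upgrade instead of silently leaving a stuck rollout.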
Unknown Unknowns¶
- Readiness and liveness probes do different things: readiness removes the pod from endpoints (no traffic), liveness restarts the container. Confusing them leads to wrong fixes.
- The Deployment has a progressDeadlineSeconds (default 600s). If new pods aren't Ready within that window the rollout is marked failed — but old pods still serve.
- Old ReplicaSet pods keep running and handling traffic during a stuck rollout; users may not notice the failure immediately.
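One way to surface a rollout that has exceeded progressDeadlineSeconds (deployment name from this runbook; --timeout here only bounds how long the status command waits, it does not change the deadline):

```shell
kubectl rollout status deploy/grokdevops -n grokdevops --timeout=60s

# When the deadline is exceeded, the Progressing condition's reason
# becomes ProgressDeadlineExceeded:
kubectl get deploy grokdevops -n grokdevops \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
```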
Pitfalls¶
[!WARNING] Readiness and liveness probes serve opposite purposes. Readiness removes the pod from traffic; liveness restarts the container. Adding a liveness probe to "fix" a readiness failure causes restart loops and potential downtime instead of graceful traffic removal.
- Deleting pods — the Deployment recreates them with the same failing probe config. Fix the probe or the app first.
- Not testing the endpoint from inside the pod — the probe runs inside the cluster network; test with kubectl exec ... wget/curl, not from your laptop.
See Also¶
- training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
- training/interview-scenarios/01-deployment-stuck-progressing.md
- training/interactive/incidents/scenarios/readiness-probe-wrong-path.sh
Wiki Navigation¶
Related Content¶
- Interview: Deployment Stuck Progressing (Scenario, L2) — Kubernetes Core, Probes (Liveness/Readiness)
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Lab: Readiness Probe Failure (CLI) (Lab, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Skillcheck: Kubernetes (Assessment, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Track: Kubernetes Core (Reference, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
Pages that link here¶
- Decision Tree: Deployment Is Stuck
- Decision Tree: Service Returning 5xx Errors
- Kubernetes - Skill Check
- Kubernetes Ops Domain
- Kubernetes_Core
- Level 3: Production Kubernetes
- Operational Runbooks
- Scenario: Deployment Stuck Progressing
- Solution: Lab Runtime 01 -- Readiness Probe Failure
- Track: Incident Response