Portal | Level: L2: Operations | Topics: Kubernetes Core, Probes (Liveness/Readiness) | Domain: Kubernetes
Scenario: Deployment Stuck Progressing¶
The Prompt¶
"We deployed a new version of our application 20 minutes ago. The deployment shows `Progressing` but never reaches `Available`. Users are still on the old version. What do you do?"
Initial Report¶
On-call page: "Deploy of grokdevops v2.4.1 has been stuck for 20 minutes. Users are still hitting v2.4.0. No errors on the old version but the new pods never become Ready."
Constraints¶
- Time pressure: You have 15 minutes before the next escalation to the VP of Engineering.
- Limited access: You have read-only access to the production cluster; any rollback requires a senior SRE to approve. You cannot SSH into nodes directly.
Observable Evidence¶
- Dashboard: Deployment replica count shows desired=3, ready=2, updated=3, available=2. Rollout progress bar has been stuck at 66%.
- Pod status: New pods show `0/1 Ready` with multiple restarts, or are stuck in `ContainerCreating`.
- Logs: `kubectl logs` on new pods may show application startup errors, or may be empty if the container crashes before writing output. Events show `Readiness probe failed: HTTP probe failed with statuscode: 503`.
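The `503` event above comes from the kubelet polling an HTTP readiness probe. A typical probe stanza looks like the sketch below — the path, port, and timings are illustrative assumptions, not this app's confirmed config:

```yaml
# Hypothetical readinessProbe for the grokdevops container.
# The kubelet GETs this endpoint; any non-2xx/3xx response (here 503)
# marks the pod NotReady and keeps it out of Service endpoints.
readinessProbe:
  httpGet:
    path: /health          # assumed endpoint; verify against the app
    port: 8000
  initialDelaySeconds: 5   # too low for a slow-starting app => early 503s
  periodSeconds: 10
  failureThreshold: 3
```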
Expected Investigation Path¶
# 1. Check deployment status
kubectl get deploy -n grokdevops
kubectl rollout status deployment/grokdevops -n grokdevops
# 2. Check pod status — look for pods stuck in 0/1 Ready
kubectl get pods -n grokdevops
# 3. Describe the new pods — look for probe failures
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A10 Events
# 4. Check readiness probe config
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | python3 -m json.tool
# 5. Test the probe endpoint manually (assumes wget is in the image; use curl if not)
kubectl exec -n grokdevops <new-pod> -- wget -qO- --timeout=3 http://localhost:8000/health
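If those checks come back clean, two follow-ups are worth the remaining minutes. The command shapes are standard kubectl; the pod placeholder and namespace are carried over from the steps above:

```shell
# 6. If current logs are empty, read the previous (crashed) container's logs
kubectl logs -n grokdevops <new-pod> --previous

# 7. Rule out an image problem — a Waiting reason of ImagePullBackOff or
#    ErrImagePull means this was never a probe issue
kubectl get pods -n grokdevops \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```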
Strong Answer¶
"First, I'd check the deployment status and see if new ReplicaSet pods are being created. If they exist but aren't becoming Ready, I'd describe them to check events. The most common cause is a readiness probe failure — either the endpoint changed, the app is taking too long to start, or there's a config issue preventing startup. I'd test the probe endpoint directly from inside the pod. If the probe config is wrong, I'd either patch it or roll back. If startup is slow, I'd adjust initialDelaySeconds. Throughout, I'd check the rollout strategy — if it's RollingUpdate, old pods stay serving traffic, so this is degraded but not an outage."
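The two remediations mentioned in the answer map to concrete commands. A sketch, assuming the deployment name used above and that a senior SRE has approved the write (recall the read-only constraint):

```shell
# Option A: roll back to the previous ReplicaSet (v2.4.0)
kubectl rollout undo deployment/grokdevops -n grokdevops

# Option B: give a slow-starting app more time before the first probe
# (path assumes the probe lives on the first container, as in step 4 above)
kubectl patch deployment grokdevops -n grokdevops --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
   "value": 30}
]'

# Either way, watch the rollout converge
kubectl rollout status deployment/grokdevops -n grokdevops
```

Note that Option B triggers a fresh rollout, since it changes the pod template.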
Common Traps¶
- Jumping to rollback without diagnosis — you should understand WHY before reverting
- Ignoring the rollout strategy — knowing if old pods are still serving matters
- Forgetting `--previous` logs — if the pod crashed before becoming ready, current logs may be empty
- Not checking if the image exists — could be `ImagePullBackOff`, not a probe issue
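The rollout-strategy trap is worth quantifying. With the `RollingUpdate` strategy, the surge settings decide whether old pods keep serving while new ones fail their probes. The values below are the Kubernetes defaults, not this deployment's confirmed settings:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # for 3 replicas: up to 1 extra pod during rollout
    maxUnavailable: 25%  # for 3 replicas: rounds down to 0 pods unavailable
```

With `maxUnavailable` effectively 0, old Ready pods are only removed once replacements pass their readiness probe, which is why a stuck rollout here is degraded rather than an outage.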
Practice and Links¶
- Runtime lab: `training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/`
- Runbook: `training/library/runbooks/kubernetes/readiness_probe_failed.md`
- Drills: `training/library/drills/kubectl_drills.md` — Drill 4 (probe config), Drill 8 (rollout status)
Wiki Navigation¶
Next Steps¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2)
Related Content¶
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Lab: Readiness Probe Failure (CLI) (Lab, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Runbook: Readiness Probe Failed (Runbook, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Skillcheck: Kubernetes (Assessment, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Track: Kubernetes Core (Reference, L1) — Kubernetes Core, Probes (Liveness/Readiness)
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
Pages that link here¶
- Adversarial Interview Gauntlet
- Interview Gauntlet: Deploy Succeeded but Old Version Visible
- Interview Gauntlet: Kubernetes or Simpler Orchestrator?
- Interview Gauntlet: Learning Something Quickly
- Interview Gauntlet: Monolith or Microservices?
- Interview Gauntlet: Pods Crash-Looping
- Interview Gauntlet: Your Approach to On-Call
- Interview Scenarios
- Kubernetes - Skill Check
- Kubernetes_Core
- Kustomize - Street-Level Ops
- Level 5: SRE & Incident Response
- Runbook: Readiness Probe Failed
- Solution: Lab Runtime 01 -- Readiness Probe Failure
- Track: Incident Response