Portal | Level: L2: Operations | Topics: Kubernetes Core, Probes (Liveness/Readiness) | Domain: Kubernetes

Scenario: Deployment Stuck Progressing

The Prompt

"We deployed a new version of our application 20 minutes ago. The deployment shows Progressing but never reaches Available. Users are still on the old version. What do you do?"

Initial Report

On-call page: "Deploy of grokdevops v2.4.1 has been stuck for 20 minutes. Users are still hitting v2.4.0. No errors on the old version but the new pods never become Ready."

Constraints

  • Time pressure: You have 15 minutes before the next escalation to the VP of Engineering.
  • Limited access: You have read-only access to the production cluster; any rollback requires a senior SRE to approve. You cannot SSH into nodes directly.

Observable Evidence

  • Dashboard: Deployment replica count shows desired=3, ready=2, updated=3, available=2. Rollout progress bar has been stuck at 66%.
  • Pod status: New pods show 0/1 Ready with multiple restarts, or are stuck in ContainerCreating.
  • Logs: kubectl logs on new pods may show application startup errors, or may be empty if the container crashes before writing output. Events show Readiness probe failed: HTTP probe failed with statuscode: 503.
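One plausible reading of the 66% figure: the dashboard is reporting ready pods over desired pods, using the counts above:

```shell
# Rollout progress as the dashboard likely computes it: ready / desired.
desired=3
ready=2
echo "$((ready * 100 / desired))%"   # → 66%
```

Two of three replicas are still serving, so this is a degraded rollout rather than a full outage.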

Expected Investigation Path

# 1. Check deployment status
kubectl get deploy -n grokdevops
kubectl rollout status deployment/grokdevops -n grokdevops

# 2. Check pod status — look for pods stuck in 0/1 Ready
kubectl get pods -n grokdevops

# 3. Describe the new pods — look for probe failures
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A10 Events

# 4. Check readiness probe config
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | python3 -m json.tool

# 5. Test the probe endpoint manually
kubectl exec -n grokdevops <new-pod> -- wget -qO- --timeout=3 http://localhost:8000/health
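If the new pods are restarting (rather than just sitting NotReady), a liveness probe may be killing a slow-starting container. The kill deadline falls out of the timing fields returned by step 4 — the numbers below are assumptions; substitute the real config:

```shell
# Hypothetical probe settings — replace with the values from step 4.
initial_delay=5       # initialDelaySeconds
period=10             # periodSeconds
failure_threshold=3   # failureThreshold

# A container whose liveness probe keeps failing is restarted after roughly:
echo "killed after ~$((initial_delay + period * failure_threshold))s"
```

If the app needs longer than this window to start answering /health, it is restarted before it can ever become Ready; raising initialDelaySeconds (or adding a startupProbe) is the usual fix.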

Strong Answer

"First, I'd check the deployment status and see if new ReplicaSet pods are being created. If they exist but aren't becoming Ready, I'd describe them to check events. The most common cause is a readiness probe failure — either the endpoint changed, the app is taking too long to start, or there's a config issue preventing startup. I'd test the probe endpoint directly from inside the pod. If the probe config is wrong, I'd either patch it or roll back. If startup is slow, I'd adjust initialDelaySeconds. Throughout, I'd check the rollout strategy — if it's RollingUpdate, old pods stay serving traffic, so this is degraded but not an outage."

Common Traps

  • Jumping to rollback without diagnosis — you should understand WHY before reverting
  • Ignoring the rollout strategy — knowing if old pods are still serving matters
  • Forgetting kubectl logs --previous — if the pod crashed before becoming Ready, the current container's logs may be empty
  • Not checking if the image exists — could be ImagePullBackOff, not a probe issue

Related Resources

  • Runtime lab: training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
  • Runbook: training/library/runbooks/kubernetes/readiness_probe_failed.md
  • Drills: training/library/drills/kubectl_drills.md — Drill 4 (probe config), Drill 8 (rollout status)
