Portal | Level: L2: Operations | Topics: Kubernetes Core, Probes (Liveness/Readiness) | Domain: Kubernetes

Scenario: Deployment Stuck Progressing

The Prompt

"We deployed a new version of our application 20 minutes ago. The deployment shows Progressing but never reaches Available. Users are still on the old version. What do you do?"

Initial Report

On-call page: "Deploy of grokdevops v2.4.1 has been stuck for 20 minutes. Users are still hitting v2.4.0. No errors on the old version but the new pods never become Ready."

Constraints

  • Time pressure: You have 15 minutes before the next escalation to the VP of Engineering.
  • Limited access: You have read-only access to the production cluster; any rollback requires a senior SRE to approve. You cannot SSH into nodes directly.

Observable Evidence

  • Dashboard: Deployment replica count shows desired=3, ready=2, updated=3, available=2. Rollout progress bar has been stuck at 66%.
  • Pod status: New pods show 0/1 Ready with multiple restarts, or are stuck in ContainerCreating.
  • Logs: kubectl logs on new pods may show application startup errors, or may be empty if the container crashes before writing output. Events show Readiness probe failed: HTTP probe failed with statuscode: 503.
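One plausible reading of the 66% figure: the dashboard is reporting ready pods over desired pods, using the counts above:

```shell
# Rollout progress as the dashboard likely computes it: ready / desired.
desired=3
ready=2
echo "$((ready * 100 / desired))%"   # → 66%
```

Two of three replicas are still serving, so this is a degraded rollout rather than a full outage.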

Expected Investigation Path

# 1. Check deployment status
kubectl get deploy -n grokdevops
kubectl rollout status deployment/grokdevops -n grokdevops

# 2. Check pod status — look for pods stuck in 0/1 Ready
kubectl get pods -n grokdevops

# 3. Describe the new pods — look for probe failures
kubectl describe pod -n grokdevops -l app.kubernetes.io/name=grokdevops | grep -A10 Events

# 4. Check readiness probe config
kubectl get deploy grokdevops -n grokdevops -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}' | python3 -m json.tool

# 5. Test the probe endpoint manually
kubectl exec -n grokdevops <new-pod> -- wget -qO- --timeout=3 http://localhost:8000/health
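If the new pods are restarting (rather than just sitting NotReady), a liveness probe may be killing a slow-starting container. The kill deadline falls out of the timing fields returned by step 4 — the numbers below are assumptions; substitute the real config:

```shell
# Hypothetical probe settings — replace with the values from step 4.
initial_delay=5       # initialDelaySeconds
period=10             # periodSeconds
failure_threshold=3   # failureThreshold

# A container whose liveness probe keeps failing is restarted after roughly:
echo "killed after ~$((initial_delay + period * failure_threshold))s"
```

If the app needs longer than this window to start answering /health, it is restarted before it can ever become Ready; raising initialDelaySeconds (or adding a startupProbe) is the usual fix.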

Strong Answer

"First, I'd check the deployment status and see if new ReplicaSet pods are being created. If they exist but aren't becoming Ready, I'd describe them to check events. The most common cause is a readiness probe failure — either the endpoint changed, the app is taking too long to start, or there's a config issue preventing startup. I'd test the probe endpoint directly from inside the pod. If the probe config is wrong, I'd either patch it or roll back. If startup is slow, I'd adjust initialDelaySeconds. Throughout, I'd check the rollout strategy — if it's RollingUpdate, old pods stay serving traffic, so this is degraded but not an outage."

Common Traps

  • Jumping to rollback without diagnosis — you should understand WHY before reverting
  • Ignoring the rollout strategy — knowing if old pods are still serving matters
  • Forgetting kubectl logs --previous — if the pod crashed before becoming Ready, the current container's logs may be empty
  • Not checking if the image exists — could be ImagePullBackOff, not a probe issue

Related Resources

  • Runtime lab: training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
  • Runbook: training/library/runbooks/kubernetes/readiness_probe_failed.md
  • Drills: training/library/drills/kubectl_drills.md — Drill 4 (probe config), Drill 8 (rollout status)
