Runbook: Deployment Stuck / Rollout Stalled
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | kube_deployment_status_replicas_available < kube_deployment_status_replicas_desired for >10 min |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get deployments -n <NAMESPACE> -o wide
# Check namespace quota usage at the same time
kubectl describe namespace <NAMESPACE>

If the output shows a single stuck deployment → continue with the steps below.
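The scope check can also be scripted. A minimal sketch, assuming the standard `kubectl get deployments` column layout; the here-string and deployment names below are hypothetical stand-ins for live output:

```shell
#!/bin/sh
# Count deployments whose READY column (x/y) shows fewer ready than desired.
# On a live cluster, replace the sample with: kubectl get deployments -n <NAMESPACE>
sample='NAME      READY   UP-TO-DATE   AVAILABLE   AGE
my-app    1/3     1            1           2d
other     2/2     2            2           9d'

stuck=$(printf '%s\n' "$sample" | awk 'NR > 1 { split($2, r, "/"); if (r[1] < r[2]) print $1 }')
echo "stuck deployments: ${stuck:-none}"
```

If more than one deployment shows up, suspect a namespace-wide cause (quota, node capacity) rather than a single bad rollout.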
Step 1: Check Rollout Status

Why: The rollout status command gives a human-readable summary of what the deployment controller is waiting for, which narrows the investigation immediately.

kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get replicasets -n <NAMESPACE> -l app=<APP_LABEL> --sort-by='.metadata.creationTimestamp'

Expected output (stalled rollout):
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "my-app" rollout to finish: 2 old replicas are pending termination...

NAME              DESIRED   CURRENT   READY   AGE
my-app-7d9f8b6c4  3         3         3       2d    <-- old, healthy
my-app-5c4a2b1d9  3         3         0       8m    <-- new, stuck

For a quick ready-vs-desired summary, run kubectl get deployments -n <NAMESPACE>.
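One more signal worth reading at this point is the deployment's Progressing condition. A sketch, with a sample value standing in for the live jsonpath output:

```shell
#!/bin/sh
# After progressDeadlineSeconds (default 600s) without progress, the
# Progressing condition's reason becomes ProgressDeadlineExceeded. Live check:
#   kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
reason='ProgressDeadlineExceeded'   # sample value, not a live query

if [ "$reason" = "ProgressDeadlineExceeded" ]; then
  verdict="controller has stopped retrying: inspect new pods (Step 2) or roll back (Step 6)"
else
  verdict="rollout still progressing: reason=$reason"
fi
echo "$verdict"
```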
Step 2: Check New Pod Events

Why: The new pods spawned by the rollout will have events explaining why they are not becoming Ready.

# Get the pods from the new ReplicaSet
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
# Then describe one of the new (not-ready) pods
kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE>
# Recent events across the namespace, newest last
kubectl get events -n <NAMESPACE> --sort-by='.lastTimestamp' | tail -20

Expected output (image pull failure example):
Events:
  Warning  Failed   2m   kubelet  Failed to pull image "registry.example.com/my-app:v2.0.0": ...
  Warning  BackOff  90s  kubelet  Back-off pulling image "registry.example.com/my-app:v2.0.0"

If the events are inconclusive, check the container logs: kubectl logs <NEW_POD_NAME> -n <NAMESPACE>.
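The waiting reason from the new pod's first container is often enough to route the rest of the investigation. A triage sketch; the `route` helper is hypothetical, and the reason would come from the jsonpath shown in the comment:

```shell
#!/bin/sh
# Map a container's waiting reason to the next runbook step. Live source:
#   kubectl get pod <NEW_POD_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
route() {
  case "$1" in
    ErrImagePull|ImagePullBackOff) echo "image pull problem: go to Step 3" ;;
    CreateContainerConfigError)    echo "missing ConfigMap/Secret: recheck pod events" ;;
    '')                            echo "no waiting reason: if Pending, check quotas (Step 4)" ;;
    *)                             echo "unrecognized reason, inspect manually: $1" ;;
  esac
}

route "ImagePullBackOff"
```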
Step 3: Check Image Pull
Why: A bad image tag or missing registry credentials is one of the most common causes of stuck rollouts and is trivially fixed.
# Check the image being pulled
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[*].image}'
# Check if the imagePullSecret exists
kubectl get secrets -n <NAMESPACE> | grep registry
# Try to verify the image exists (if you have registry access)
docker manifest inspect <IMAGE>:<TAG>
# If secret is missing or wrong, re-create it
kubectl create secret docker-registry <SECRET_NAME> \
--docker-server=<REGISTRY_URL> \
--docker-username=<USERNAME> \
--docker-password=<PASSWORD> \
-n <NAMESPACE>
Success looks like: new pod events show Pulled and Started instead of Failed and BackOff.
If this fails: The image tag genuinely does not exist in the registry — rollback (Step 6) and notify the developer.
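Beyond checking that the secret exists, it helps to confirm it actually holds credentials for the registry in the failing image reference. A sketch, with a sample payload standing in for the decoded secret; the registry URL and credentials are illustrative:

```shell
#!/bin/sh
# Decode the pull secret and confirm it covers the registry. Live command:
#   kubectl get secret <SECRET_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
payload='{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}'   # sample payload

registry='registry.example.com'   # registry from the failing image reference
if printf '%s' "$payload" | grep -q "\"$registry\""; then
  echo "pull secret covers $registry"
else
  echo "pull secret does NOT cover $registry: recreate it (see above)"
fi
```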
Step 4: Check Resource Quotas

Why: Namespace resource quotas are a silent killer of rollouts. If the namespace has hit its CPU or memory quota, new pods will stay Pending indefinitely with no obvious error in pod events.

# Check namespace quota usage
kubectl describe namespace <NAMESPACE> | grep -A 20 "Resource Quotas"
# Or directly:
kubectl get resourcequota -n <NAMESPACE> -o yaml
# Check current resource usage vs limits
kubectl top pods -n <NAMESPACE>

Expected output (quota nearly exhausted):
Resource Quotas
  Name:     default-quota
  Resource  Used   Hard
  --------  ----   ----
  cpu       3900m  4000m   <-- nearly full
  memory    7.8Gi  8Gi     <-- nearly full
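The arithmetic behind those "nearly full" annotations: a rolling update with the default maxSurge of 25% needs quota headroom for at least one extra pod. A sketch using the sample numbers above; the 200m figure is an assumed per-pod CPU request:

```shell
#!/bin/sh
# Will one surge pod fit under the CPU quota? Values mirror the sample
# output: 3900m used of a 4000m hard limit; 200m is an assumed pod request.
used_m=3900
hard_m=4000
pod_request_m=200

if [ $((used_m + pod_request_m)) -gt "$hard_m" ]; then
  echo "surge pod (+${pod_request_m}m) would exceed the quota: it stays Pending"
else
  echo "headroom available: quota is not the blocker"
fi
```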
Step 5: Check Pod Disruption Budgets

Why: A PodDisruptionBudget (PDB) that requires a minimum number of available pods can block a rolling update if the deployment cannot safely terminate old pods while staying above the minimum.

# Check the PDBs covering this app
kubectl get pdb -n <NAMESPACE>
kubectl describe pdb <PDB_NAME> -n <NAMESPACE>

Expected output (PDB blocking):
Name:                 my-app-pdb
Min Available:        3
Allowed Disruptions:  0    <-- 0 means the rollout cannot proceed
Status:
  Allowed Disruptions: 0
  Current:             3
  Desired:             3
  Total:               3

# Option A: Temporarily increase replicas so the disruption budget opens up
kubectl scale deployment <DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=<CURRENT_PLUS_ONE>
# Option B: If the PDB is too strict for rolling updates, adjust maxUnavailable in the deployment
kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1}}}}'
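The Allowed Disruptions figure is simply currently-healthy pods minus minAvailable, which is why Option A (adding a replica) reopens the budget. Using the sample values from the output above:

```shell
#!/bin/sh
# allowed disruptions = currently healthy pods - minAvailable (sample values)
current_healthy=3
min_available=3

allowed=$((current_healthy - min_available))
echo "allowed disruptions: $allowed"   # 0 here, so no old pod may be evicted

# Option A in effect: one extra healthy replica reopens the budget
allowed_after_scale=$(( (current_healthy + 1) - min_available ))
echo "after scaling up by one: $allowed_after_scale"
```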
Step 6: Rollback If Needed
Why: If the new version is broken (bad image, broken config, failing readiness probe), rolling forward is slower than rolling back to the last known-good version. Rollback is fast and safe.
# Check rollout history
kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
# Roll back to the previous version
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
# Roll back to a specific revision
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>
# Watch rollback progress
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
Expected output:
deployment.apps/<DEPLOYMENT_NAME> rolled back
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
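When rolling back to a specific revision, the target is usually the highest revision before the current one. A parsing sketch, with sample history output standing in for the live command; the change-cause text is illustrative:

```shell
#!/bin/sh
# Pick the previous revision from rollout history. Live source:
#   kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
hist='REVISION  CHANGE-CAUSE
1         <none>
2         image bump to v2.0.0'

current=$(printf '%s\n' "$hist" | awk 'NR > 1 { rev = $1 } END { print rev }')
prev=$((current - 1))
echo "kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=$prev"
```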
Verification

# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>

Success looks like: deployment "my-app" successfully rolled out, and all pods show Running with READY at full count.
If still broken: Escalate — see below.
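The two verification checks can be reduced to one pass/fail test over the pod listing. A sketch, with sample output and hypothetical pod names standing in for the live command:

```shell
#!/bin/sh
# Count pods that are not fully Ready and Running. Live source:
#   kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
pods='NAME                    READY   STATUS    RESTARTS   AGE
my-app-5c4a2b1d9-abcde  1/1     Running   0          3m
my-app-5c4a2b1d9-fghij  1/1     Running   0          3m
my-app-5c4a2b1d9-klmno  1/1     Running   0          3m'

not_ready=$(printf '%s\n' "$pods" | awk '
  NR > 1 { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") n++ }
  END { print n + 0 }')

if [ "$not_ready" -eq 0 ]; then
  echo "all pods Ready: resolved"
else
  echo "$not_ready pod(s) not Ready: escalate"
fi
```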
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes deployment stuck in |
| Data loss suspected | Platform Lead | "Data loss risk: stateful deployment |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: multiple deployments stuck, possible cluster-wide resource exhaustion" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Notify the developer if a bad image or config triggered the incident
- Review PDB and quota settings if they contributed to the incident
Common Mistakes
- Rolling forward when rollback is faster: When a new deployment version is broken, engineers sometimes spend 20 minutes trying to fix the new version in-place when a 30-second rollback would restore service. If the new version is clearly bad (broken image, failing health check), roll back first, restore service, then fix forward in a non-production environment.
- Not checking resource quotas: Pods stuck in Pending with no events are almost always a resource quota or node capacity issue. Engineers often check pod logs and image pull status (which are fine) while missing the quota-exceeded error that is only visible via kubectl describe namespace or kubectl get resourcequota.
- Ignoring PodDisruptionBudgets: A rollout that appears to hang for no reason, where new pods are starting but old pods are not terminating, is usually blocked by a PDB. This is especially common in stateful services with strict availability requirements.
Cross-References
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if new pods start but immediately crash
- Related Runbook: pvc-pending.md — if new pods are pending due to unbound volumes
- Related Runbook: node-not-ready.md — if nodes are full or NotReady causing pending pods