Runbook: Deployment Stuck / Rollout Stalled
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | kube_deployment_status_replicas_available < kube_deployment_status_replicas_desired for >10 min |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get deployments -n <NAMESPACE> -o wide
# Check namespace quota usage at the same time
kubectl describe namespace <NAMESPACE>

If the output shows a single stuck deployment → continue with the steps below.
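The scope check can also be scripted. A minimal sketch, assuming the standard `kubectl get deployments` column layout; the here-string and deployment names below are hypothetical stand-ins for live output:

```shell
#!/bin/sh
# Count deployments whose READY column (x/y) shows fewer ready than desired.
# On a live cluster, replace the sample with: kubectl get deployments -n <NAMESPACE>
sample='NAME      READY   UP-TO-DATE   AVAILABLE   AGE
my-app    1/3     1            1           2d
other     2/2     2            2           9d'

stuck=$(printf '%s\n' "$sample" | awk 'NR > 1 { split($2, r, "/"); if (r[1] < r[2]) print $1 }')
echo "stuck deployments: ${stuck:-none}"
```

If more than one deployment shows up, suspect a namespace-wide cause (quota, node capacity) rather than a single bad rollout.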
Step 1: Check Rollout Status

Why: The rollout status command gives a human-readable summary of what the deployment controller is waiting for, which narrows the investigation immediately.

kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get replicasets -n <NAMESPACE> -l app=<APP_LABEL> --sort-by='.metadata.creationTimestamp'

Expected output (stalled rollout):
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "my-app" rollout to finish: 2 old replicas are pending termination...

NAME              DESIRED   CURRENT   READY   AGE
my-app-7d9f8b6c4  3         3         3       2d    <-- old, healthy
my-app-5c4a2b1d9  3         3         0       8m    <-- new, stuck

For a quick ready-vs-desired summary, run kubectl get deployments -n <NAMESPACE>.
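One more signal worth reading at this point is the deployment's Progressing condition. A sketch, with a sample value standing in for the live jsonpath output:

```shell
#!/bin/sh
# After progressDeadlineSeconds (default 600s) without progress, the
# Progressing condition's reason becomes ProgressDeadlineExceeded. Live check:
#   kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
reason='ProgressDeadlineExceeded'   # sample value, not a live query

if [ "$reason" = "ProgressDeadlineExceeded" ]; then
  verdict="controller has stopped retrying: inspect new pods (Step 2) or roll back (Step 6)"
else
  verdict="rollout still progressing: reason=$reason"
fi
echo "$verdict"
```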
Step 2: Check New Pod Events

Why: The new pods spawned by the rollout will have events explaining why they are not becoming Ready.

# Get the pods from the new ReplicaSet
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
# Then describe one of the new (not-ready) pods
kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE>
# Recent events across the namespace, newest last
kubectl get events -n <NAMESPACE> --sort-by='.lastTimestamp' | tail -20

Expected output (image pull failure example):
Events:
  Warning  Failed   2m   kubelet  Failed to pull image "registry.example.com/my-app:v2.0.0": ...
  Warning  BackOff  90s  kubelet  Back-off pulling image "registry.example.com/my-app:v2.0.0"

If the events are inconclusive, check the container logs: kubectl logs <NEW_POD_NAME> -n <NAMESPACE>.
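The waiting reason from the new pod's first container is often enough to route the rest of the investigation. A triage sketch; the `route` helper is hypothetical, and the reason would come from the jsonpath shown in the comment:

```shell
#!/bin/sh
# Map a container's waiting reason to the next runbook step. Live source:
#   kubectl get pod <NEW_POD_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
route() {
  case "$1" in
    ErrImagePull|ImagePullBackOff) echo "image pull problem: go to Step 3" ;;
    CreateContainerConfigError)    echo "missing ConfigMap/Secret: recheck pod events" ;;
    '')                            echo "no waiting reason: if Pending, check quotas (Step 4)" ;;
    *)                             echo "unrecognized reason, inspect manually: $1" ;;
  esac
}

route "ImagePullBackOff"
```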
Step 3: Check Image Pull
Why: A bad image tag or missing registry credentials is one of the most common causes of stuck rollouts and is trivially fixed.
# Check the image being pulled
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[*].image}'
# Check if the imagePullSecret exists
kubectl get secrets -n <NAMESPACE> | grep registry
# Try to verify the image exists (if you have registry access)
docker manifest inspect <IMAGE>:<TAG>
# If secret is missing or wrong, re-create it
kubectl create secret docker-registry <SECRET_NAME> \
--docker-server=<REGISTRY_URL> \
--docker-username=<USERNAME> \
--docker-password=<PASSWORD> \
-n <NAMESPACE>
Success looks like: new pod events show Pulled and Started instead of Failed and BackOff.
If this fails: The image tag genuinely does not exist in the registry — rollback (Step 6) and notify the developer.
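Beyond checking that the secret exists, it helps to confirm it actually holds credentials for the registry in the failing image reference. A sketch, with a sample payload standing in for the decoded secret; the registry URL and credentials are illustrative:

```shell
#!/bin/sh
# Decode the pull secret and confirm it covers the registry. Live command:
#   kubectl get secret <SECRET_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
payload='{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}'   # sample payload

registry='registry.example.com'   # registry from the failing image reference
if printf '%s' "$payload" | grep -q "\"$registry\""; then
  echo "pull secret covers $registry"
else
  echo "pull secret does NOT cover $registry: recreate it (see above)"
fi
```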
Step 4: Check Resource Quotas

Why: Namespace resource quotas are a silent killer of rollouts. If the namespace has hit its CPU or memory quota, new pods will stay Pending indefinitely with no obvious error in pod events.

# Check namespace quota usage
kubectl describe namespace <NAMESPACE> | grep -A 20 "Resource Quotas"
# Or directly:
kubectl get resourcequota -n <NAMESPACE> -o yaml
# Check current resource usage vs limits
kubectl top pods -n <NAMESPACE>

Expected output (quota nearly exhausted):
Resource Quotas
  Name:     default-quota
  Resource  Used   Hard
  --------  ----   ----
  cpu       3900m  4000m   <-- nearly full
  memory    7.8Gi  8Gi     <-- nearly full
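The arithmetic behind those "nearly full" annotations: a rolling update with the default maxSurge of 25% needs quota headroom for at least one extra pod. A sketch using the sample numbers above; the 200m figure is an assumed per-pod CPU request:

```shell
#!/bin/sh
# Will one surge pod fit under the CPU quota? Values mirror the sample
# output: 3900m used of a 4000m hard limit; 200m is an assumed pod request.
used_m=3900
hard_m=4000
pod_request_m=200

if [ $((used_m + pod_request_m)) -gt "$hard_m" ]; then
  echo "surge pod (+${pod_request_m}m) would exceed the quota: it stays Pending"
else
  echo "headroom available: quota is not the blocker"
fi
```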
Step 5: Check Pod Disruption Budgets

Why: A PodDisruptionBudget (PDB) that requires a minimum number of available pods can block a rolling update if the deployment cannot safely terminate old pods while staying above the minimum.

# Check the PDBs covering this app
kubectl get pdb -n <NAMESPACE>
kubectl describe pdb <PDB_NAME> -n <NAMESPACE>

Expected output (PDB blocking):
Name:                 my-app-pdb
Min Available:        3
Allowed Disruptions:  0    <-- 0 means the rollout cannot proceed
Status:
  Allowed Disruptions: 0
  Current:             3
  Desired:             3
  Total:               3

# Option A: Temporarily increase replicas so the disruption budget opens up
kubectl scale deployment <DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=<CURRENT_PLUS_ONE>
# Option B: If the PDB is too strict for rolling updates, adjust maxUnavailable in the deployment
kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1}}}}'
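The Allowed Disruptions figure is simply currently-healthy pods minus minAvailable, which is why Option A (adding a replica) reopens the budget. Using the sample values from the output above:

```shell
#!/bin/sh
# allowed disruptions = currently healthy pods - minAvailable (sample values)
current_healthy=3
min_available=3

allowed=$((current_healthy - min_available))
echo "allowed disruptions: $allowed"   # 0 here, so no old pod may be evicted

# Option A in effect: one extra healthy replica reopens the budget
allowed_after_scale=$(( (current_healthy + 1) - min_available ))
echo "after scaling up by one: $allowed_after_scale"
```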
Step 6: Rollback If Needed
Why: If the new version is broken (bad image, broken config, failing readiness probe), rolling forward is slower than rolling back to the last known-good version. Rollback is fast and safe.
# Check rollout history
kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
# Roll back to the previous version
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
# Roll back to a specific revision
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>
# Watch rollback progress
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
Expected output:
deployment.apps/<DEPLOYMENT_NAME> rolled back
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
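When rolling back to a specific revision, the target is usually the highest revision before the current one. A parsing sketch, with sample history output standing in for the live command; the change-cause text is illustrative:

```shell
#!/bin/sh
# Pick the previous revision from rollout history. Live source:
#   kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
hist='REVISION  CHANGE-CAUSE
1         <none>
2         image bump to v2.0.0'

current=$(printf '%s\n' "$hist" | awk 'NR > 1 { rev = $1 } END { print rev }')
prev=$((current - 1))
echo "kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=$prev"
```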
Verification

# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>

Success looks like: deployment "my-app" successfully rolled out, and all pods show Running with READY at full count.
If still broken: Escalate — see below.
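The two verification checks can be reduced to one pass/fail test over the pod listing. A sketch, with sample output and hypothetical pod names standing in for the live command:

```shell
#!/bin/sh
# Count pods that are not fully Ready and Running. Live source:
#   kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
pods='NAME                    READY   STATUS    RESTARTS   AGE
my-app-5c4a2b1d9-abcde  1/1     Running   0          3m
my-app-5c4a2b1d9-fghij  1/1     Running   0          3m
my-app-5c4a2b1d9-klmno  1/1     Running   0          3m'

not_ready=$(printf '%s\n' "$pods" | awk '
  NR > 1 { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") n++ }
  END { print n + 0 }')

if [ "$not_ready" -eq 0 ]; then
  echo "all pods Ready: resolved"
else
  echo "$not_ready pod(s) not Ready: escalate"
fi
```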
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes deployment stuck in |
| Data loss suspected | Platform Lead | "Data loss risk: stateful deployment |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: multiple deployments stuck, possible cluster-wide resource exhaustion" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Notify the developer if a bad image or config triggered the incident
- Review PDB and quota settings if they contributed to the incident
Common Mistakes
- Rolling forward when rollback is faster: When a new deployment version is broken, engineers sometimes spend 20 minutes trying to fix the new version in-place when a 30-second rollback would restore service. If the new version is clearly bad (broken image, failing health check), roll back first, restore service, then fix forward in a non-production environment.
- Not checking resource quotas: Pods stuck in Pending with no events are almost always a resource quota or node capacity issue. Engineers often check pod logs and image pull status (which are fine) while missing the quota-exceeded error that is only visible via kubectl describe namespace or kubectl get resourcequota.
- Ignoring PodDisruptionBudgets: A rollout that appears to hang for no reason, where new pods are starting but old pods are not terminating, is usually blocked by a PDB. This is especially common in stateful services with strict availability requirements.
Cross-References
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if new pods start but immediately crash
- Related Runbook: pvc-pending.md — if new pods are pending due to unbound volumes
- Related Runbook: node-not-ready.md — if nodes are full or NotReady causing pending pods