
Runbook: Deployment Stuck / Rollout Stalled

Domain: Kubernetes
Alert: kube_deployment_status_replicas_available < kube_deployment_status_replicas_desired for >10 min
Severity: P2
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 30 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: kubectl access, cluster-admin or namespace-admin, kubeconfig configured

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get deployments -n <NAMESPACE> -o wide
If output shows multiple deployments stuck: This may be a quota or node capacity problem — check kubectl describe namespace <NAMESPACE> for quota usage.
If output shows a single deployment stuck: Continue with the steps below.
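The READY column (e.g. 1/3) can be filtered mechanically when triaging many deployments at once. A minimal sketch — the stuck_deployments helper and the sample output are illustrative, not part of any standard tooling:

```shell
# Filter `kubectl get deployments` output down to rollouts not at full readiness.
# Pipe real output in:  kubectl get deployments -n <NAMESPACE> | stuck_deployments
stuck_deployments() {
  # READY is column 2 in "ready/desired" form; print names where ready < desired
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1 }'
}

# Illustrative sample only:
printf '%s\n' \
  'NAME     READY   UP-TO-DATE   AVAILABLE   AGE' \
  'my-app   1/3     1            1           2d' \
  'other    2/2     2            2           9d' \
  | stuck_deployments   # prints: my-app
```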

Step 1: Check Rollout Status

Why: The rollout status command gives a human-readable summary of what the deployment controller is waiting for, which narrows the investigation immediately.

kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get replicasets -n <NAMESPACE> -l app=<APP_LABEL> --sort-by='.metadata.creationTimestamp'
Expected output (stuck rollout):
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "my-app" rollout to finish: 2 old replicas are pending termination...
Expected output (replicasets — look for the new one with 0 available):
NAME               DESIRED   CURRENT   READY   AGE
my-app-7d9f8b6c4   3         3         3       2d     <-- old, healthy
my-app-5c4a2b1d9   3         3         0       8m     <-- new, stuck
If this fails: Verify the deployment name with kubectl get deployments -n <NAMESPACE>.
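The sorted ReplicaSet listing can also be checked mechanically instead of by eye. A small sketch, assuming the default column layout shown above (the unready_replicasets helper is illustrative):

```shell
# From `kubectl get replicasets` output (columns: NAME DESIRED CURRENT READY AGE),
# print ReplicaSets with zero ready pods -- typically the stuck new generation.
unready_replicasets() {
  awk 'NR > 1 && $4 == 0 { print $1 }'
}

# Illustrative sample matching the expected output above:
printf '%s\n' \
  'NAME               DESIRED   CURRENT   READY   AGE' \
  'my-app-7d9f8b6c4   3         3         3       2d' \
  'my-app-5c4a2b1d9   3         3         0       8m' \
  | unready_replicasets   # prints: my-app-5c4a2b1d9
```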

Step 2: Check New Pod Events

Why: The new pods spawned by the rollout will have events explaining why they are not becoming Ready.

# Get the pods from the new ReplicaSet
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
# Then describe one of the new (not-ready) pods
kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE>
kubectl get events -n <NAMESPACE> --sort-by='.lastTimestamp' | tail -20
Expected output — look for the Events section:
Events:
  Warning  Failed     2m    kubelet  Failed to pull image "registry.example.com/my-app:v2.0.0": ...
  Warning  BackOff    90s   kubelet  Back-off pulling image "registry.example.com/my-app:v2.0.0"
If output shows image pull errors: Go to Step 3.
If output shows Pending with no events: Go to Step 4 (quota) or check node capacity.
If output shows readiness probe failing: The app started but is not healthy — check application logs with kubectl logs <NEW_POD_NAME> -n <NAMESPACE>.
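The branch logic above can be sketched as a small classifier over the event text. This is an illustrative helper, not a kubectl feature — the category strings and routing are assumptions based on the decision points in this step:

```shell
# Classify the likely failure mode from `kubectl describe pod` / event text.
# Usage: kubectl describe pod <NEW_POD_NAME> -n <NAMESPACE> | classify_events
classify_events() {
  events=$(cat)
  case "$events" in
    *"Failed to pull image"*|*ErrImagePull*|*"Back-off pulling"*|*ImagePullBackOff*)
      echo "image-pull: go to Step 3" ;;
    *"Readiness probe failed"*)
      echo "readiness-probe: check application logs" ;;
    *Insufficient*|*"exceeded quota"*)
      echo "capacity-or-quota: go to Step 4" ;;
    *)
      echo "unknown: read the events manually" ;;
  esac
}

# Illustrative sample event line:
echo 'Warning  BackOff  90s  kubelet  Back-off pulling image "registry.example.com/my-app:v2.0.0"' \
  | classify_events   # prints: image-pull: go to Step 3
```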

Step 3: Check Image Pull

Why: A bad image tag or missing registry credentials is one of the most common causes of stuck rollouts and is trivially fixed.

# Check the image being pulled
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[*].image}'

# Check if the imagePullSecret exists
kubectl get secrets -n <NAMESPACE> | grep registry

# Try to verify the image exists (if you have registry access)
docker manifest inspect <IMAGE>:<TAG>

# If secret is missing or wrong, re-create it
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_URL> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  -n <NAMESPACE>
Expected output (image pull succeeds after fix): Pod events show Pulled and Started instead of Failed and BackOff. If this fails: The image tag genuinely does not exist in the registry — rollback (Step 6) and notify the developer.
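Before re-creating the secret, it can be worth confirming whether the existing one actually covers the registry in the image reference. A hedged sketch — check_pull_secret is an illustrative helper, and the commented kubectl line assumes a kubernetes.io/dockerconfigjson-type secret:

```shell
# Check whether a decoded .dockerconfigjson covers a given registry.
# Reads the JSON on stdin; matches the registry hostname literally.
check_pull_secret() {
  if grep -qF "\"$1\""; then
    echo "secret covers $1"
  else
    echo "secret has no entry for $1"
  fi
}

# With cluster access (commented out here):
# kubectl get secret <SECRET_NAME> -n <NAMESPACE> \
#   -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d \
#   | check_pull_secret registry.example.com

# Illustrative sample:
echo '{"auths":{"registry.example.com":{"auth":"<BASE64>"}}}' \
  | check_pull_secret registry.example.com   # prints: secret covers registry.example.com
```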

Step 4: Check Resource Quotas

Why: Namespace resource quotas are a silent killer of rollouts. If the namespace has hit its CPU or memory quota, new pods will stay Pending indefinitely with no obvious error in pod events.

# Check namespace quota usage
kubectl describe namespace <NAMESPACE> | grep -A 20 "Resource Quotas"
# Or directly:
kubectl get resourcequota -n <NAMESPACE> -o yaml

# Check current resource usage vs limits
kubectl top pods -n <NAMESPACE>
Expected output (quota exceeded):
Resource Quotas
 Name:     default-quota
 Resource  Used   Hard
 --------  ---    ---
 cpu       3900m  4000m    <-- nearly full
 memory    7.8Gi  8Gi      <-- nearly full
If quota is the problem: Either increase the quota (requires cluster-admin) or reduce the replica count of the new deployment temporarily:
kubectl scale deployment <DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=<LOWER_COUNT>
If this fails: Contact the platform team to increase the namespace quota.
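To decide how far to scale down, the headroom can be computed from the Used/Hard values in the quota output. A minimal sketch, assuming both CPU values use the millicore suffix shown above (e.g. 3900m):

```shell
# Remaining CPU headroom in millicores, given Used and Hard from the quota
# output (e.g. 3900m vs 4000m). Assumes both values carry the "m" suffix.
cpu_headroom_millicores() {
  used=${1%m}
  hard=${2%m}
  echo $(( hard - used ))
}

cpu_headroom_millicores 3900m 4000m   # prints: 100
```

With 100m of headroom and new pods each requesting, say, 500m, no replacement replica can be admitted — exactly the silent Pending behavior this step describes.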

Step 5: Check Pod Disruption Budgets

Why: A PodDisruptionBudget (PDB) that requires a minimum number of available pods can block a rolling update if the deployment cannot safely terminate old pods while staying above the minimum.

kubectl get pdb -n <NAMESPACE>
kubectl describe pdb <PDB_NAME> -n <NAMESPACE>
Expected output (PDB blocking):
Name:           my-app-pdb
Min Available:  3
Allowed Disruptions: 0    <-- 0 means the rollout cannot proceed
Status:
  Allowed Disruptions: 0
  Current:             3
  Desired:             3
  Total:               3
If Allowed Disruptions is 0: The PDB requires more replicas than are currently healthy. Options:
# Option A: Temporarily increase replicas to allow disruption budget to open
kubectl scale deployment <DEPLOYMENT_NAME> -n <NAMESPACE> --replicas=<CURRENT_PLUS_ONE>

# Option B: If the PDB is too strict for rolling updates, adjust maxUnavailable in the deployment
kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1}}}}'
If this fails: The PDB may be protecting a quorum-sensitive service (like a database). Do not bypass it without consulting the service owner.
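The Allowed Disruptions figure is simply healthy-minus-minimum, which is why Option A works: raising the healthy count opens the budget. A sketch — pdb_headroom is illustrative and only valid for an absolute minAvailable, not a percentage:

```shell
# Disruptions the PDB will allow: currently healthy pods minus minAvailable.
pdb_headroom() {
  echo $(( $1 - $2 ))   # $1 = currently healthy, $2 = minAvailable
}

pdb_headroom 3 3   # prints: 0 (rollout blocked, as in the output above)
pdb_headroom 4 3   # prints: 1 (one extra replica opens the budget)
```

The live value is also exposed directly as .status.disruptionsAllowed, e.g. kubectl get pdb <PDB_NAME> -n <NAMESPACE> -o jsonpath='{.status.disruptionsAllowed}'.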

Step 6: Rollback If Needed

Why: If the new version is broken (bad image, broken config, failing readiness probe), rolling forward is slower than rolling back to the last known-good version. Rollback is fast and safe.

# Check rollout history
kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>

# Roll back to the previous version
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>

# Roll back to a specific revision
kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=<REVISION_NUMBER>

# Watch rollback progress
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
Expected output:
deployment.apps/<DEPLOYMENT_NAME> rolled back
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
If rollback also stalls: The previous version may have a quota or image issue too — re-run Steps 2-4 against the rolled-back pods.
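When a specific revision is needed, the previous revision number can be pulled out of the history listing rather than read by eye. A sketch (previous_revision is an illustrative helper):

```shell
# From `kubectl rollout history` output, print the second-to-last revision
# number -- the revision `rollout undo` would target by default.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = cur; cur = $1 } END { print prev }'
}

# Illustrative sample:
printf '%s\n' \
  'deployment.apps/my-app' \
  'REVISION  CHANGE-CAUSE' \
  '1         <none>' \
  '2         <none>' \
  '3         <none>' \
  | previous_revision   # prints: 2
```

Usage with a live cluster might look like: kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --to-revision=$(kubectl rollout history deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> | previous_revision).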

Verification

# Confirm the issue is resolved
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL>
Success looks like: deployment "my-app" successfully rolled out, and all pods show Running with READY at full count.
If still broken: Escalate — see below.
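The readiness check can also be scripted, e.g. for a wait loop. A sketch assuming the default pod listing columns (all_pods_ready is an illustrative helper):

```shell
# Succeed (exit 0) only if every pod in `kubectl get pods` output is Running
# with READY at full count. Columns assumed: NAME READY STATUS RESTARTS AGE.
all_pods_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") exit 1 }'
}

# Illustrative sample -- one pod still not ready, so this reports "not ready":
printf '%s\n' \
  'NAME       READY   STATUS    RESTARTS   AGE' \
  'my-app-1   1/1     Running   0          3m' \
  'my-app-2   0/1     Running   0          3m' \
  | all_pods_ready && echo "all ready" || echo "not ready"   # prints: not ready
```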

Escalation

Condition | Who to Page | What to Say
Not resolved in 30 min | SRE on-call | "Kubernetes deployment stuck in <NAMESPACE>, deployment <DEPLOYMENT_NAME>, rollout stalled >30 min, runbook exhausted"
Data loss suspected | Platform Lead | "Data loss risk: stateful deployment rollout failed, possible data migration issue"
Scope expanding beyond namespace | Platform team | "Multi-namespace impact: multiple deployments stuck, possible cluster-wide resource exhaustion"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Notify the developer if a bad image or config triggered the incident
  • Review PDB and quota settings if they contributed to the incident

Common Mistakes

  1. Rolling forward when rollback is faster: When a new deployment version is broken, engineers sometimes spend 20 minutes trying to fix the new version in-place when a 30-second rollback would restore service. If the new version is clearly bad (broken image, failing health check), roll back first, restore service, then fix forward in a non-production environment.
  2. Not checking resource quotas: Pods stuck in Pending with no events are almost always a resource quota or node capacity issue. Engineers often check pod logs and image pull status (which are fine) while missing the quota exceeded error that is only visible via kubectl describe namespace or kubectl get resourcequota.
  3. Ignoring PodDisruptionBudgets: A rollout that appears to hang for no reason, where new pods are starting but old pods are not terminating, is usually blocked by a PDB. This is especially common in stateful services with strict availability requirements.
