
Decision Tree: Deployment Is Stuck

Category: Incident Triage
Starting Question: "A Kubernetes deployment is stuck — what's blocking it?"
Estimated traversal: 2-5 minutes
Domains: kubernetes, observability


The Tree

A Kubernetes deployment is stuck — what's blocking it?

├── Get rollout status: `kubectl rollout status deployment/<name> -n <namespace>`
│   │
│   ├── "Waiting for deployment rollout to finish: N of M updated replicas are available"
│   │   → New pods are not becoming Available. Check new pods below.
│   │
│   ├── "Waiting for deployment spec update to be observed"
│   │   → Deployment controller hasn't picked up the change yet
│   │   `kubectl get deployment <name> -w` (usually resolves in 5s)
│   │   If stuck >30s: ✅ ACTION: Check Deployment Controller / API Server Health
│   │
│   └── "deployment <name> successfully rolled out" but pods still old?
│       → May be a Helm/ArgoCD cache issue, not an actual stuck rollout
│       `kubectl get pods -l app=<name> -o jsonpath='{..image}'`
│       └── ✅ ACTION: Verify Running Image Version
│
├── Check new pod status: `kubectl get pods -l app=<name> -n <namespace>`
│   │
│   ├── New pods are in Pending
│   │   │
│   │   ├── Check why: `kubectl describe pod <new-pod> | grep -A10 "Events:"`
│   │   │   │
│   │   │   ├── "Insufficient cpu" / "Insufficient memory"
│   │   │   │   `kubectl describe nodes | grep -A5 "Allocated resources"`
│   │   │   │   │
│   │   │   │   ├── All nodes are at capacity
│   │   │   │   │   └── ✅ ACTION: Scale Node Group / Reduce Pod Requests
│   │   │   │   │
│   │   │   │   └── Resource quota exceeded
│   │   │   │       `kubectl describe resourcequota -n <namespace>`
│   │   │   │       └── ✅ ACTION: Increase Resource Quota / Reduce Deployment Resources
│   │   │   │
│   │   │   └── "PodDisruptionBudget" or "not enough disruption budget"
│   │   │       → see PDB branch below
│   │
│   ├── New pods are in ImagePullBackOff
│   │   `kubectl describe pod <new-pod> | grep "Failed to pull\|Back-off pulling"`
│   │   │
│   │   ├── Image tag doesn't exist → typo in new image reference
│   │   │   └── ✅ ACTION: Fix Image Reference in Deployment
│   │   │
│   │   └── Authentication failure → image pull secret expired
│   │       └── ✅ ACTION: Rotate Image Pull Secret
│   │       Runbook: [imagepullbackoff.md](../../runbooks/kubernetes/imagepullbackoff.md)
│   │
│   ├── New pods are in CrashLoopBackOff
│   │   │   "Pods start and immediately crash — new version is broken."
│   │   │
│   │   ├── Check logs: `kubectl logs <new-pod> --previous`
│   │   │   │
│   │   │   ├── Application error / exception at startup
│   │   │   │   └── ✅ ACTION: Roll Back Deployment (new image is broken)
│   │   │   │
│   │   │   ├── Config / secret / env var missing
│   │   │   │   └── ✅ ACTION: Add Missing Config and Redeploy
│   │   │   │
│   │   │   └── Cannot connect to dependency (DB not yet running, etc.)
│   │   │       └── ✅ ACTION: Fix Dependency Health / Add Init Container
│   │   │
│   └── New pods are Running but not Ready (0/1 in READY column)
│       │   "Pod is up but failing readiness probe."
│       │
│       ├── Check readiness probe: `kubectl describe pod <new-pod> | grep -A10 "Readiness:"`
│       │   `kubectl describe pod <new-pod> | grep "Readiness probe failed"`
│       │   │
│       │   ├── Probe is checking wrong path or port
│       │   │   └── ✅ ACTION: Fix Readiness Probe in Deployment Spec
│       │   │
│       │   ├── App is slow to start and probe fires before ready
│       │   │   (especially after a large image change)
│       │   │   └── ✅ ACTION: Increase initialDelaySeconds in Readiness Probe
│       │   │
│       │   └── App is actually unhealthy (returns 5xx on health endpoint)
│       │       `kubectl exec -it <new-pod> -- curl -v localhost:<port><path>`
│       │       └── 5xx from health endpoint → fix app; or Roll Back
│       │
├── Check PodDisruptionBudgets
│   `kubectl get pdb -n <namespace>`
│   `kubectl describe pdb -n <namespace>`
│   │
│   ├── minAvailable is set high / maxUnavailable is 0
│   │   and old pods are not being evicted because new ones aren't ready
│   │   └── Deadlock: PDB prevents old pod removal; new pods can't start because readiness fails
│   │       │
│   │       ├── Is this a transient readiness issue?
│   │       │   → Fix readiness first; PDB will unblock
│   │       │
│   │       └── Is this a genuine PDB misconfiguration?
│   │           └── ✅ ACTION: Temporarily Relax PDB
│   │
│   └── PDB is satisfied → not the blocker

├── Check rollout strategy
│   `kubectl get deployment <name> -o yaml | grep -A5 "strategy:"`
│   │
│   ├── strategy: Recreate
│   │   → All old pods deleted before new pods start
│   │   → Downtime is expected. Wait for new pods.
│   │   → If new pods fail: ✅ ACTION: Roll Back Deployment
│   │
│   └── strategy: RollingUpdate
│       `maxSurge` and `maxUnavailable` values?
│       │
│       ├── maxUnavailable: 0 and maxSurge: 0 → impossible config (both 0 blocks rollout)
│       │   └── ✅ ACTION: Fix Rollout Strategy (set maxSurge ≥ 1)
│       │
│       └── maxSurge: 0 and no nodes have capacity for new pods → no room to surge
│           └── ✅ ACTION: Scale Node Group or Increase maxSurge

└── Check for broader blockers

    ├── Admission webhook / policy blocking new pods?
    │   `kubectl describe pod <new-pod> | grep "Error from server\|admission webhook"`
    │   │
    │   └── Admission webhook rejecting → ✅ ACTION: Fix Policy Violation in New Pod Spec
    │       (see Kyverno / OPA Gatekeeper policy for what's failing)
    │
    ├── Resource quota in namespace
    │   `kubectl describe resourcequota -n <namespace>`
    │   └── Hard quota exceeded → ✅ ACTION: Increase Resource Quota / Reduce Requests
    │
    └── Helm upgrade stuck (release in pending-upgrade state)?
        `helm status <release-name> -n <namespace>`
        │
        ├── STATUS: pending-upgrade → previous Helm operation in progress or failed
        │   └── ✅ ACTION: Fix Stuck Helm Release
        │       Runbook: [helm_upgrade_failed.md](../../runbooks/cicd/helm_upgrade_failed.md)
        │
        └── STATUS: failed → hooks or jobs failed
            `helm history <release>` → find the failed revision
            └── ✅ ACTION: Roll Back Helm Release

Node Details

Check 1: Rollout status message

Command: `kubectl rollout status deployment/<name> -n <namespace> --timeout=2m`

What you're looking for: The exact message tells you what's blocking. "N of M updated replicas are available" means new pods aren't passing readiness. "Waiting for deployment spec update" is usually transient and self-resolves.

Common pitfall: `kubectl rollout status` blocks until it times out or completes. Use a short `--timeout` in scripts, or run it once to read the message, then Ctrl-C.
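
In a script, you can branch on the captured status message the same way the tree does. A minimal sketch with a canned message; in practice you would capture the real output as shown in the comment:

```shell
# Canned example message; in practice capture it with:
#   msg=$(kubectl rollout status deployment/<name> -n <namespace> --timeout=2m 2>&1)
msg='Waiting for deployment "web" rollout to finish: 1 of 3 updated replicas are available...'
case "$msg" in
  *"updated replicas are available"*) branch="check-new-pods" ;;    # readiness problem
  *"spec update to be observed"*)     branch="check-controller" ;;  # controller lag
  *"successfully rolled out"*)        branch="done" ;;
  *)                                  branch="unknown" ;;
esac
echo "$branch"   # → check-new-pods
```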

Check 2: New pod events

Command: `kubectl describe pod <new-pod-name> -n <namespace>` — always look at the Events section at the bottom. Also: `kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep <deployment-name> | tail -20`.

What you're looking for: The scheduler's rejection reason, image pull failures, readiness probe failures, or admission webhook rejections. These are the most common blockers.

Common pitfall: Pod events expire after 1 hour (by default). If the deployment has been stuck for a long time, early events may be gone. Check `kubectl describe` while events are still present.
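
The event filter above, run against a fabricated sample so the shape of the pipeline is clear (real input comes from `kubectl get events`; these lines are illustrative only):

```shell
# Fabricated `kubectl get events` output for illustration only.
events='12m  Warning  FailedScheduling  pod/web-7d4x  0/3 nodes are available: Insufficient cpu
3m   Warning  Failed            pod/web-7d4x  Error: ImagePullBackOff
1m   Normal   Pulled            pod/api-9z2   Successfully pulled image'
# Same filter as in the text: keep only events for our deployment's pods.
matches=$(printf '%s\n' "$events" | grep "web-" | tail -20)
printf '%s\n' "$matches"
```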

Check 3: Readiness probe

Command: `kubectl get deployment <name> -o yaml | grep -A15 "readinessProbe:"` and then test manually: `kubectl exec -it <pod> -- curl -I localhost:<port><path>`.

What you're looking for: The probe's path, port, initialDelaySeconds, failureThreshold. A common issue is initialDelaySeconds too short for a slow-starting app (especially after adding migrations or startup processing).

Common pitfall: The probe runs every periodSeconds; failures show up in `kubectl describe pod` under "Events" as "Readiness probe failed: HTTP probe failed with statuscode: 500", with repeated failures aggregated into one event with an increasing count. The message includes the actual HTTP response body for debugging.
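
The worst-case time before a pod is marked NotReady follows directly from the probe fields. A quick check with illustrative values (not taken from any real manifest):

```shell
# Illustrative probe values. Worst case before the pod is marked NotReady:
# initialDelaySeconds + periodSeconds * failureThreshold.
initialDelaySeconds=5
periodSeconds=10
failureThreshold=3
window=$(( initialDelaySeconds + periodSeconds * failureThreshold ))
echo "${window}s"   # → 35s
```

If the app reliably takes longer than this window to start serving its health endpoint, the rollout will stall on readiness every time.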

Check 4: PodDisruptionBudget deadlock

Command: `kubectl describe pdb -n <namespace>` — check Disruptions Allowed. If it shows 0, the PDB is blocking eviction of old pods.

What you're looking for: A PDB deadlock occurs when: (1) the PDB requires N pods available, (2) new pods fail readiness (so old pods can't be evicted while maintaining N), and (3) the only way to get new pods ready requires evicting old ones. This requires fixing readiness first.

Common pitfall: Setting minAvailable: 100% or maxUnavailable: 0 on a single-replica deployment means the deployment can never roll out without downtime. Use minAvailable: 0 or maxUnavailable: 1 for single-replica deployments.
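
For a single-replica workload, a PDB that still permits rollout might look like the following sketch (names and labels are illustrative):

```yaml
# Hypothetical PDB for a single-replica deployment: allow the one pod to be
# replaced during a rollout instead of deadlocking it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # illustrative name
spec:
  maxUnavailable: 1      # never set minAvailable: 100% with one replica
  selector:
    matchLabels:
      app: web
```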

Check 5: Resource quota

Command: `kubectl describe resourcequota -n <namespace>` — shows hard limits vs currently used. Also: `kubectl get events -n <namespace> | grep "exceeded quota"`.

What you're looking for: Any quota type where Used equals Hard — this blocks new pod creation. Common quota types: requests.cpu, requests.memory, limits.cpu, limits.memory, count/pods.

Common pitfall: A deployment update that doesn't change the replica count can still hit quota if it increases the resource requests or limits per pod.
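
A surge pod's quota impact is easy to estimate: a RollingUpdate with maxSurge: 1 momentarily needs one extra pod's worth of requests. With illustrative numbers:

```shell
# Illustrative numbers: 5 replicas at 500m CPU requests each, 3000m hard quota.
replicas=5
per_pod_cpu_m=500
hard_cpu_m=3000
surge_need_m=$(( (replicas + 1) * per_pod_cpu_m ))   # peak during RollingUpdate
echo "peak ${surge_need_m}m vs hard ${hard_cpu_m}m"  # → peak 3000m vs hard 3000m
```

Here the surge pod just fits; any increase in per-pod requests would wedge the rollout on quota even though steady-state usage is fine.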

Check 6: Admission webhook

Command: `kubectl describe pod <failing-new-pod> | grep -i "admission\|denied\|forbidden\|webhook"` and `kubectl get validatingwebhookconfigurations` to list active webhooks.

What you're looking for: Policy enforcement messages like "validation error: container must set CPU limits", "image must be from approved registry", or "label X is required".

Common pitfall: If the admission webhook service itself is unavailable (and its failurePolicy is Fail), ALL pod creation in the cluster may be blocked (not just your deployment). Check the webhook pod health: `kubectl get pods -n kyverno` or `-n gatekeeper-system`.
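
As one illustration, a policy requiring CPU limits is satisfied by a container spec like this (container name, image, and values are made up for the example):

```yaml
# Hypothetical fix for a "container must set CPU limits" policy.
containers:
- name: web                                  # illustrative container name
  image: registry.example.com/web:1.2.3      # assumes an approved registry
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
```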


Terminal Actions

Action: Roll Back Deployment

Do:
1. `kubectl rollout undo deployment/<name> -n <namespace>`
2. Monitor: `kubectl rollout status deployment/<name> -n <namespace>`
3. Confirm old pods are back and healthy: `kubectl get pods -l app=<name>`
4. Verify service is healthy: check error rate in monitoring

Verify: All pods in READY=1/1. Rollout history shows the previous revision is now active.
Runbook: helm_upgrade_failed.md

Action: Fix Readiness Probe in Deployment Spec

Do:
1. `kubectl get deployment <name> -o yaml > /tmp/deploy-backup.yaml`
2. Edit: `kubectl edit deployment <name> -n <namespace>`
3. Under readinessProbe, correct the path or port, or increase initialDelaySeconds (e.g., from 5 to 30)
4. Save — this triggers a new rollout automatically

Verify: New pods show READY=1/1 within their initialDelaySeconds + periodSeconds * failureThreshold window.
Runbook: readiness_probe_failed.md
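
The corrected probe from step 3 corresponds to a Deployment fragment like this (path, port, and timings are example values, not prescribed ones):

```yaml
readinessProbe:
  httpGet:
    path: /healthz         # must match what the app actually serves
    port: 8080             # the containerPort, not the Service port
  initialDelaySeconds: 30  # raised from 5 for a slow-starting app
  periodSeconds: 10
  failureThreshold: 3
```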

Action: Fix Image Reference in Deployment

Do:
1. Find the correct image tag: check your container registry or CI/CD pipeline output
2. `kubectl set image deployment/<name> <container>=<registry>/<image>:<correct-tag> -n <namespace>`
3. Watch: `kubectl rollout status deployment/<name>`

Verify: `kubectl describe pod <new-pod> | grep "Image:"` shows the correct tag. Pod starts successfully.

Action: Temporarily Relax PDB

Do:
1. Document the current PDB: `kubectl get pdb <name> -n <namespace> -o yaml > /tmp/pdb-backup.yaml`
2. Patch to allow disruption: `kubectl patch pdb <name> -n <namespace> -p '{"spec":{"maxUnavailable":1}}'` (if the PDB sets minAvailable instead, remove that field in the same edit; the two are mutually exclusive)
3. Wait for rollout to complete: `kubectl rollout status deployment/<name>`
4. Restore the original PDB: `kubectl apply -f /tmp/pdb-backup.yaml`

Verify: Rollout completes. Restore the PDB immediately after.

Action: Fix Rollout Strategy

Do:
1. `kubectl patch deployment <name> -n <namespace> --type merge -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'`
2. This is the recommended default: always allow 1 extra pod to start before removing old ones

Verify: `kubectl rollout status deployment/<name>` proceeds immediately.
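
The patch in step 1 corresponds to this Deployment fragment (same values as the patch):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra pod may start before an old one is removed
      maxUnavailable: 0    # never drop below the desired replica count
```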

Action: Increase Resource Quota / Reduce Deployment Resources

Do:
1. Check what's over quota: `kubectl describe resourcequota -n <namespace>`
2. Option A: Reduce requests in the deployment: `kubectl set resources deployment <name> --requests=cpu=100m,memory=128Mi`
3. Option B: Increase quota (requires admin): `kubectl edit resourcequota <name> -n <namespace>`

Verify: `kubectl get events -n <namespace> | grep quota` shows no new "exceeded quota" events.

Action: Fix Stuck Helm Release

Do:
1. Check status: `helm status <release> -n <namespace>`
2. If pending-upgrade: `helm rollback <release> <previous-revision> -n <namespace>`
3. If hooks are failing: `helm upgrade --no-hooks <release> <chart>` as a one-time fix
4. Check: `helm history <release>` — confirm status is deployed

Verify: `helm status <release>` shows STATUS: deployed. Pods are healthy.
Runbook: helm_upgrade_failed.md

Action: Fix Policy Violation in New Pod Spec

Do:
1. Identify the specific policy: `kubectl describe pod <new-pod> | grep "admission webhook"`
2. Check the policy definition: `kubectl get clusterpolicy -A` (Kyverno) or `kubectl get constraint -A` (OPA)
3. Fix the deployment spec to comply: add required labels, set CPU limits, use an approved image registry
4. `kubectl apply` the fixed manifest

Verify: Pod creation succeeds without admission webhook errors.

Escalation: Platform / Cluster Admin

When: API server is slow, deployment controller is unresponsive, or scheduler is not assigning pods.
Who: Platform team / cluster administrator.
Include in page: Output of `kubectl rollout status`, `kubectl get events`, cluster name, namespace.


Edge Cases

  • Deployment shows complete but Helm shows failed: Helm hooks (pre/post upgrade Jobs) may have failed even if the Deployment pods are healthy. Check helm history for the "failed" reason and whether the hook job is still running: kubectl get jobs -n <namespace>.
  • Rollout stuck at exactly one new pod: If maxUnavailable: 0 and maxSurge: 1, one new pod starts but the old pod won't be removed until the new pod is Ready. If the new pod never becomes Ready, the rollout is permanently stuck. The fix is to either fix readiness or roll back.
  • Deployment rolled out successfully but traffic is still hitting old pods: Service mesh (Istio) may have stale DestinationRule with subset selectors pointing to old pod labels. Check VirtualService and DestinationRule for version/subset selectors.
  • StatefulSet stuck (not Deployment): StatefulSet rollout is ordered — it will not proceed past a pod that fails readiness. `kubectl rollout pause` is not supported for StatefulSets; use `spec.updateStrategy.rollingUpdate.partition` to stage the rollout and lower it as each pod is fixed.
  • Argo CD showing OutOfSync or Degraded: Argo CD may be applying a different manifest than you expect. Check argocd app get <name> for the diff. A stuck deployment in Argo CD may need argocd app sync --force.

Cross-References