Decision Tree: Pod Won't Start¶
Category: Incident Triage
Starting Question: "A pod is stuck and won't start — what state is it in?"
Estimated traversal: 2-4 minutes
Domains: kubernetes, linux-performance, observability
The Tree¶
A pod is stuck and won't start — what state is it in?
(kubectl get pod <name> -n <namespace>)
│
├── CrashLoopBackOff
│ │ "The container starts, runs, then exits repeatedly."
│ │
│ ├── Get exit code: `kubectl describe pod <pod> | grep "Exit Code"`
│ │ │
│ │ ├── Exit Code 0 (success but container exited)
│ │ │ └── Container has no foreground process — it's a job, not a server
│ │ │ └── ✅ ACTION: Fix Container Entrypoint / Use restartPolicy: OnFailure
│ │ │
│ │ ├── Exit Code 1 (application error)
│ │ │ │
│ │ │ ├── Check logs: `kubectl logs <pod> --previous`
│ │ │ │ │
│ │ │ │ ├── Config error / missing env var
│ │ │ │ │ └── ✅ ACTION: Fix ConfigMap / Secret / Environment Variable
│ │ │ │ │
│ │ │ │ ├── Cannot connect to dependency on startup
│ │ │ │ │ └── ✅ ACTION: Fix Dependency or Add Init Container Check
│ │ │ │ │
│ │ │ │ └── Application panic / exception at startup
│ │ │ │ └── ✅ ACTION: Fix Application Bug / Roll Back Image
│ │ │ │
│ │ │ └── No previous logs (container died too fast)
│ │ │ `kubectl get pod <pod> -o yaml | grep -A5 "lastState"`
│ │ │ └── Use debug container: `kubectl debug -it <pod> --image=busybox --copy-to=debug-pod`
│ │ │
│ │ ├── Exit Code 137 (OOMKilled — process killed by kernel)
│ │ │ │
│ │ │ ├── Check memory limit: `kubectl describe pod <pod> | grep -A2 "Limits"`
│ │ │ │
│ │ │ ├── Was limit too low? → ✅ ACTION: Increase Memory Limit
│ │ │ │
│ │ │ └── Was there a memory leak? → ✅ ACTION: Investigate and Patch Memory Leak
│ │ │ Runbook: [oomkilled.md](../../runbooks/kubernetes/oom-kill.md)
│ │ │
│ │ ├── Exit Code 139 (Segmentation fault)
│ │ │ └── Compiled binary crashing — check for bad library, wrong arch image
│ │ │ `kubectl describe pod | grep "Image:"`
│ │ │ └── ⚠️ ESCALATION: Engage App Owner with core dump if available
│ │ │
│ │ └── Exit Code 2 / 126 / 127 (misuse of shell builtin / cannot execute / command not found)
│ │ └── Entrypoint or command is wrong
│ │ `kubectl get pod <pod> -o yaml | grep -A5 "command:"`
│ │ └── ✅ ACTION: Fix Container Command / Entrypoint
│ │
│ ├── ImagePullBackOff / ErrImagePull
│ │ │ "Kubernetes cannot pull the container image."
│ │ │
│ │ ├── Check image name and tag
│ │ │ `kubectl describe pod <pod> | grep "Image:"`
│ │ │ │
│ │ │ ├── Typo in image name or tag doesn't exist
│ │ │ │ └── ✅ ACTION: Fix Image Reference in Deployment
│ │ │ │
│ │ │ └── Image name looks correct → check pull credentials
│ │ │
│ │ ├── Check registry credentials
│ │ │ `kubectl get secret -n <namespace> | grep docker`
│ │ │ `kubectl describe pod | grep "Failed to pull image"`
│ │ │ │
│ │ │ ├── "unauthorized" in event → credentials expired or wrong
│ │ │ │ └── ✅ ACTION: Rotate / Recreate Image Pull Secret
│ │ │ │ Runbook: [imagepullbackoff.md](../../runbooks/kubernetes/imagepullbackoff.md)
│ │ │ │
│ │ │ └── Credentials look fine → check network access to registry
│ │ │
│ │ └── Can the node reach the registry?
│ │ `kubectl debug node/<node> -it --image=curlimages/curl -- curl -v https://<registry>` (busybox has no curl; use `wget -qO-` if you must use busybox)
│ │ │
│ │ ├── Connection refused / timeout → NetworkPolicy or firewall blocking egress
│ │ │ └── ✅ ACTION: Fix NetworkPolicy / Egress Rule
│ │ │ Runbook: [networkpolicy_block.md](../../runbooks/kubernetes/networkpolicy_block.md)
│ │ │
│ │ └── TLS error → certificate or proxy issue
│ │ └── ✅ ACTION: Fix Registry TLS / Configure HTTP Proxy
│ │
│ ├── Pending
│ │ │ "Pod has been accepted but not scheduled to a node."
│ │ │
│ │ ├── Check why: `kubectl describe pod <pod> | grep -A10 "Events:"`
│ │ │ │
│ │ │ ├── "Insufficient cpu" / "Insufficient memory"
│ │ │ │ │
│ │ │ │ ├── Check node capacity: `kubectl describe nodes | grep -A5 "Allocated"`
│ │ │ │ │
│ │ │ │ ├── All nodes full → ✅ ACTION: Scale Node Group / Reduce Pod Requests
│ │ │ │ │
│ │ │ │ └── Requests too high → ✅ ACTION: Tune Pod Resource Requests
│ │ │ │
│ │ │ ├── "did not match node selector" / "node affinity"
│ │ │ │ `kubectl get pod <pod> -o yaml | grep -A10 "affinity:"`
│ │ │ │ └── ✅ ACTION: Fix Node Selector or Affinity Rules
│ │ │ │
│ │ │ ├── "had taint that the pod did not tolerate"
│ │ │ │ `kubectl get nodes -o json | jq '.items[].spec.taints'`
│ │ │ │ └── ✅ ACTION: Add Toleration to Pod or Remove Taint from Node
│ │ │ │
│ │ │ └── "persistentvolumeclaim is not bound" / "unbound PVC"
│ │ │ `kubectl get pvc -n <namespace>`
│ │ │ └── ✅ ACTION: Fix PVC / StorageClass
│ │ │
│ │ └── Pending for >10 min with no Events?
│ │ `kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20`
│ │ └── Check if scheduler is running: `kubectl get pods -n kube-system | grep scheduler`
│ │ └── Scheduler down → ⚠️ ESCALATION: Platform / Cluster Admin
│ │
│ ├── Init:Error / Init:CrashLoopBackOff
│ │ │ "An init container is failing before the main container starts."
│ │ │
│ │ ├── Identify which init container:
│ │ │ `kubectl describe pod <pod> | grep -A3 "Init Containers:"`
│ │ │ │
│ │ │ └── Get its logs: `kubectl logs <pod> -c <init-container-name>`
│ │ │ │
│ │ │ ├── DB migration failing → ✅ ACTION: Fix Migration Script / DB Schema
│ │ │ │
│ │ │ ├── Waiting for dependency that never comes up
│ │ │ │ └── ✅ ACTION: Fix Dependency / Check Service DNS
│ │ │ │
│ │ │ └── Permission denied on volume
│ │ │ └── ✅ ACTION: Fix Volume Permission / securityContext
│ │ │
│ └── OOMKilled
│ │ "Pod ran out of memory while running."
│ │
│ ├── `kubectl describe pod <pod> | grep -i "OOM\|killed\|Reason"`
│ │
│ ├── Was it a one-time spike? → Increase limit temporarily, monitor
│ │
│ └── Is memory growing over time? → Memory leak
│ `watch -n 5 kubectl top pod <pod>` — watch it grow (`kubectl top` has no watch flag)
│ └── ✅ ACTION: Investigate Memory Leak / Tune JVM Heap
│ Runbook: [oomkilled.md](../../runbooks/kubernetes/oom-kill.md)
Node Details¶
Check 1: Pod status at a glance¶
Command: kubectl get pod <name> -n <namespace> -o wide and kubectl describe pod <name> -n <namespace>
What you're looking for: STATUS column (CrashLoopBackOff, Pending, ImagePullBackOff, Init:0/1, OOMKilled), RESTARTS count, and the Events section at the bottom of describe output.
Common pitfall: kubectl get pods truncates long status strings. Always use kubectl describe pod for the full picture, especially the Events section.
Check 2: Exit codes from CrashLoopBackOff¶
Command: kubectl describe pod <pod> | grep -A10 "Last State:" — shows the exit code of the previous container run.
What you're looking for: Exit Code 1 = app error. Exit Code 137 = OOMKilled. Exit Code 139 = segfault. Exit Code 0 = container exited cleanly (wrong process type). Exit Code 126 = cannot execute (permission denied), Exit Code 127 = command not found.
Common pitfall: The "Exit Code" shown in kubectl describe is the exit code of the previous run, not the current one. If the pod just started, you may see the current container's state as "Running" before it crashes again.
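The mapping above can be captured in a small helper for triage scripts — a sketch only; the function name `explain_exit_code` is ours, not a kubectl feature:

```shell
#!/bin/sh
# Sketch: map a container exit code to its most likely cause.
# The function name explain_exit_code is hypothetical, not part of kubectl.
explain_exit_code() {
  case "$1" in
    0)         echo "clean exit: no foreground process (job, not a server)" ;;
    1)         echo "application error: check kubectl logs --previous" ;;
    2|126|127) echo "bad command: entrypoint wrong, not executable, or not found" ;;
    137)       echo "SIGKILL (128+9): usually OOMKilled" ;;
    139)       echo "SIGSEGV (128+11): segmentation fault" ;;
    *)         echo "uncategorized: check app docs for exit code $1" ;;
  esac
}

# Example: feed it the code from `kubectl describe pod ... | grep "Exit Code"`
explain_exit_code 137
```

Exit codes above 128 encode a fatal signal as 128 + signal number, which is why 137 means SIGKILL (9) and 139 means SIGSEGV (11).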
Check 3: Previous container logs¶
Command: kubectl logs <pod> --previous -n <namespace> and kubectl logs <pod> --previous -c <container> -n <namespace> for multi-container pods.
What you're looking for: Error messages printed just before the process exited. Most apps log their fatal errors.
Common pitfall: If the container died before the app could write logs, --previous returns empty. In that case, use kubectl debug to create a copy of the pod with an interactive shell and inspect the filesystem.
Check 4: ImagePullBackOff image inspection¶
Command: kubectl describe pod <pod> | grep "Failed to pull" — this shows the actual Docker registry error message including "not found", "unauthorized", "connection refused".
What you're looking for: The exact error from the registry. "manifest unknown" = tag doesn't exist. "unauthorized" = auth failure. "connection refused" = network issue.
Common pitfall: Tags like latest may have been overwritten in the registry. Always use immutable digest-pinned or semver-tagged images in production.
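To pin by digest rather than a mutable tag, reference the image like this in the pod spec — a sketch; the registry path and digest below are placeholders:

```yaml
# Sketch: digest-pinned image reference (repository and digest are placeholders).
containers:
  - name: app
    # Immutable: a digest always resolves to the same image bytes,
    # unlike a tag such as :latest, which can be re-pushed.
    image: registry.example.com/team/app@sha256:4f5c3e9d8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0a9b8c7d6e5f4a3b2c1d
```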
Check 5: Pending — scheduler Events¶
Command: kubectl describe pod <pod> | tail -30 — the Events section shows the scheduler's rejection reason with full detail.
What you're looking for: "0/3 nodes are available: 3 Insufficient memory" or "no nodes matched node selector" or "had volume node affinity conflict".
Common pitfall: Nodes that are NotReady or have taints won't appear as "available" in the scheduler message even if they have resources. Check kubectl get nodes — a NotReady node looks like a resource shortage.
Check 6: Init container logs¶
Command: kubectl logs <pod> -c <init-container-name> — you must specify the init container name with -c. List all containers: kubectl get pod <pod> -o jsonpath='{.spec.initContainers[*].name}'.
What you're looking for: Init containers often run migrations, wait-for-service checks, or secret injection. Look for connection errors, SQL errors, or file permission errors.
Common pitfall: Init containers run in sequence. If the first init container fails, subsequent ones never run. Check kubectl describe pod to see which init container is at Init:Error.
Terminal Actions¶
Action: Fix ConfigMap / Secret / Environment Variable¶
Do:
1. Identify missing or wrong env var from crash logs
2. Check what's currently set: kubectl exec -it <pod> -- env | grep <VAR_NAME> (if pod is briefly running)
3. Update ConfigMap: kubectl edit configmap <name> or apply updated manifest
4. For secrets: kubectl create secret generic <name> --from-literal=key=value --dry-run=client -o yaml | kubectl apply -f -
5. Rollout restart to pick up changes: kubectl rollout restart deployment/<name>
Verify: Pod starts and stays Running. kubectl logs <pod> shows no config errors.
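For reference, a pod spec wires these values in like the following sketch — the names `app-config`, `db-credentials`, and the keys are hypothetical:

```yaml
# Sketch: pulling env vars from a ConfigMap and a Secret
# (all names and keys here are illustrative placeholders).
containers:
  - name: app
    env:
      - name: DB_HOST
        valueFrom:
          configMapKeyRef:
            name: app-config      # must exist in the same namespace
            key: db_host
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: db-credentials  # a missing Secret blocks container start
            key: password
```

Note that env vars resolved at container start are not refreshed by editing the ConfigMap alone; that is why step 5 restarts the rollout.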
Action: Increase Memory Limit¶
Do:
1. Check current limit: kubectl get pod <pod> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
2. Increase by 50%: kubectl set resources deployment <name> --limits=memory=1Gi
3. Watch pod restart: kubectl get pods -l app=<name> -w
Verify: Pod stays running. Check kubectl describe pod <pod> — no OOMKilled in lastState.
Runbook: oomkilled.md
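If you manage the manifest directly instead of using kubectl set resources, the block to edit looks like this sketch (values are illustrative):

```yaml
# Sketch: container resources block (values are placeholders).
containers:
  - name: app
    resources:
      requests:
        memory: "512Mi"   # what the scheduler reserves on the node
      limits:
        memory: "1Gi"     # exceeding this triggers the OOM killer (exit 137)
```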
Action: Rotate / Recreate Image Pull Secret¶
Do:
1. Generate new credentials with your registry (e.g., docker login, GCR service account key)
2. Delete old secret: kubectl delete secret regcred -n <namespace>
3. Create new: kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<token> -n <namespace>
4. Confirm pod spec references it: kubectl get deployment <name> -o yaml | grep imagePullSecrets
Verify: kubectl describe pod <new-pod> shows "Successfully pulled image" event.
Runbook: imagepullbackoff.md
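If step 4 shows the secret is not referenced, add it to the pod template — a sketch, with the secret name matching step 3 and a placeholder image:

```yaml
# Sketch: referencing the pull secret from the pod spec.
# "regcred" matches the secret created in step 3; the image is a placeholder.
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3
```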
Action: Fix Node Selector or Affinity Rules¶
Do:
1. Inspect pod affinity: kubectl get pod <pod> -o yaml | grep -A20 "affinity:"
2. Check node labels: kubectl get nodes --show-labels
3. If label is missing from nodes: kubectl label node <node-name> <key>=<value>
4. Or relax the affinity to preferredDuringSchedulingIgnoredDuringExecution
Verify: Pod moves from Pending to Running. kubectl describe pod no longer shows scheduling failure.
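Step 4's relaxed form looks like this sketch — the label key and value are hypothetical; with `preferred` the scheduler tries to honor the rule but still places the pod if no node matches:

```yaml
# Sketch: soft (preferred) node affinity. Unlike the required variant,
# a non-matching cluster does not leave the pod Pending.
# Label key/value are illustrative placeholders.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: workload-class
              operator: In
              values: ["memory-optimized"]
```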
Action: Fix PVC / StorageClass¶
Do:
1. kubectl get pvc -n <namespace> — check STATUS column (should be Bound)
2. kubectl describe pvc <name> — look for "no storage class" or "volume node affinity conflict"
3. If StorageClass missing: kubectl get sc — confirm the StorageClass referenced in the PVC exists
4. If topology conflict: delete the PVC and PV, let them be recreated on the correct node
Verify: kubectl get pvc shows STATUS = Bound. Pod starts.
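A PVC that names its StorageClass explicitly avoids the "no storage class" failure in step 2 — a sketch; the class name and size are placeholders:

```yaml
# Sketch: PVC with an explicit StorageClass (name and size are placeholders).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3   # must appear in `kubectl get sc`
  resources:
    requests:
      storage: 10Gi
```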
Action: Fix Migration Script / DB Schema¶
Do:
1. kubectl logs <pod> -c <init-container-name> — get the full migration error
2. Connect to DB and check current schema version
3. Fix migration script or mark failed migration as resolved
4. kubectl rollout restart deployment/<name> to retry init container
Verify: kubectl describe pod <new-pod> shows init container completed successfully (exit 0).
Escalation: Platform / Cluster Admin¶
When: Kubernetes scheduler is not running, API server unreachable, or system pods are crashing.
Who: Platform / cluster admin team
Include in page: Output of kubectl get pods -n kube-system, kubectl get nodes, and any error from kubectl describe pod <failing-pod>
Escalation: Engage App Owner with Core Dump¶
When: Exit code 139 (segfault) — this requires developer investigation of the binary.
Who: Application development team
Include in page: Pod name, image tag, node name, dmesg | grep segfault output from the node
Edge Cases¶
- Pod in Terminating state forever: A finalizer is blocking deletion. `kubectl patch pod <pod> -p '{"metadata":{"finalizers":[]}}' --type=merge`. Use only when the pod is truly stuck.
- Pod runs fine locally but crashes in cluster: Usually a missing environment variable, missing secret, or different resource limits. Compare the `docker run` environment vs the pod spec.
- Multiple pods crash simultaneously after a rolling update: The new image is broken. Run `kubectl rollout undo` immediately rather than debugging individual pods.
- CrashLoopBackOff with 5+ minute backoff: Kubernetes backs off exponentially (10s, 20s, 40s... up to 5 min). Use `kubectl debug` or temporarily set `restartPolicy: Never` to hold the pod for inspection.
- Init container waits for a service that isn't available: If using a `wait-for-it` pattern and the service is down, all pods will be stuck. Fix the dependency first, not the init container.
Cross-References¶
- Topic Packs: k8s-pods-and-scheduling, k8s-ops (incl. Probes), k8s-debugging-playbook
- Runbooks: crashloopbackoff.md, imagepullbackoff.md, oomkilled.md, networkpolicy_block.md