
Decision Tree: Pod Won't Start

Category: Incident Triage
Starting Question: "A pod is stuck and won't start — what state is it in?"
Estimated traversal: 2-4 minutes
Domains: kubernetes, linux-performance, observability


The Tree

A pod is stuck and won't start — what state is it in?
(kubectl get pod <name> -n <namespace>)

├── CrashLoopBackOff
│   │   "The container starts, runs, then exits repeatedly."
│   │
│   ├── Get exit code: `kubectl describe pod <pod> | grep "Exit Code"`
│   │   │
│   │   ├── Exit Code 0 (success but container exited)
│   │   │   └── Container has no foreground process — it's a job, not a server
│   │   │       └── ✅ ACTION: Fix Container Entrypoint / Use restartPolicy: OnFailure
│   │   │
│   │   ├── Exit Code 1 (application error)
│   │   │   ├── Check logs: `kubectl logs <pod> --previous`
│   │   │   │   ├── Config error / missing env var
│   │   │   │   │   └── ✅ ACTION: Fix ConfigMap / Secret / Environment Variable
│   │   │   │   ├── Cannot connect to dependency on startup
│   │   │   │   │   └── ✅ ACTION: Fix Dependency or Add Init Container Check
│   │   │   │   └── Application panic / exception at startup
│   │   │   │       └── ✅ ACTION: Fix Application Bug / Roll Back Image
│   │   │   └── No previous logs (container died too fast)
│   │   │       `kubectl get pod <pod> -o yaml | grep -A5 "lastState"`
│   │   │       └── Use debug container: `kubectl debug -it <pod> --image=busybox --copy-to=debug-pod`
│   │   │
│   │   ├── Exit Code 137 (OOMKilled — process killed by kernel)
│   │   │   ├── Check memory limit: `kubectl describe pod <pod> | grep -A2 "Limits"`
│   │   │   ├── Was limit too low? → ✅ ACTION: Increase Memory Limit
│   │   │   └── Was there a memory leak? → ✅ ACTION: Investigate and Patch Memory Leak
│   │   │       Runbook: [oomkilled.md](../../runbooks/kubernetes/oom-kill.md)
│   │   │
│   │   ├── Exit Code 139 (Segmentation fault)
│   │   │   └── Compiled binary crashing → check for bad library, wrong-arch image
│   │   │       `kubectl describe pod <pod> | grep "Image:"`
│   │   │       └── ⚠️ ESCALATION: Engage App Owner with core dump if available
│   │   │
│   │   └── Exit Code 2 / 126 / 127 (command not found / permission denied)
│   │       └── Entrypoint or command is wrong
│   │           `kubectl get pod <pod> -o yaml | grep -A5 "command:"`
│   │           └── ✅ ACTION: Fix Container Command / Entrypoint
│
├── ImagePullBackOff / ErrImagePull
│   │   "Kubernetes cannot pull the container image."
│   │
│   ├── Check image name and tag
│   │   `kubectl describe pod <pod> | grep "Image:"`
│   │   │
│   │   ├── Typo in image name or tag doesn't exist
│   │   │   └── ✅ ACTION: Fix Image Reference in Deployment
│   │   │
│   │   └── Image name looks correct → check pull credentials
│   │
│   ├── Check registry credentials
│   │   `kubectl get secret -n <namespace> | grep docker`
│   │   `kubectl describe pod <pod> | grep "Failed to pull image"`
│   │   │
│   │   ├── "unauthorized" in event → credentials expired or wrong
│   │   │   └── ✅ ACTION: Rotate / Recreate Image Pull Secret
│   │   │       Runbook: [imagepullbackoff.md](../../runbooks/kubernetes/imagepullbackoff.md)
│   │   │
│   │   └── Credentials look fine → check network access to registry
│   │
│   └── Can the node reach the registry?
│       `kubectl debug node/<node> -it --image=busybox -- curl -v https://<registry>`
│       │
│       ├── Connection refused / timeout → NetworkPolicy or firewall blocking egress
│       │   └── ✅ ACTION: Fix NetworkPolicy / Egress Rule
│       │       Runbook: [networkpolicy_block.md](../../runbooks/kubernetes/networkpolicy_block.md)
│       │
│       └── TLS error → certificate or proxy issue
│           └── ✅ ACTION: Fix Registry TLS / Configure HTTP Proxy
│
├── Pending
│   │   "Pod has been accepted but not scheduled to a node."
│   │
│   ├── Check why: `kubectl describe pod <pod> | grep -A10 "Events:"`
│   │   │
│   │   ├── "Insufficient cpu" / "Insufficient memory"
│   │   │   │
│   │   │   ├── Check node capacity: `kubectl describe nodes | grep -A5 "Allocated"`
│   │   │   │
│   │   │   ├── All nodes full → ✅ ACTION: Scale Node Group / Reduce Pod Requests
│   │   │   │
│   │   │   └── Requests too high → ✅ ACTION: Tune Pod Resource Requests
│   │   │
│   │   ├── "did not match node selector" / "node affinity"
│   │   │   `kubectl get pod <pod> -o yaml | grep -A10 "affinity:"`
│   │   │   └── ✅ ACTION: Fix Node Selector or Affinity Rules
│   │   │
│   │   ├── "had taint that the pod did not tolerate"
│   │   │   `kubectl get nodes -o json | jq '.items[].spec.taints'`
│   │   │   └── ✅ ACTION: Add Toleration to Pod or Remove Taint from Node
│   │   │
│   │   └── "persistentvolumeclaim is not bound" / "unbound PVC"
│   │       `kubectl get pvc -n <namespace>`
│   │       └── ✅ ACTION: Fix PVC / StorageClass
│   │
│   └── Pending for >10 min with no Events?
│       `kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20`
│       └── Check if scheduler is running: `kubectl get pods -n kube-system | grep scheduler`
│           └── Scheduler down → ⚠️ ESCALATION: Platform / Cluster Admin
│
├── Init:Error / Init:CrashLoopBackOff
│   │   "An init container is failing before the main container starts."
│   │
│   └── Identify which init container:
│       `kubectl describe pod <pod> | grep -A3 "Init Containers:"`
│       └── Get its logs: `kubectl logs <pod> -c <init-container-name>`
│           ├── DB migration failing → ✅ ACTION: Fix Migration Script / DB Schema
│           ├── Waiting for dependency that never comes up
│           │   └── ✅ ACTION: Fix Dependency / Check Service DNS
│           └── Permission denied on volume
│               └── ✅ ACTION: Fix Volume Permission / securityContext
│
└── OOMKilled
    │   "Pod ran out of memory while running."
    │
    ├── `kubectl describe pod <pod> | grep -i "OOM\|killed\|Reason"`
    ├── Was it a one-time spike? → Increase limit temporarily, monitor
    └── Is memory growing over time? → Memory leak
        `kubectl top pod <pod> -w` → watch it grow
        └── ✅ ACTION: Investigate Memory Leak / Tune JVM Heap
            Runbook: [oomkilled.md](../../runbooks/kubernetes/oom-kill.md)
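The "watch it grow" leak check in the OOMKilled branch boils down to one question: do successive memory samples only ever increase? A minimal bash sketch — the `is_growing` helper and the sample values are invented for illustration, not part of any kubectl tooling:

```shell
# Hypothetical sketch: flag a suspected leak if memory samples taken
# from successive `kubectl top pod <pod>` readings grow monotonically.
is_growing() {
  local prev=-1 v
  for v in "$@"; do
    if (( v <= prev )); then
      echo "no steady growth"
      return 0
    fi
    prev=$v
  done
  echo "leak suspected"
}

# Made-up MiB samples taken a minute apart:
is_growing 210 260 340 470
```

A sawtooth pattern (growth followed by drops) usually means normal GC behavior rather than a leak, which is why a strict monotonic test is a useful first filter.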

Node Details

Check 1: Pod status at a glance

Command: `kubectl get pod <name> -n <namespace> -o wide` and `kubectl describe pod <name> -n <namespace>`
What you're looking for: the STATUS column (CrashLoopBackOff, Pending, ImagePullBackOff, Init:0/1, OOMKilled), the RESTARTS count, and the Events section at the bottom of the describe output.
Common pitfall: `kubectl get pods` truncates long status strings. Always use `kubectl describe pod` for the full picture, especially the Events section.

Check 2: Exit codes from CrashLoopBackOff

Command: `kubectl describe pod <pod> | grep -A10 "Last State:"` — shows the exit code of the previous container run.
What you're looking for: Exit Code 1 = app error. Exit Code 137 = OOMKilled. Exit Code 139 = segfault. Exit Code 0 = container exited cleanly (wrong process type). Exit Code 126 = command found but not executable (permission denied). Exit Code 127 = command not found.
Common pitfall: the "Exit Code" shown by `kubectl describe` belongs to the previous run, not the current one. If the pod just restarted, the current container may briefly show "Running" before it crashes again.
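The exit-code table above can be captured in a small helper for triage scripts. This is a sketch: `explain_exit_code` is an invented name, though the 128+signal arithmetic (137 = 128 + SIGKILL, 139 = 128 + SIGSEGV) is the standard POSIX convention:

```shell
# Hypothetical helper: translate a container exit code into the likely
# cause named in this decision tree.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit - no foreground process (job, not server)" ;;
    1)   echo "application error - check previous logs" ;;
    137) echo "OOMKilled - 128 + SIGKILL(9)" ;;
    139) echo "segfault - 128 + SIGSEGV(11)" ;;
    2|126|127) echo "bad command - wrong entrypoint or permissions" ;;
    *)   echo "unknown - check kubectl describe pod" ;;
  esac
}

explain_exit_code 137
```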

Check 3: Previous container logs

Command: `kubectl logs <pod> --previous -n <namespace>`; for multi-container pods, `kubectl logs <pod> --previous -c <container> -n <namespace>`.
What you're looking for: error messages printed just before the process exited. Most apps log their fatal errors.
Common pitfall: if the container died before the app could write logs, `--previous` returns nothing. In that case, use `kubectl debug` to create a copy of the pod with an interactive shell and inspect the filesystem.

Check 4: ImagePullBackOff image inspection

Command: `kubectl describe pod <pod> | grep "Failed to pull"` — shows the actual registry error message, including "not found", "unauthorized", or "connection refused".
What you're looking for: the exact error from the registry. "manifest unknown" = tag doesn't exist. "unauthorized" = auth failure. "connection refused" = network issue.
Common pitfall: tags like `latest` may have been overwritten in the registry. Always use immutable digest-pinned or semver-tagged images in production.
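The error-string mapping above lends itself to a small classifier. A sketch only — `classify_pull_error` is an invented helper that matches just the strings listed in this check:

```shell
# Hypothetical classifier for the registry error strings listed above.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*)             echo "tag does not exist" ;;
    *unauthorized*)                   echo "auth failure - rotate pull secret" ;;
    *"connection refused"*|*timeout*) echo "network - check egress to registry" ;;
    *)                                echo "inspect full event text" ;;
  esac
}

classify_pull_error "Failed to pull image: rpc error: unauthorized"
```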

Check 5: Pending — scheduler Events

Command: `kubectl describe pod <pod> | tail -30` — the Events section shows the scheduler's rejection reason in full.
What you're looking for: "0/3 nodes are available: 3 Insufficient memory", "no nodes matched node selector", or "had volume node affinity conflict".
Common pitfall: nodes that are NotReady or tainted won't count as "available" in the scheduler message even if they have free resources. Check `kubectl get nodes` — a NotReady node looks like a resource shortage.
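The scheduler message always leads with "X/Y nodes are available", so the headline numbers can be pulled out mechanically. A sketch with an invented `schedulable_nodes` helper, shown against a sample message of that shape:

```shell
# Hypothetical parser: extract "X of Y" from a scheduler rejection line
# like "0/3 nodes are available: 3 Insufficient memory."
schedulable_nodes() {
  sed -n 's|^\([0-9][0-9]*\)/\([0-9][0-9]*\) nodes are available.*|\1 of \2|p' <<<"$1"
}

schedulable_nodes "0/3 nodes are available: 3 Insufficient memory."
```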

Check 6: Init container logs

Command: `kubectl logs <pod> -c <init-container-name>` — you must name the init container with `-c`. List them all with `kubectl get pod <pod> -o jsonpath='{.spec.initContainers[*].name}'`.
What you're looking for: init containers often run migrations, wait-for-service checks, or secret injection. Look for connection errors, SQL errors, or file permission errors.
Common pitfall: init containers run in sequence, so if the first one fails, the rest never run. Check `kubectl describe pod` to see which init container is at Init:Error.
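The sequencing pitfall above reduces to a simple rule: the first init container whose state is not "Completed" is the one to debug; everything after it never ran. A sketch with an invented `first_failing_init` helper that takes name=state pairs (the container names below are made up):

```shell
# Hypothetical sketch of the sequencing rule: init containers run in
# order, so the first one not in "Completed" state is the culprit.
first_failing_init() {
  local i=0 pair name state
  for pair in "$@"; do              # each argument: name=state
    name=${pair%%=*}
    state=${pair##*=}
    if [ "$state" != "Completed" ]; then
      echo "debug init container #$i: $name ($state)"
      return 0
    fi
    i=$((i+1))
  done
  echo "all init containers completed"
}

first_failing_init wait-for-db=Completed run-migrations=Error inject-secrets=Waiting
```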


Terminal Actions

Action: Fix ConfigMap / Secret / Environment Variable

Do:
1. Identify the missing or wrong env var from the crash logs.
2. Check what's currently set: `kubectl exec -it <pod> -- env | grep <VAR_NAME>` (if the pod is briefly running).
3. Update the ConfigMap: `kubectl edit configmap <name>` or apply an updated manifest.
4. For Secrets: `kubectl create secret generic <name> --from-literal=key=value --dry-run=client -o yaml | kubectl apply -f -`
5. Restart to pick up the changes: `kubectl rollout restart deployment/<name>`
Verify: the pod starts and stays Running, and `kubectl logs <pod>` shows no config errors.

Action: Increase Memory Limit

Do:
1. Check the current limit: `kubectl get pod <pod> -o jsonpath='{.spec.containers[0].resources.limits.memory}'`
2. Increase it by ~50%: `kubectl set resources deployment <name> --limits=memory=1Gi`
3. Watch the pod restart: `kubectl get pods -l app=<name> -w`
Verify: the pod stays Running, and `kubectl describe pod <pod>` shows no OOMKilled in lastState.
Runbook: oomkilled.md
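Step 2's 50% bump can be computed from the current value rather than hard-coded, assuming the limit is denominated in Mi. The `bump_memory_mi` helper is invented for illustration:

```shell
# Hypothetical helper: compute a 50% increase of a Mi-denominated
# memory limit, suitable for passing to `kubectl set resources`.
bump_memory_mi() {
  local mi=${1%Mi}        # strip the "Mi" suffix, keep the number
  echo "$(( mi + mi / 2 ))Mi"
}

bump_memory_mi 512Mi      # → 768Mi
```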

Action: Rotate / Recreate Image Pull Secret

Do:
1. Generate new credentials with your registry (e.g., `docker login`, GCR service account key).
2. Delete the old secret: `kubectl delete secret regcred -n <namespace>`
3. Create the new one: `kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<token> -n <namespace>`
4. Confirm the pod spec references it: `kubectl get deployment <name> -o yaml | grep imagePullSecrets`
Verify: `kubectl describe pod <new-pod>` shows a "Successfully pulled image" event.
Runbook: imagepullbackoff.md

Action: Fix Node Selector or Affinity Rules

Do:
1. Inspect the pod's affinity: `kubectl get pod <pod> -o yaml | grep -A20 "affinity:"`
2. Check node labels: `kubectl get nodes --show-labels`
3. If the label is missing from the nodes: `kubectl label node <node-name> <key>=<value>`
4. Or relax the affinity to `preferredDuringSchedulingIgnoredDuringExecution`.
Verify: the pod moves from Pending to Running, and `kubectl describe pod` no longer shows a scheduling failure.

Action: Fix PVC / StorageClass

Do:
1. `kubectl get pvc -n <namespace>` — check the STATUS column (should be Bound).
2. `kubectl describe pvc <name>` — look for "no storage class" or "volume node affinity conflict".
3. If the StorageClass is missing: `kubectl get sc` — confirm the StorageClass referenced in the PVC exists.
4. If there's a topology conflict: delete the PVC and PV and let them be recreated on the correct node.
Verify: `kubectl get pvc` shows STATUS = Bound and the pod starts.

Action: Fix Migration Script / DB Schema

Do:
1. `kubectl logs <pod> -c <init-container-name>` — get the full migration error.
2. Connect to the DB and check the current schema version.
3. Fix the migration script or mark the failed migration as resolved.
4. `kubectl rollout restart deployment/<name>` to retry the init container.
Verify: `kubectl describe pod <new-pod>` shows the init container completed successfully (exit 0).

Escalation: Platform / Cluster Admin

When: the Kubernetes scheduler is not running, the API server is unreachable, or system pods are crashing.
Who: platform / cluster admin team.
Include in page: output of `kubectl get pods -n kube-system`, `kubectl get nodes`, and any error from `kubectl describe pod <failing-pod>`.

Escalation: Engage App Owner with Core Dump

When: exit code 139 (segfault) — this requires developer investigation of the binary.
Who: application development team.
Include in page: pod name, image tag, node name, and `dmesg | grep segfault` output from the node.


Edge Cases

  • Pod in Terminating state forever: A finalizer is blocking deletion. kubectl patch pod <pod> -p '{"metadata":{"finalizers":[]}}' --type=merge. Use only when the pod is truly stuck.
  • Pod runs fine locally but crashes in cluster: Usually a missing environment variable, missing secret, or different resource limits. Compare docker run env vs pod spec.
  • Multiple pods crash simultaneously after a rolling update: the new image is broken. Run kubectl rollout undo deployment/<name> immediately rather than debugging individual pods.
  • CrashLoopBackOff with 5+ minute backoff: Kubernetes backs off exponentially (10s, 20s, 40s... up to 5 min). Use kubectl debug or temporarily set restartPolicy: Never to hold the pod for inspection.
  • Init container waits for service that isn't available: If using a wait-for-it pattern and the service is down, all pods will be stuck. Fix the dependency first, not the init container.
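The backoff arithmetic in the CrashLoopBackOff edge case above can be reproduced exactly: delays double from 10s and cap at 300s (5 min). A small sketch — the `backoff_schedule` helper is invented, but the doubling-with-cap schedule is the behavior described in the bullet:

```shell
# Sketch of the CrashLoopBackOff delay schedule described above:
# doubling from 10s, capped at 300s (5 minutes).
backoff_schedule() {
  local d=10 n=$1 out="" i
  for ((i = 0; i < n; i++)); do
    out+="${d}s "
    d=$((d * 2))
    if (( d > 300 )); then d=300; fi
  done
  echo "${out% }"
}

backoff_schedule 7   # → 10s 20s 40s 80s 160s 300s 300s
```

This is why a pod that has been crashing for a while can sit idle for five minutes between attempts; use `kubectl delete pod` to force an immediate retry, or `kubectl debug` to hold a copy for inspection.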

Cross-References