# Runbook: Pod CrashLoopBackOff
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | container_restart_rate > 5/min or pod status = CrashLoopBackOff |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
## Quick Assessment (30 seconds)

```bash
# Run this first — it tells you the scope of the problem
kubectl get pods -n <NAMESPACE> --field-selector=status.phase!=Running -o wide
```
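When several pods are unhealthy, it helps to see the breakdown by status before diving into one pod. A minimal sketch (the `summarize_pods` helper is hypothetical; it assumes the default `kubectl get pods` column layout, with STATUS in column 3):

```shell
#!/bin/sh
# Count pods per STATUS from `kubectl get pods`-style output.
# Skips the header row; STATUS is the third column in the default layout.
summarize_pods() {
  awk 'NR > 1 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Canned example; in practice pipe the quick-assessment command into it:
#   kubectl get pods -n <NAMESPACE> -o wide | summarize_pods
summarize_pods <<'EOF'
NAME    READY   STATUS             RESTARTS   AGE
app-1   0/1     CrashLoopBackOff   12         30m
app-2   0/1     CrashLoopBackOff   11         30m
db-0    1/1     Running            0          2d
EOF
# prints:
#   CrashLoopBackOff 2
#   Running 1
```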
## Step 1: Check Pod Status and Restart Count

Why: Confirms the CrashLoopBackOff and shows how long it has been failing, which sets urgency.

```bash
kubectl get pod <POD_NAME> -n <NAMESPACE> -o wide
kubectl get events -n <NAMESPACE> --field-selector=involvedObject.name=<POD_NAME> --sort-by='.lastTimestamp'
```

If this fails: Verify you are on the right cluster with `kubectl config get-contexts` and confirm the pod name with `kubectl get pods -n <NAMESPACE>`.
## Step 2: Get Previous Container Logs

Why: Current logs often show nothing because the container crashed immediately. The -p flag retrieves the logs from the last terminated instance, which contains the actual error.

```bash
# The actual error — stack trace, panic message, missing file, failed config parse, etc.
kubectl logs <POD_NAME> -n <NAMESPACE> -p
```

Expected output (example):

```
Error: cannot load config: open /etc/app/config.yaml: no such file or directory
```

If this fails: Run `kubectl describe pod <POD_NAME> -n <NAMESPACE>` and check for init container failures.
## Step 3: Describe Pod for Exit Codes and Events

Why: The exit code tells you the category of failure. Exit code 1 = app crash, exit code 137 = OOMKilled (see oom-kill.md), exit code 126/127 = bad entrypoint/command.

```bash
kubectl describe pod <POD_NAME> -n <NAMESPACE>
```

Expected output — look for this section:

```
Last State:   Terminated
  Reason:     Error
  Exit Code:  1
  Started:    Thu, 19 Mar 2026 10:00:00 +0000
  Finished:   Thu, 19 Mar 2026 10:00:02 +0000
```

If the exit code is 126 or 127: check `spec.containers[].command` and `spec.containers[].args` for typos.
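The exit-code categories in this step follow the usual 128+signal convention and can be wrapped in a small triage helper. A sketch (the `triage_exit_code` function is hypothetical):

```shell
#!/bin/sh
# Map a container exit code to the failure category used in this step.
triage_exit_code() {
  case "$1" in
    0)   echo "clean exit (suspect restartPolicy, see Tips)" ;;
    1)   echo "application crash (read previous logs, Step 2)" ;;
    126) echo "command not executable (bad entrypoint)" ;;
    127) echo "command not found (bad command/args)" ;;
    137) echo "SIGKILL / OOMKilled, 128+9 (see oom-kill.md)" ;;
    143) echo "SIGTERM, 128+15 (container was asked to stop)" ;;
    *)   echo "unknown exit code $1 (read previous logs)" ;;
  esac
}

# The code itself can be pulled straight from the pod status:
#   kubectl get pod <POD_NAME> -n <NAMESPACE> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
triage_exit_code 137   # prints: SIGKILL / OOMKilled, 128+9 (see oom-kill.md)
```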
## Step 4: Check for OOMKill vs Application Crash

Why: OOMKill and app crashes require completely different fixes. Confusing them wastes time.

```bash
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```
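The branch this step drives can be scripted against the jsonpath output above. A sketch (the `route_crash` helper is hypothetical):

```shell
#!/bin/sh
# Branch on the kubelet-reported termination reason: the jsonpath query
# above prints OOMKilled for memory kills, Error for ordinary crashes.
route_crash() {
  case "$1" in
    OOMKilled) echo "memory problem: follow oom-kill.md" ;;
    Error)     echo "app crash: fix root cause (Step 5)" ;;
    "")        echo "no terminated state recorded yet" ;;
    *)         echo "other reason: $1 (describe the pod)" ;;
  esac
}

# REASON=$(kubectl get pod <POD_NAME> -n <NAMESPACE> \
#   -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')
route_crash OOMKilled   # prints: memory problem: follow oom-kill.md
```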
## Step 5: Fix the Root Cause

Why: The logs from Step 2 should now tell you what is wrong. Common causes and their fixes are below.

Case A — Missing ConfigMap or Secret:

```bash
# Check if the referenced ConfigMap exists
kubectl get configmap <CONFIGMAP_NAME> -n <NAMESPACE>
# Check if the referenced Secret exists
kubectl get secret <SECRET_NAME> -n <NAMESPACE>
# If missing, create it — example for ConfigMap from file
kubectl create configmap <CONFIGMAP_NAME> --from-file=<CONFIG_FILE_PATH> -n <NAMESPACE>
```

Case B — Wrong image tag or image does not exist:

```bash
# Check the current image
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].image}'
# Fix the image tag
kubectl set image deployment/<DEPLOYMENT_NAME> <CONTAINER_NAME>=<IMAGE>:<CORRECT_TAG> -n <NAMESPACE>
```

Case C — Insufficient resource limits causing startup failure:

```bash
# Check current limits
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Edit deployment to adjust limits
kubectl edit deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
# Under resources.limits, increase memory or cpu as needed
```

Expected output after fix: Pod transitions from CrashLoopBackOff to Running within 1-2 minutes.

If this fails: The root cause may be in application code or a dependency (database unreachable, upstream service down) — check app-level logs carefully and check dependent services.
## Step 6: Trigger Rollout and Confirm

Why: Some fixes (like ConfigMap changes) require a pod restart to take effect. A rollout ensures a clean restart and records the change in deployment history.

```bash
kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
```

Expected output:

```
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
```

If this fails: Roll back with `kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>` and see deploy-stuck.md.
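The restart, wait, undo sequence in this step can be wrapped in one helper so a stalled rollout is rolled back automatically. A sketch (the `restart_and_watch` function is hypothetical; it shells out to the same kubectl commands shown above):

```shell
#!/bin/sh
# Restart a deployment, wait for the rollout, and roll back if it stalls.
# Returns non-zero on any failure so callers can escalate.
restart_and_watch() {
  deploy=$1
  ns=$2
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  if ! kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=5m; then
    echo "rollout stalled, rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
    return 1
  fi
  echo "rollout complete"
}

# Usage: restart_and_watch <DEPLOYMENT_NAME> <NAMESPACE>
```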
## Verification

Success looks like: All pods show Running, the READY column shows 1/1 (or N/N for multi-container pods), and the restart count is not incrementing.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes CrashLoopBackOff in `<NAMESPACE>`, not resolved after 30 minutes" |
| Data loss suspected (stateful workload) | Platform Lead | "Data loss risk: stateful pod `<POD_NAME>` in CrashLoopBackOff" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: CrashLoop spreading, possible shared dependency failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add the root cause to the team's known issues log
- Verify that the fix is reflected in the deployment manifest in git (not just patched live)
## Common Mistakes

- Deleting the pod instead of fixing the deployment: Deleting a crashing pod causes Kubernetes to immediately recreate it from the same broken deployment spec. The crash loop continues. Always fix the Deployment (or StatefulSet/DaemonSet) spec, not just the pod.
- Reading only the current container logs: A CrashLoopBackOff pod restarts repeatedly. The current container logs are often empty or show only startup output because the container crashed within seconds. Always use `kubectl logs -p` to get the previous (terminated) container's logs — that is where the actual error lives.
- Assuming the fix is applied: After editing a Deployment, always run `kubectl rollout status` to confirm the new pods rolled out successfully. A broken admission webhook or image pull error can silently stall the rollout.
## Tips and Gotchas

- CrashLoopBackOff is not a distinct error — it means "container crashed and kubelet is backing off before restarting"
- The back-off timer resets after a pod runs successfully for 10 minutes
- `--previous` (`-p`) only works if there was a previous container instance; on first crash after pod creation, only current logs exist
- A container exiting with code 0 can still cause CrashLoopBackOff if `restartPolicy: Always` and the process isn't meant to exit
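The back-off growth behind CrashLoopBackOff can be tabulated: by default the kubelet waits 10s after the first crash and doubles the delay on each subsequent crash, capping at 300s (5 minutes). A sketch (the `crashloop_delay` helper is hypothetical):

```shell
#!/bin/sh
# Back-off before the Nth restart: 10s, doubling each crash, capped at 300s.
# (The kubelet resets this timer after 10 minutes of successful running.)
crashloop_delay() {
  n=$1; d=10; i=1
  while [ "$i" -lt "$n" ]; do
    d=$((d * 2))
    if [ "$d" -gt 300 ]; then d=300; fi
    i=$((i + 1))
  done
  echo "$d"
}

for n in 1 2 3 4 5 6 7; do
  printf 'restart %d: wait %ss\n' "$n" "$(crashloop_delay "$n")"
done
# restarts 6 and later wait the capped 300s (5 minutes)
```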
## Cross-References
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: oom-kill.md — if exit code is 137
- Related Runbook: deploy-stuck.md — if rollout stalls after fix
- Related Runbook: pvc-pending.md — if crash is caused by unbound volume
- Troubleshooting Guide: training/library/guides/troubleshooting.md (CrashLoopBackOff section)
- Lab: training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
- Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
- Incident Scenario: training/interactive/incidents/scenarios/crashloop-bad-command.sh