
Runbook: Pod CrashLoopBackOff

| Field | Value |
| --- | --- |
| Domain | Kubernetes |
| Alert | container_restart_rate > 5/min or pod status = CrashLoopBackOff |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get pods -n <NAMESPACE> --field-selector=status.phase!=Running -o wide
  • If output shows multiple pods across different deployments: this may be a node or cluster-wide problem, see node-not-ready.md
  • If output shows a single pod or all pods from one deployment: continue with the steps below
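The blast-radius check can also be scripted by tallying non-Running pods per namespace. A minimal sketch — in a live cluster the input would come from `kubectl get pods -A --field-selector=status.phase!=Running`; here a hypothetical sample is stubbed in so the pipeline itself is visible:

```shell
# Count non-Running pods per namespace to gauge blast radius.
# Live input would be:
#   kubectl get pods -A --field-selector=status.phase!=Running
# Stubbed sample output (hypothetical namespaces) for illustration:
sample='NAMESPACE   NAME        READY   STATUS             RESTARTS   AGE
payments    api-1       0/1     CrashLoopBackOff   12         8m
payments    api-2       0/1     CrashLoopBackOff   11         8m
logging     fluentd-x   0/1     Error              3          2m'

# Skip the header row, tally column 1 (the namespace), print the counts
echo "$sample" | awk 'NR>1 {count[$1]++} END {for (ns in count) print ns": "count[ns]}'
```

If the tally spans more than one namespace, lean toward the node/cluster-wide path rather than a single bad deployment.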

Step 1: Check Pod Status and Restart Count

Why: Confirms the CrashLoopBackOff and shows how long it has been failing, which sets urgency.

kubectl get pod <POD_NAME> -n <NAMESPACE> -o wide
kubectl get events -n <NAMESPACE> --field-selector=involvedObject.name=<POD_NAME> --sort-by='.lastTimestamp'
Expected output:
NAME          READY   STATUS             RESTARTS   AGE
my-app-xyz    0/1     CrashLoopBackOff   12         8m
If this fails: Verify you are in the correct namespace with kubectl config get-contexts and confirm the pod name with kubectl get pods -n <NAMESPACE>.

Step 2: Get Previous Container Logs

Why: Current logs often show nothing because the container crashed immediately. The -p (--previous) flag retrieves the logs from the last terminated instance, which contain the actual error.

kubectl logs <POD_NAME> -n <NAMESPACE> -p --tail=100
Expected output:
# The actual error — stack trace, panic message, missing file, failed config parse, etc.
Error: cannot load config: open /etc/app/config.yaml: no such file or directory
If this fails: The pod may not have completed even one start cycle yet. Wait 30 seconds and retry, or check kubectl describe pod <POD_NAME> -n <NAMESPACE> for init container failures.

Step 3: Describe Pod for Exit Codes and Events

Why: The exit code tells you the category of failure. Exit code 1 = application crash, exit code 137 = killed with SIGKILL (128 + 9), usually OOMKilled (see oom-kill.md), exit code 126/127 = bad entrypoint/command.

kubectl describe pod <POD_NAME> -n <NAMESPACE>
Expected output — look for this section:
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Thu, 19 Mar 2026 10:00:00 +0000
  Finished:     Thu, 19 Mar 2026 10:00:02 +0000
  • If exit code is 137: this is an OOMKill; stop here and follow oom-kill.md instead.
  • If exit code is 1: application-level crash; continue to Step 4.
  • If exit code is 126 or 127: bad command or entrypoint in the container spec; check spec.containers[].command and spec.containers[].args.
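The exit-code triage in this step can be captured in a small helper. This is a sketch only — the function name classify_exit_code is made up for illustration, and the messages paraphrase the branches above:

```shell
# classify_exit_code: map a container exit code to the triage branch above
classify_exit_code() {
  case "$1" in
    137)     echo "OOMKilled (SIGKILL) - follow oom-kill.md" ;;
    126|127) echo "bad command or entrypoint - check spec.containers[].command and args" ;;
    0)       echo "clean exit - check restartPolicy" ;;
    *)       echo "application crash - read previous logs with kubectl logs -p" ;;
  esac
}

classify_exit_code 137   # prints the OOMKilled branch
classify_exit_code 1     # prints the application-crash branch
```

Feed it the Exit Code value from kubectl describe pod to pick the right branch quickly.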

Step 4: Check for OOMKill vs Application Crash

Why: OOMKill and app crashes require completely different fixes. Confusing them wastes time.

kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
Expected output (OOMKill):
OOMKilled
Expected output (app crash):
Error
  • If output is OOMKilled: follow oom-kill.md; do not continue here.
  • If output is Error: continue to Step 5.

Step 5: Fix the Root Cause

Why: The logs from Step 2 should now tell you what is wrong. Common causes and their fixes are below.

Case A — Missing ConfigMap or Secret:

# Check if the referenced ConfigMap exists
kubectl get configmap <CONFIGMAP_NAME> -n <NAMESPACE>
# Check if the referenced Secret exists
kubectl get secret <SECRET_NAME> -n <NAMESPACE>
# If missing, create it — example for ConfigMap from file
kubectl create configmap <CONFIGMAP_NAME> --from-file=<CONFIG_FILE_PATH> -n <NAMESPACE>

Case B — Wrong image tag or image does not exist:

# Check the current image
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].image}'
# Fix the image tag
kubectl set image deployment/<DEPLOYMENT_NAME> <CONTAINER_NAME>=<IMAGE>:<CORRECT_TAG> -n <NAMESPACE>

Case C — Insufficient resource limits causing startup failure:

# Check current limits
kubectl get deployment <DEPLOYMENT_NAME> -n <NAMESPACE> -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Edit deployment to adjust limits
kubectl edit deployment <DEPLOYMENT_NAME> -n <NAMESPACE>
# Under resources.limits, increase memory or cpu as needed

Expected output after fix: the pod transitions from CrashLoopBackOff to Running within 1-2 minutes.
If this fails: the root cause may be in application code or a dependency (database unreachable, upstream service down); check app-level logs carefully and check dependent services.

Step 6: Trigger Rollout and Confirm

Why: Some fixes (like ConfigMap changes) require a pod restart to take effect. A rollout ensures a clean restart and records the change in deployment history.

kubectl rollout restart deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
kubectl rollout status deployment/<DEPLOYMENT_NAME> -n <NAMESPACE> --timeout=5m
Expected output:
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "my-app" successfully rolled out
If this fails: The new pods are also crashing — recheck logs on the new pods with Step 2, or rollback with kubectl rollout undo deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>.

Verification

# Confirm the issue is resolved
kubectl get pods -n <NAMESPACE> -l app=<APP_LABEL> -w
Success looks like: all pods show Running, the READY column shows 1/1 (or N/N for multi-container pods), and the restart count is not incrementing.
If still broken: escalate, see below.
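In a script, where an interactive watch (-w) is awkward, the same check can be polled on a timeout. A sketch under stated assumptions: wait_for_running and fake_phase are made-up names, and in practice the status command would be something like `kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.phase}'`:

```shell
# wait_for_running CMD TIMEOUT_SECONDS
# Polls CMD once per second until it prints "Running" or the timeout elapses.
wait_for_running() {
  cmd=$1
  timeout=$2
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if [ "$($cmd)" = "Running" ]; then return 0; fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Stubbed example: a fake status command that always reports Running
fake_phase() { echo "Running"; }
wait_for_running fake_phase 5 && echo "pod is Running"
```

A nonzero return after the timeout means the pod never reached Running, which is your cue to escalate.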

Escalation

| Condition | Who to Page | What to Say |
| --- | --- | --- |
| Not resolved in 30 min | SRE on-call | "Kubernetes CrashLoopBackOff in <NAMESPACE>, pod <POD_NAME>, exit code <EXIT_CODE>, runbook exhausted" |
| Data loss suspected (stateful workload) | Platform Lead | "Data loss risk: stateful pod in CrashLoop in <NAMESPACE>, possible volume corruption" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: CrashLoop spreading, possible shared dependency failure" |

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Add the root cause to the team's known issues log
  • Verify that the fix is reflected in the deployment manifest in git (not just patched live)

Common Mistakes

  1. Deleting the pod instead of fixing the deployment: Deleting a crashing pod causes Kubernetes to immediately recreate it from the same broken deployment spec. The crash loop continues. Always fix the Deployment (or StatefulSet/DaemonSet) spec, not just the pod.
  2. Reading only the current container logs: A CrashLoopBackOff pod restarts repeatedly. The current container logs are often empty or show only startup output because the container crashed within seconds. Always use kubectl logs -p to get the previous (terminated) container's logs — that is where the actual error lives.
  3. Assuming the fix is applied: After editing a Deployment, always run kubectl rollout status to confirm the new pods rolled out successfully. A broken admission webhook or image pull error can silently stall the rollout.

Tips and Gotchas

  • CrashLoopBackOff is not a distinct error -- it means "container crashed and kubelet is backing off before restarting"
  • The back-off timer resets after a pod runs successfully for 10 minutes
  • --previous only works if there was a previous container instance; on first crash after pod creation, only current logs exist
  • A container exiting with code 0 can still cause CrashLoopBackOff if restartPolicy: Always and the process isn't meant to exit
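For reference, the kubelet's restart back-off is exponential: per the Kubernetes documentation it starts at 10 seconds, doubles on each crash, and caps at 5 minutes (and, as the tip above notes, resets after 10 minutes of successful running). The schedule for the first few restarts can be sketched as:

```shell
# Print the CrashLoopBackOff delay for the first few restarts:
# 10s, doubling each time, capped at 300s (5 minutes)
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: back-off ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

This is why a pod that has been crashing for a while appears to "hang" in CrashLoopBackOff for minutes between attempts: the kubelet is deliberately waiting out the back-off, not stuck.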

Cross-References

  • Survival Guide: On-Call Survival Guide (pocket card version)
  • Topic Pack: Kubernetes Topics (deep background)
  • Related Runbook: oom-kill.md — if exit code is 137
  • Related Runbook: deploy-stuck.md — if rollout stalls after fix
  • Related Runbook: pvc-pending.md — if crash is caused by unbound volume
  • Troubleshooting Guide: training/library/guides/troubleshooting.md (CrashLoopBackOff section)
  • Lab: training/interactive/runtime-labs/lab-runtime-01-rollout-probe-failure/
  • Lab: training/interactive/runtime-labs/lab-runtime-08-resource-limits-oom/
  • Incident Scenario: training/interactive/incidents/scenarios/crashloop-bad-command.sh
