CrashLoopBackOff - Street-Level Ops¶
Real-world workflows for diagnosing and fixing CrashLoopBackOff in production.
Triage: What Is Crashing and Why¶
# See all crashing pods across namespaces
kubectl get pods -A | grep CrashLoopBackOff
# Quick view: name, restart count, age
kubectl get pods -n production -o wide | grep -E 'CrashLoop|BackOff'
# Get the exit code without scrolling through describe
kubectl get pod payment-api-7d4f8b6-x2k9f -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Output: 1
# Get the reason
kubectl get pod payment-api-7d4f8b6-x2k9f -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Output: Error
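The grep-based triage above matches any line mentioning CrashLoopBackOff; when you need just the pod names (e.g. to loop over them), filtering on the STATUS column with awk is cleaner. A minimal sketch with sample kubectl output inlined — in a real cluster, pipe `kubectl get pods -n production --no-headers` in instead:

```shell
# Sketch: extract only the names of crashing pods from kubectl get pods output
# (sample output inlined here; replace with the real command in a cluster)
pods='NAME                        READY   STATUS             RESTARTS   AGE
payment-api-7d4f8b6-x2k9f   0/1     CrashLoopBackOff   12         34m
worker-5c9d8-abcde          1/1     Running            0          2d'
crashing=$(echo "$pods" | awk '$3 == "CrashLoopBackOff" {print $1}')
echo "$crashing"
# payment-api-7d4f8b6-x2k9f
```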
Under the hood: CrashLoopBackOff is not a pod status — it is a waiting state. The kubelet uses exponential backoff for restarts: 10s, 20s, 40s, 80s, 160s, capped at 300s (5 minutes). After a container runs successfully for 10 minutes, the backoff timer resets to 10s. This means a pod that crashes every 6 minutes never escapes the backoff penalty.
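That schedule is just a doubling delay with a cap, which you can reproduce numerically:

```shell
# Sketch: the kubelet restart backoff schedule — 10s base, doubling, capped at 300s
delay=10
schedule=""
for restart in 1 2 3 4 5 6 7; do
  schedule="${schedule}${delay}s "
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "$schedule"
# 10s 20s 40s 80s 160s 300s 300s
```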
Read the Logs (Most Crashes End Here)¶
# Logs from the PREVIOUS (crashed) container — this is the key flag
kubectl logs payment-api-7d4f8b6-x2k9f --previous
# If multi-container pod, specify the container
kubectl logs payment-api-7d4f8b6-x2k9f -c payment-api --previous
# Tail logs from the current (possibly still starting) container
kubectl logs payment-api-7d4f8b6-x2k9f --follow --tail=100
# Grab logs from all pods behind a deployment at once — kubectl logs deployment/<name>
# only picks one pod, so select by label instead (label assumed to be app=payment-api)
kubectl logs -l app=payment-api --previous --prefix
Exit Code Decode¶
# Exit code 1 — app error, read the logs
# Exit code 126 — permission denied on entrypoint binary
# Exit code 127 — entrypoint not found (wrong image tag or bad command)
# Exit code 137 — OOMKilled or SIGKILL, check describe for reason
# Exit code 139 — segfault
# Exit code 143 — graceful SIGTERM (often a liveness probe restart)
# If exit code 137, confirm OOM vs manual kill
kubectl describe pod payment-api-7d4f8b6-x2k9f | grep -A3 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Missing Config or Secret¶
# Check if the pod references a secret or configmap that does not exist
kubectl describe pod myapp-abc123 | grep -A 10 "Environment"
# List secrets in the namespace — is the expected one present?
kubectl get secrets -n production
# Check if the key exists inside the secret
kubectl get secret db-credentials -o jsonpath='{.data}' | jq 'keys'
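Secret values in .data are base64-encoded, so a correct-looking key can still hold a wrong value. Decode one to confirm it contains what the app expects — the key name and value below are placeholders:

```shell
# Decode a single secret value locally (.data values are base64-encoded)
# In a real cluster: kubectl get secret db-credentials -o jsonpath='{.data.DB_HOST}' | base64 -d
encoded='cG9zdGdyZXM6NTQzMg=='   # placeholder encoded value
decoded=$(echo "$encoded" | base64 -d)
echo "$decoded"
# postgres:5432
```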
Debug a Container That Crashes Too Fast¶
# Override entrypoint — keep it alive with sleep so you can exec in
kubectl run debug-myapp --image=myapp:v2.3.1 --restart=Never \
--command -- sleep 3600
kubectl exec -it debug-myapp -- /bin/sh
# Now manually run the app entrypoint and watch it fail
# Ephemeral debug container (K8s 1.23+)
kubectl debug -it payment-api-7d4f8b6-x2k9f --image=busybox:1.36 \
--target=payment-api
# Check filesystem, env, and network from the debug container
env | grep DATABASE
cat /app/config.yaml
nc -z postgres-svc 5432
Remember: Exit code mnemonic: 1 = app said "I broke," 137 = someone said "you're killed" (128 + signal 9 = SIGKILL, usually OOM), 143 = someone said "please stop" (128 + signal 15 = SIGTERM, usually liveness probe). If exit code > 128, subtract 128 to get the signal number.
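The subtraction in that mnemonic is trivial to script, and kill -l turns the signal number back into a name (name formatting varies slightly by shell):

```shell
# Decode an exit code > 128 into its signal number
code=137
sig=$((code - 128))
echo "exit $code = 128 + signal $sig"
# exit 137 = 128 + signal 9
kill -l "$sig"   # prints the signal name, e.g. KILL
```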
Liveness Probe Killing the Pod¶
# Check events for liveness probe failures
kubectl describe pod myapp-abc123 | grep -i "liveness\|unhealthy\|killing"
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Normal Killing Container myapp failed liveness probe, will be restarted
# Exec in and test the probe endpoint manually
kubectl exec -it myapp-abc123 -- curl -v http://localhost:8080/healthz
# Fix: add a startup probe for slow-starting apps
# startupProbe:
# httpGet:
# path: /healthz
# port: 8080
# failureThreshold: 30
# periodSeconds: 10
Default trap: If you do not define a startupProbe, the livenessProbe starts checking immediately. Java/Spring apps that take 30+ seconds to start get killed by the liveness probe before they finish initializing, creating a CrashLoopBackOff that looks like the app is broken. Always add a startupProbe with failureThreshold * periodSeconds >= your worst-case startup time.
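A quick sanity check for the budget that rule gives you — the startup time here is an assumed measurement, substitute your own:

```shell
# startupProbe budget = failureThreshold * periodSeconds; compare to worst-case startup
failure_threshold=30
period_seconds=10
worst_case_startup=120   # assumed measured worst-case startup, in seconds
budget=$((failure_threshold * period_seconds))
if [ "$budget" -ge "$worst_case_startup" ]; then
  echo "OK: ${budget}s budget covers ${worst_case_startup}s startup"
else
  echo "RISK: ${budget}s budget is under ${worst_case_startup}s startup"
fi
# OK: 300s budget covers 120s startup
```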
Init Container Stuck¶
# Check if the init container is blocking the main container
kubectl get pod myapp-abc123 -o jsonpath='{.status.initContainerStatuses[*].state}'
# Read init container logs
kubectl logs myapp-abc123 -c wait-for-db
# Common pattern: init container waiting for a service
# Fix the dependency or check DNS resolution
kubectl exec -it myapp-abc123 -c wait-for-db -- nslookup postgres-svc
Batch Fix: Restart a Stuck Deployment¶
# Rolling restart — creates new pods with fresh containers
kubectl rollout restart deployment/payment-api -n production
# Watch the rollout
kubectl rollout status deployment/payment-api -n production
# If the new pods also crash, rollback
kubectl rollout undo deployment/payment-api -n production
Gotcha: kubectl rollout undo reverts to the previous ReplicaSet, but if the previous version ALSO had the bug (e.g., you deployed v2 with a bug, then deployed v3 with a different bug), undo rolls back to v2, which is still broken. Use kubectl rollout history deployment/<name> and kubectl rollout undo deployment/<name> --to-revision=<N> to target a known-good revision.
Events Timeline¶
# All events for a specific pod, sorted by time
kubectl get events --sort-by='.lastTimestamp' \
--field-selector involvedObject.name=payment-api-7d4f8b6-x2k9f
# All warning events in a namespace
kubectl get events -n production --field-selector type=Warning --sort-by='.lastTimestamp'
Quick Checklist¶
# 1. What is the exit code?
kubectl get pod $POD -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# 2. What do the logs say?
kubectl logs $POD --previous
# 3. Is it OOMKilled?
kubectl describe pod $POD | grep OOMKilled
# 4. Is it a probe failure?
kubectl describe pod $POD | grep -i "unhealthy\|probe failed"
# 5. Is a config/secret missing?
kubectl describe pod $POD | grep -i "configmap\|secret\|not found"
# 6. Can the container reach its dependencies?
kubectl exec -it $POD -- nc -z db-service 5432
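Step 1 of the checklist can be wired directly to the exit-code table as a small helper. The mapping below mirrors the decode table above; treat its output as a first guess, not a verdict:

```shell
# Map a container exit code to a first diagnostic guess (per the decode table)
decode_exit() {
  case "$1" in
    1)   echo "app error - read kubectl logs --previous" ;;
    126) echo "permission denied on entrypoint binary" ;;
    127) echo "entrypoint not found - check image tag and command" ;;
    137) echo "SIGKILL - check describe for OOMKilled" ;;
    139) echo "segfault" ;;
    143) echo "SIGTERM - often a liveness probe restart" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "unknown - read the logs"
      fi
      ;;
  esac
}
decode_exit 137
# SIGKILL - check describe for OOMKilled
```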
Quick Reference¶
- Runbook: CrashLoopBackOff