CrashLoopBackOff - Street-Level Ops¶
Real-world workflows for diagnosing and fixing CrashLoopBackOff in production.
Triage: What Is Crashing and Why¶
# See all crashing pods across namespaces
kubectl get pods -A | grep CrashLoopBackOff
# Quick view: name, restart count, age
kubectl get pods -n production -o wide | grep -E 'CrashLoop|BackOff'
# Get the exit code without scrolling through describe
kubectl get pod payment-api-7d4f8b6-x2k9f -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Output: 1
# Get the reason
kubectl get pod payment-api-7d4f8b6-x2k9f -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Output: Error
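The grep-based triage above matches any line mentioning CrashLoopBackOff; when you need just the pod names (e.g. to loop over them), filtering on the STATUS column with awk is cleaner. A minimal sketch with sample kubectl output inlined — in a real cluster, pipe `kubectl get pods -n production --no-headers` in instead:

```shell
# Sketch: extract only the names of crashing pods from kubectl get pods output
# (sample output inlined here; replace with the real command in a cluster)
pods='NAME                        READY   STATUS             RESTARTS   AGE
payment-api-7d4f8b6-x2k9f   0/1     CrashLoopBackOff   12         34m
worker-5c9d8-abcde          1/1     Running            0          2d'
crashing=$(echo "$pods" | awk '$3 == "CrashLoopBackOff" {print $1}')
echo "$crashing"
# payment-api-7d4f8b6-x2k9f
```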
Under the hood: CrashLoopBackOff is not a pod status — it is a waiting state. The kubelet uses exponential backoff for restarts: 10s, 20s, 40s, 80s, 160s, capped at 300s (5 minutes). After a container runs successfully for 10 minutes, the backoff timer resets to 10s. This means a pod that crashes every 6 minutes never escapes the backoff penalty.
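That schedule is just a doubling delay with a cap, which you can reproduce numerically:

```shell
# Sketch: the kubelet restart backoff schedule — 10s base, doubling, capped at 300s
delay=10
schedule=""
for restart in 1 2 3 4 5 6 7; do
  schedule="${schedule}${delay}s "
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "$schedule"
# 10s 20s 40s 80s 160s 300s 300s
```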
Read the Logs (Most Crashes End Here)¶
# Logs from the PREVIOUS (crashed) container — this is the key flag
kubectl logs payment-api-7d4f8b6-x2k9f --previous
# If multi-container pod, specify the container
kubectl logs payment-api-7d4f8b6-x2k9f -c payment-api --previous
# Tail logs from the current (possibly still starting) container
kubectl logs payment-api-7d4f8b6-x2k9f --follow --tail=100
# Grab logs from all pods behind a deployment at once — kubectl logs deployment/<name>
# only picks one pod, so select by label instead (label assumed to be app=payment-api)
kubectl logs -l app=payment-api --previous --prefix
Exit Code Decode¶
# Exit code 1 — app error, read the logs
# Exit code 126 — permission denied on entrypoint binary
# Exit code 127 — entrypoint not found (wrong image tag or bad command)
# Exit code 137 — OOMKilled or SIGKILL, check describe for reason
# Exit code 139 — segfault
# Exit code 143 — graceful SIGTERM (often a liveness probe restart)
# If exit code 137, confirm OOM vs manual kill
kubectl describe pod payment-api-7d4f8b6-x2k9f | grep -A3 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Missing Config or Secret¶
# Check if the pod references a secret or configmap that does not exist
kubectl describe pod myapp-abc123 | grep -A 10 "Environment"
# List secrets in the namespace — is the expected one present?
kubectl get secrets -n production
# Check if the key exists inside the secret
kubectl get secret db-credentials -o jsonpath='{.data}' | jq 'keys'
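Secret values in .data are base64-encoded, so a correct-looking key can still hold a wrong value. Decode one to confirm it contains what the app expects — the key name and value below are placeholders:

```shell
# Decode a single secret value locally (.data values are base64-encoded)
# In a real cluster: kubectl get secret db-credentials -o jsonpath='{.data.DB_HOST}' | base64 -d
encoded='cG9zdGdyZXM6NTQzMg=='   # placeholder encoded value
decoded=$(echo "$encoded" | base64 -d)
echo "$decoded"
# postgres:5432
```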
Debug a Container That Crashes Too Fast¶
# Override entrypoint — keep it alive with sleep so you can exec in
kubectl run debug-myapp --image=myapp:v2.3.1 --restart=Never \
--command -- sleep 3600
kubectl exec -it debug-myapp -- /bin/sh
# Now manually run the app entrypoint and watch it fail
# Ephemeral debug container (K8s 1.23+)
kubectl debug -it payment-api-7d4f8b6-x2k9f --image=busybox:1.36 \
--target=payment-api
# Check filesystem, env, and network from the debug container
env | grep DATABASE
cat /app/config.yaml
nc -z postgres-svc 5432
Remember: Exit code mnemonic: 1 = app said "I broke," 137 = someone said "you're killed" (128 + signal 9 = SIGKILL, usually OOM), 143 = someone said "please stop" (128 + signal 15 = SIGTERM, usually liveness probe). If exit code > 128, subtract 128 to get the signal number.
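The subtraction in that mnemonic is trivial to script, and kill -l turns the signal number back into a name (name formatting varies slightly by shell):

```shell
# Decode an exit code > 128 into its signal number
code=137
sig=$((code - 128))
echo "exit $code = 128 + signal $sig"
# exit 137 = 128 + signal 9
kill -l "$sig"   # prints the signal name, e.g. KILL
```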
Liveness Probe Killing the Pod¶
# Check events for liveness probe failures
kubectl describe pod myapp-abc123 | grep -i "liveness\|unhealthy\|killing"
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Normal Killing Container myapp failed liveness probe, will be restarted
# Exec in and test the probe endpoint manually
kubectl exec -it myapp-abc123 -- curl -v http://localhost:8080/healthz
# Fix: add a startup probe for slow-starting apps
# startupProbe:
# httpGet:
# path: /healthz
# port: 8080
# failureThreshold: 30
# periodSeconds: 10
Default trap: If you do not define a startupProbe, the livenessProbe starts checking immediately. Java/Spring apps that take 30+ seconds to start get killed by the liveness probe before they finish initializing, creating a CrashLoopBackOff that looks like the app is broken. Always add a startupProbe with failureThreshold * periodSeconds >= your worst-case startup time.
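A quick sanity check for the budget that rule gives you — the startup time here is an assumed measurement, substitute your own:

```shell
# startupProbe budget = failureThreshold * periodSeconds; compare to worst-case startup
failure_threshold=30
period_seconds=10
worst_case_startup=120   # assumed measured worst-case startup, in seconds
budget=$((failure_threshold * period_seconds))
if [ "$budget" -ge "$worst_case_startup" ]; then
  echo "OK: ${budget}s budget covers ${worst_case_startup}s startup"
else
  echo "RISK: ${budget}s budget is under ${worst_case_startup}s startup"
fi
# OK: 300s budget covers 120s startup
```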
Init Container Stuck¶
# Check if the init container is blocking the main container
kubectl get pod myapp-abc123 -o jsonpath='{.status.initContainerStatuses[*].state}'
# Read init container logs
kubectl logs myapp-abc123 -c wait-for-db
# Common pattern: init container waiting for a service
# Fix the dependency or check DNS resolution
kubectl exec -it myapp-abc123 -c wait-for-db -- nslookup postgres-svc
Batch Fix: Restart a Stuck Deployment¶
# Rolling restart — creates new pods with fresh containers
kubectl rollout restart deployment/payment-api -n production
# Watch the rollout
kubectl rollout status deployment/payment-api -n production
# If the new pods also crash, rollback
kubectl rollout undo deployment/payment-api -n production
Gotcha: kubectl rollout undo reverts to the previous ReplicaSet, but if the previous version ALSO had the bug (e.g., you deployed v2 with a bug, then deployed v3 with a different bug), undo rolls back to v2, which is still broken. Use kubectl rollout history deployment/<name> and kubectl rollout undo deployment/<name> --to-revision=<N> to target a known-good revision.
Events Timeline¶
# All events for a specific pod, sorted by time
kubectl get events --sort-by='.lastTimestamp' \
--field-selector involvedObject.name=payment-api-7d4f8b6-x2k9f
# All warning events in a namespace
kubectl get events -n production --field-selector type=Warning --sort-by='.lastTimestamp'
Quick Checklist¶
# 1. What is the exit code?
kubectl get pod $POD -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# 2. What do the logs say?
kubectl logs $POD --previous
# 3. Is it OOMKilled?
kubectl describe pod $POD | grep OOMKilled
# 4. Is it a probe failure?
kubectl describe pod $POD | grep -i "unhealthy\|probe failed"
# 5. Is a config/secret missing?
kubectl describe pod $POD | grep -i "configmap\|secret\|not found"
# 6. Can the container reach its dependencies?
kubectl exec -it $POD -- nc -z db-service 5432
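Step 1 of the checklist can be wired directly to the exit-code table as a small helper. The mapping below mirrors the decode table above; treat its output as a first guess, not a verdict:

```shell
# Map a container exit code to a first diagnostic guess (per the decode table)
decode_exit() {
  case "$1" in
    1)   echo "app error - read kubectl logs --previous" ;;
    126) echo "permission denied on entrypoint binary" ;;
    127) echo "entrypoint not found - check image tag and command" ;;
    137) echo "SIGKILL - check describe for OOMKilled" ;;
    139) echo "segfault" ;;
    143) echo "SIGTERM - often a liveness probe restart" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "unknown - read the logs"
      fi
      ;;
  esac
}
decode_exit 137
# SIGKILL - check describe for OOMKilled
```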
Quick Reference¶
- Runbook: CrashLoopBackOff