CrashLoopBackOff - Primer

Why This Matters

CrashLoopBackOff is the single most common Kubernetes issue you will encounter and troubleshoot. It appears in nearly every production cluster, every on-call rotation, and every Kubernetes interview. If you cannot diagnose and resolve CrashLoopBackOff, you cannot operate Kubernetes. It is the #1 troubleshooting question because it tests whether you understand the container lifecycle, know the diagnostic tools, and can reason about failure modes under pressure.

Core Concepts

1. What CrashLoopBackOff Actually Means

CrashLoopBackOff is not an error — it is a status. It means the kubelet has tried to start a container, the container ran and exited, and the kubelet is now waiting before trying again.

The backoff follows an exponential schedule:

Restart   Delay
1st       10s
2nd       20s
3rd       40s
4th       80s
5th       160s
6th+      300s (5 min cap)

The container keeps restarting indefinitely. Kubernetes never gives up — it just slows down. The backoff resets after a container runs successfully for 10 minutes.

Container starts → crashes → 10s wait → starts → crashes → 20s wait → starts → crashes → 40s wait → ...

This is by design. The kubelet assumes the failure might be transient (a dependency not yet ready, a config that will be corrected), so it keeps retrying with increasing patience.

Remember: The CrashLoopBackOff exponential backoff sequence: 10, 20, 40, 80, 160, 300 (cap). Mnemonic: "doubles until five minutes." The backoff resets after the container runs successfully for 10 minutes. If you see a pod with 50+ restarts, it has been failing for hours and each restart attempt is 5 minutes apart.
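The schedule can be reproduced with a few lines of shell — a sketch of the kubelet's doubling logic, not the kubelet's actual code:

```shell
# Approximate the kubelet's restart backoff: start at 10s, double, cap at 300s.
delay=10
total=0
for attempt in 1 2 3 4 5 6 7; do
  total=$((total + delay))
  echo "restart #${attempt}: wait ${delay}s (cumulative ${total}s)"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

After the sixth attempt the delay pins at 300s, which is why long-running crash loops accumulate roughly one restart every five minutes.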

Under the hood: CrashLoopBackOff is NOT a Kubernetes error state — it is a kubelet status. The pod's .status.phase is still Running. The container's .state.waiting.reason is CrashLoopBackOff. This distinction matters for monitoring: alerting on pod phase alone will miss CrashLoopBackOff pods. Alert on kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0.
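As an example, a Prometheus alerting rule on that kube-state-metrics series might look like the following sketch (the rule name, `for` duration, and severity label are illustrative choices, not a standard):

```yaml
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodCrashLooping    # illustrative name
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 5m                   # avoid paging on a single transient crash
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
```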

2. The Diagnostic Workflow

Step 1: See the problem

$ kubectl get pods
NAME                        READY   STATUS             RESTARTS      AGE
payment-api-7d4f8b6-x2k9f  0/1     CrashLoopBackOff   7 (2m ago)    12m
payment-api-7d4f8b6-t8m3p  0/1     CrashLoopBackOff   7 (3m ago)    12m

The restart count and age tell you how long this has been happening. 7 restarts in 12 minutes means it started crashing immediately on deploy.
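In a busy cluster you can filter the listing down to crash-looping pods with standard text tools. Here the `kubectl get pods` output is inlined as sample data so the filter can be shown self-contained; against a live cluster you would pipe `kubectl get pods` directly into the same awk expression:

```shell
# Sample `kubectl get pods` output standing in for the live command
listing='NAME                        READY   STATUS             RESTARTS      AGE
payment-api-7d4f8b6-x2k9f   0/1     CrashLoopBackOff   7 (2m ago)    12m
web-6c9d5f7b4-q1w2e         1/1     Running            0             3d'

# Keep only crash-looping pods; print pod name and restart count
crashing=$(printf '%s\n' "$listing" | awk '$3 == "CrashLoopBackOff" {print $1, $4}')
echo "$crashing"
```

Note that with `kubectl get pods -A` a NAMESPACE column is prepended, shifting STATUS to `$4`.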

Step 2: Get the details

$ kubectl describe pod payment-api-7d4f8b6-x2k9f
...
Containers:
  payment-api:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Mar 2026 10:22:18 +0000
      Finished:     Sun, 15 Mar 2026 10:22:19 +0000
    Ready:          False
    Restart Count:  7
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Pulled     2m (x8 over 12m)   kubelet            Container image "payment-api:v2.3.1" already present
  Normal   Created    2m (x8 over 12m)   kubelet            Created container payment-api
  Normal   Started    2m (x8 over 12m)   kubelet            Started container payment-api
  Warning  BackOff    45s (x25 over 11m) kubelet            Back-off restarting failed container

Key information: the exit code (1 = application error), the container ran for 1 second (started 10:22:18, finished 10:22:19), and the events show 8 pull/create/start cycles.

Step 3: Read the logs

$ kubectl logs payment-api-7d4f8b6-x2k9f --previous
Error: required environment variable DATABASE_URL is not set
Traceback (most recent call last):
  File "/app/main.py", line 12, in <module>
    db_url = os.environ["DATABASE_URL"]
KeyError: 'DATABASE_URL'

The --previous flag is critical — it shows logs from the last terminated container. Without it, you may get no output if the current container has not started yet.

3. Exit Code Meanings

Exit codes tell you the category of failure before you even look at logs:

Exit Code  Signal   Meaning
0          -        Success. Should not cause CrashLoopBackOff unless restartPolicy: Always (the default) restarts even on clean exit
1          -        Generic application error. Check logs
126        -        Permission denied: the entrypoint binary exists but cannot be executed
127        -        Command not found: the entrypoint binary does not exist in the image
137        SIGKILL  OOMKilled (the kernel killed the process) or a manual kill -9. Check kubectl describe for an OOMKilled reason
139        SIGSEGV  Segmentation fault: the process accessed invalid memory
143        SIGTERM  Graceful termination signal. Usually means the container was killed by Kubernetes (liveness probe failure, preStop timeout)

Remember: Exit code cheat sheet: 1 = app error (check logs), 126 = permission denied (chmod), 127 = not found (check entrypoint path), 137 = OOMKilled or SIGKILL (check memory limits), 143 = SIGTERM (check liveness probe). Mnemonic: "1-bad, 127-missing, 137-killed."
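Codes above 128 follow the POSIX convention exit = 128 + signal number, which is where 137 (128 + 9, SIGKILL) and 143 (128 + 15, SIGTERM) come from. A tiny helper makes the arithmetic explicit:

```shell
# Decode a container exit code: values above 128 mean "killed by signal (code - 128)"
decode_exit() {
  if [ "$1" -gt 128 ]; then
    echo "exit $1 = killed by signal $(($1 - 128))"
  else
    echo "exit $1 = application exit status"
  fi
}

decode_exit 137   # signal 9  (SIGKILL, e.g. OOMKilled)
decode_exit 143   # signal 15 (SIGTERM)
decode_exit 1     # ordinary application error
```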

Exit code 137 is the most operationally important. When you see it, immediately check:

$ kubectl describe pod <name> | grep -A 3 "Last State"
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

Gotcha: Exit code 0 CAN cause CrashLoopBackOff. If the container's restartPolicy is Always (the default for Deployments), a container that exits cleanly with code 0 will be restarted. If the process exits immediately (e.g., a one-shot script in a Deployment instead of a Job), it enters CrashLoopBackOff despite "succeeding." Use a Job for one-shot tasks, or set restartPolicy: OnFailure for pods that should not restart on clean exit.
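A one-shot task belongs in a Job, where a clean exit is terminal rather than a restart trigger. A minimal sketch (the name, image tag, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate               # illustrative name
spec:
  backoffLimit: 3                # retry failed runs at most 3 times, then mark the Job failed
  template:
    spec:
      restartPolicy: OnFailure   # exit 0 ends the pod; only failures restart
      containers:
        - name: migrate
          image: myapp:v2.3.1
          command: ["python", "migrate.py"]   # illustrative one-shot command
```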

Debug clue: When kubectl logs --previous returns "container not found" or is empty, the container crashed before writing any output. Common causes: binary not found (exit 127), permission denied on entrypoint (exit 126), or immediate segfault (exit 139). In these cases, kubectl describe pod events and the exit code are your only clues. An "exec format error" message usually means a linux/amd64 image running on an ARM node (or vice versa).

4. Common Causes

Missing config or secrets:

$ kubectl describe pod myapp-abc123 | grep -A 5 "Environment"
    Environment:
      DATABASE_URL:  <set to the key 'url' in secret 'db-credentials'>  Optional: false

If the secret db-credentials does not exist, the pod fails to start entirely (CreateContainerConfigError). But if the secret exists with the wrong key name, the env var is empty and the app crashes on startup — CrashLoopBackOff.
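Secret values are base64-encoded in the API object, so verifying a key means decoding it. The round trip below is self-contained so it can be run anywhere; against a live cluster the equivalent check would be `kubectl get secret db-credentials -o jsonpath='{.data.url}' | base64 -d`, using the secret and key names from the example above:

```shell
# Round-trip a sample value the way Secret data is stored (base64, not encryption)
value='postgres://db:5432/app'              # illustrative connection string
encoded=$(printf '%s' "$value" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```

If the decoded value is empty or malformed, the app will typically crash on startup exactly as in the DATABASE_URL example earlier.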

Bad command or entrypoint:

$ kubectl logs myapp-abc123 --previous
exec /app/server: no such file or directory

This happens when the image tag is wrong (pulled a different version), the Dockerfile ENTRYPOINT changed, or the command override in the pod spec has a typo.

OOMKilled (exit code 137):

The container exceeded its memory limit. The kernel OOM killer terminates it immediately.

resources:
  limits:
    memory: "128Mi"   # Too low for a Java app

Failing liveness probe:

The container starts but the liveness probe fails, so Kubernetes kills it. The container restarts, the probe fails again, and you get CrashLoopBackOff. This is especially common when initialDelaySeconds is too short for apps with slow startup (JVM, .NET).

Missing dependencies:

The app tries to connect to a database or service on startup and crashes when the connection fails. The dependency may not be ready yet, or the service name may be wrong.

Permission issues:

A non-root container trying to bind to port 80 (requires root), or trying to write to a read-only filesystem.
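Both failure modes can be made explicit in the pod spec. A sketch of a container-level securityContext, with an emptyDir mounted for the one path the app writes to (container name, image, port, and paths are illustrative):

```yaml
containers:
  - name: myapp
    image: myapp:v2.3.1
    ports:
      - containerPort: 8080      # bind above 1024 so root is not required
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp          # illustrative writable path
volumes:
  - name: tmp
    emptyDir: {}
```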

5. Debugging Without Logs

Sometimes kubectl logs --previous shows nothing — the container crashed too fast, or the issue is environmental.

If the container stays up briefly, exec into it:

$ kubectl exec -it myapp-abc123 -- /bin/sh

Use ephemeral debug containers (Kubernetes 1.23+):

$ kubectl debug -it myapp-abc123 --image=busybox:1.36 --target=myapp

This attaches a debug container to the same pod, sharing the process namespace. You can inspect the filesystem, network, and environment.

Override the entrypoint to keep the container alive:

$ kubectl run debug-myapp --image=myapp:v2.3.1 --restart=Never \
    --overrides='{"spec":{"containers":[{"name":"debug","image":"myapp:v2.3.1","command":["sleep","3600"]}]}}'

Now exec into debug-myapp and manually run the application entrypoint to see the error interactively.

Check events at the node level:

$ kubectl get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=myapp-abc123

Interview tip: The CrashLoopBackOff diagnostic sequence that interviewers want to hear: 1) kubectl get pods (see the status and restart count), 2) kubectl describe pod (exit code and events), 3) kubectl logs --previous (why it crashed), 4) check exit code meaning (1=app error, 137=OOM, 127=not found). If you can walk through this in 30 seconds with the reasoning at each step, you demonstrate operational fluency.

6. Prevention

Startup probes for slow applications:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

This gives the app 300 seconds (30 x 10) to start before the liveness probe takes over. Without this, a JVM app that takes 60 seconds to start will be killed by the liveness probe.

Init containers for dependency checks:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo waiting; sleep 2; done']

The main container does not start until all init containers complete successfully.

Resource requests and limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Set requests based on normal usage, limits based on peak. For memory, monitor actual usage before setting limits — kubectl top pod or Prometheus metrics.

Proper signal handling (PID 1 problem):

In a container, the entrypoint runs as PID 1. If PID 1 does not handle SIGTERM, deleting the pod (kubectl delete pod) sends SIGTERM, the app ignores it, Kubernetes waits out the termination grace period (30 seconds by default), then sends SIGKILL (exit 137). Use the exec form of ENTRYPOINT in the Dockerfile, or a lightweight init like tini:

ENTRYPOINT ["tini", "--", "python", "app.py"]
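The effect can be demonstrated in plain shell. This sketch forwards SIGTERM to a child process the way an init would, with a sleep standing in for the real server:

```shell
#!/bin/sh
# Forward SIGTERM to the child, as an init like tini would; without the trap,
# a shell running as PID 1 ignores SIGTERM and the pod waits out the grace period.
trap 'kill -TERM "$child" 2>/dev/null; forwarded=1' TERM

sleep 30 &               # stand-in for the real server process
child=$!

kill -TERM $$            # simulate Kubernetes sending SIGTERM to PID 1
wait "$child" || true    # returns once the child is gone
echo "SIGTERM forwarded: ${forwarded:-0}"
```

Without the trap, the sleep would run its full 30 seconds, just as an unhandled SIGTERM leaves a pod hanging until the SIGKILL.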

Graceful shutdown:

Handle SIGTERM in your application code to close connections, flush buffers, and exit cleanly. This prevents data corruption and reduces restart time.

7. Distinguishing from Other Failures

Not every pod failure is CrashLoopBackOff. Knowing the difference saves diagnostic time:

Status                      Meaning                               Key Difference
CrashLoopBackOff            Container starts, then crashes        Container ran; check logs
ImagePullBackOff            Cannot pull the container image       Image name/tag wrong, registry auth failed, or registry unreachable
CreateContainerConfigError  Cannot configure the container        Missing ConfigMap or Secret referenced by the pod spec
RunContainerError           Cannot start the container process    Runtime failure: bad seccomp profile, missing device, cgroup issue
ErrImageNeverPull           Image absent, imagePullPolicy: Never  Local image missing on the node
Pending                     Pod not scheduled                     No node with enough resources, or affinity/taint mismatch

The critical distinction: in CrashLoopBackOff, the container did start and run. In the others, it never got that far. This tells you where to look — application code and config vs infrastructure and scheduling.

# Quick triage: what phase is the pod stuck in?
$ kubectl get pod myapp-abc123 -o jsonpath='{.status.phase}'
Running   # CrashLoopBackOff pods are technically still "Running" phase

$ kubectl get pod myapp-abc123 -o jsonpath='{.status.containerStatuses[0].state}'
{"waiting":{"message":"back-off 5m0s restarting failed container=myapp pod=myapp-abc123","reason":"CrashLoopBackOff"}}
