- k8s
- l1
- topic-pack
- crashloopbackoff

Portal | Level: L1: Foundations | Topics: CrashLoopBackOff (alias) | Domain: Kubernetes
CrashLoopBackOff - Primer¶
Why This Matters¶
CrashLoopBackOff is the single most common Kubernetes issue you will encounter and troubleshoot. It appears in nearly every production cluster, every on-call rotation, and every Kubernetes interview. If you cannot diagnose and resolve CrashLoopBackOff, you cannot operate Kubernetes. It is the #1 troubleshooting question because it tests whether you understand the container lifecycle, know the diagnostic tools, and can reason about failure modes under pressure.
Core Concepts¶
1. What CrashLoopBackOff Actually Means¶
CrashLoopBackOff is not an error — it is a status. It means the kubelet has tried to start a container, the container ran and exited, and the kubelet is now waiting before trying again.
The backoff follows an exponential schedule:
| Restart | Delay |
|---|---|
| 1st | 10s |
| 2nd | 20s |
| 3rd | 40s |
| 4th | 80s |
| 5th | 160s |
| 6th+ | 300s (5 min cap) |
The container keeps restarting indefinitely. Kubernetes never gives up — it just slows down. The backoff resets after a container runs successfully for 10 minutes.
Container starts → crashes → 10s wait → starts → crashes → 20s wait → starts → crashes → 40s wait → ...
This is by design. The kubelet assumes the failure might be transient (a dependency not yet ready, a config that will be corrected), so it keeps retrying with increasing patience.
Remember: The CrashLoopBackOff exponential backoff sequence: 10, 20, 40, 80, 160, 300 (cap). Mnemonic: "doubles until five minutes." The backoff resets after the container runs successfully for 10 minutes. If you see a pod with 50+ restarts, it has been failing for hours and each restart attempt is 5 minutes apart.
Under the hood: CrashLoopBackOff is NOT a Kubernetes error state — it is a kubelet status. The pod's `.status.phase` is still `Running`. The container's `.state.waiting.reason` is `CrashLoopBackOff`. This distinction matters for monitoring: alerting on pod phase alone will miss CrashLoopBackOff pods. Alert on `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0`.
2. The Diagnostic Workflow¶
Step 1: See the problem
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
payment-api-7d4f8b6-x2k9f 0/1 CrashLoopBackOff 7 (2m ago) 12m
payment-api-7d4f8b6-t8m3p 0/1 CrashLoopBackOff 7 (3m ago) 12m
The restart count and age tell you how long this has been happening. 7 restarts in 12 minutes means it started crashing immediately on deploy.
Step 2: Get the details
$ kubectl describe pod payment-api-7d4f8b6-x2k9f
...
Containers:
payment-api:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 15 Mar 2026 10:22:18 +0000
Finished: Sun, 15 Mar 2026 10:22:19 +0000
Ready: False
Restart Count: 7
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 2m (x8 over 12m) kubelet Container image "payment-api:v2.3.1" already present
Normal Created 2m (x8 over 12m) kubelet Created container payment-api
Normal Started 2m (x8 over 12m) kubelet Started container payment-api
Warning BackOff 45s (x25 over 11m) kubelet Back-off restarting failed container
Key information: the exit code (1 = application error), the container ran for 1 second (started 10:22:18, finished 10:22:19), and the events show 8 pull/create/start cycles.
Step 3: Read the logs
$ kubectl logs payment-api-7d4f8b6-x2k9f --previous
Error: required environment variable DATABASE_URL is not set
Traceback (most recent call last):
File "/app/main.py", line 12, in <module>
db_url = os.environ["DATABASE_URL"]
KeyError: 'DATABASE_URL'
The --previous flag is critical — it shows logs from the last terminated container. Without it, you may get no output if the current container has not started yet.
3. Exit Code Meanings¶
Exit codes tell you the category of failure before you even look at logs:
| Exit Code | Signal | Meaning |
|---|---|---|
| 0 | — | Success. Can still cause CrashLoopBackOff, because `restartPolicy: Always` (the default) restarts even on clean exit |
| 1 | — | Generic application error. Check logs |
| 126 | — | Permission denied — the entrypoint binary exists but cannot be executed |
| 127 | — | Command not found — the entrypoint binary does not exist in the image |
| 137 | SIGKILL | OOMKilled (kernel killed the process) or manual kill -9. Check kubectl describe for OOMKilled reason |
| 139 | SIGSEGV | Segmentation fault — the process accessed invalid memory |
| 143 | SIGTERM | Graceful termination signal. Usually means the container was killed by Kubernetes (liveness probe failure, preStop timeout) |
Remember: Exit code cheat sheet: 1 = app error (check logs), 126 = permission denied (chmod), 127 = not found (check entrypoint path), 137 = OOMKilled or SIGKILL (check memory limits), 143 = SIGTERM (check liveness probe). Mnemonic: "1-bad, 127-missing, 137-killed."
Exit code 137 is the most operationally important. When you see it, immediately check:
$ kubectl describe pod <name> | grep -A 3 "Last State"
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Gotcha: Exit code 0 CAN cause CrashLoopBackOff. If the container's `restartPolicy` is `Always` (the default for Deployments), a container that exits cleanly with code 0 will be restarted. If the process exits immediately (e.g., a one-shot script in a Deployment instead of a Job), it enters CrashLoopBackOff despite "succeeding." Use a Job for one-shot tasks, or set `restartPolicy: OnFailure` for pods that should not restart on clean exit.

Debug clue: When `kubectl logs --previous` returns "container not found" or is empty, the container crashed before writing any output. Common causes: binary not found (exit 127), permission denied on entrypoint (exit 126), or immediate segfault (exit 139). In these cases, `kubectl describe pod` events and the exit code are your only clues. An "exec format error" message usually means the image was built for the wrong CPU architecture — a Linux amd64 image running on an ARM node (or vice versa).
4. Common Causes¶
Missing config or secrets:
$ kubectl describe pod myapp-abc123 | grep -A 5 "Environment"
Environment:
DATABASE_URL: <set to the key 'url' in secret 'db-credentials'> Optional: false
If the secret db-credentials does not exist, the pod fails to start entirely (CreateContainerConfigError). But if the secret exists with the wrong key name, the env var is empty and the app crashes on startup — CrashLoopBackOff.
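To check for the wrong-key-name case, list the secret's key names without printing the values (the output below is hypothetical, matching the `db-credentials` example):

```
$ kubectl describe secret db-credentials
Name:         db-credentials
Type:         Opaque

Data
====
url:  48 bytes
```

The key name shown here (`url`) must exactly match the `secretKeyRef.key` in the pod spec.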
Bad command or entrypoint:
This happens when the image tag is wrong (pulled a different version), the Dockerfile ENTRYPOINT changed, or the command override in the pod spec has a typo.
OOMKilled (exit code 137):
The container exceeded its memory limit. The kernel OOM killer terminates it immediately.
Failing liveness probe:
The container starts but the liveness probe fails, so Kubernetes kills it. The container restarts, the probe fails again, and you get CrashLoopBackOff. This is especially common when initialDelaySeconds is too short for apps with slow startup (JVM, .NET).
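As a sketch of the failure mode (path, port, and timings are illustrative, not from the original spec):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5   # too short for a JVM app that needs 60s to boot
  periodSeconds: 10
  failureThreshold: 3      # killed ~35s after start, restarts, fails again
```

With this config the app is killed before it ever becomes healthy, so the loop never breaks.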
Missing dependencies:
The app tries to connect to a database or service on startup and crashes when the connection fails. The dependency may not be ready yet, or the service name may be wrong.
Permission issues:
A non-root container trying to bind to port 80 (requires root), or trying to write to a read-only filesystem.
5. Debugging Without Logs¶
Sometimes kubectl logs --previous shows nothing — the container crashed too fast, or the issue is environmental.
If the container stays up briefly, exec into it:
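For example, using the pod from earlier (the commands you run inside are up to you):

```
$ kubectl exec -it payment-api-7d4f8b6-x2k9f -- sh
/app # env | grep DATABASE
/app # ls -l /app
```

You have a window of only a few seconds between restarts, so have the commands ready before you exec.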
Use ephemeral debug containers (Kubernetes 1.23+):
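A minimal sketch, reusing the pod and container names from the earlier example:

```
$ kubectl debug -it payment-api-7d4f8b6-x2k9f --image=busybox:1.36 --target=payment-api
```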
This attaches a debug container to the same pod, sharing the process namespace. You can inspect the filesystem, network, and environment.
Override the entrypoint to keep the container alive:
$ kubectl run debug-myapp --image=myapp:v2.3.1 --restart=Never \
--overrides='{"spec":{"containers":[{"name":"debug","image":"myapp:v2.3.1","command":["sleep","3600"]}]}}'
Now exec into debug-myapp and manually run the application entrypoint to see the error interactively.
Check events at the node level:
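For instance (the node name is a placeholder):

```
$ kubectl get events --sort-by=.lastTimestamp --field-selector involvedObject.name=payment-api-7d4f8b6-x2k9f
$ kubectl describe node worker-1   # check Conditions (e.g. MemoryPressure) and Events
```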
Interview tip: The CrashLoopBackOff diagnostic sequence that interviewers want to hear: 1) `kubectl get pods` (see the status and restart count), 2) `kubectl describe pod` (exit code and events), 3) `kubectl logs --previous` (why it crashed), 4) check the exit code meaning (1 = app error, 137 = OOM, 127 = not found). If you can walk through this in 30 seconds with the reasoning at each step, you demonstrate operational fluency.
6. Prevention¶
Startup probes for slow applications:
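A sketch of the pattern (endpoint and port are placeholders):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 attempts...
  periodSeconds: 10      # ...10s apart = a 300s startup budget
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # only starts once the startup probe succeeds
```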
This gives the app 300 seconds (30 x 10) to start before the liveness probe takes over. Without this, a JVM app that takes 60 seconds to start will be killed by the liveness probe.
Init containers for dependency checks:
initContainers:
- name: wait-for-db
image: busybox:1.36
command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo waiting; sleep 2; done']
The main container does not start until all init containers complete successfully.
Resource requests and limits:
Set requests based on normal usage, limits based on peak. For memory, monitor actual usage before setting limits — kubectl top pod or Prometheus metrics.
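For example (the values are illustrative, not recommendations):

```yaml
resources:
  requests:
    memory: "256Mi"   # based on observed normal usage
    cpu: "250m"
  limits:
    memory: "512Mi"   # peak headroom; exceeding this means exit 137 (OOMKilled)
```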
Proper signal handling (PID 1 problem):
In a container, the entrypoint runs as PID 1. If PID 1 does not handle SIGTERM, then when Kubernetes stops the pod (e.g., `kubectl delete pod` or a rolling update), the kubelet sends SIGTERM, the app ignores it, Kubernetes waits 30 seconds (the default `terminationGracePeriodSeconds`), then sends SIGKILL (exit 137). Use the exec form of `ENTRYPOINT` in the Dockerfile, or a lightweight init like tini:
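A sketch of both options (the binary path is a placeholder; the tini install line assumes an Alpine base image):

```dockerfile
# exec form: the binary runs as PID 1 and receives SIGTERM directly
ENTRYPOINT ["/app/server"]

# shell form (avoid): /bin/sh becomes PID 1 and does not forward SIGTERM
# ENTRYPOINT /app/server

# or use tini as a minimal init that forwards signals and reaps zombies
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--", "/app/server"]
```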
Graceful shutdown:
Handle SIGTERM in your application code to close connections, flush buffers, and exit cleanly. This prevents data corruption and reduces restart time.
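As a minimal sketch in shell — the "app" here is just a loop, and the backgrounded `kill` simulates the kubelet sending SIGTERM:

```shell
#!/bin/sh
# Trap SIGTERM so the process exits cleanly instead of being SIGKILLed later.
cleanup() {
  echo "SIGTERM received: closing connections, flushing buffers"
  exit 0
}
trap cleanup TERM

# Demo only: simulate Kubernetes sending SIGTERM after 1 second.
( sleep 1; kill -TERM $$ ) &

# Run 'sleep' in the background and use interruptible 'wait' so the trap
# fires immediately rather than after the current sleep finishes.
while true; do
  sleep 1 &
  wait $!
done
```

The background-sleep-plus-`wait` idiom matters: a foreground `sleep` would delay trap delivery until it returned.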
7. Distinguishing from Other Failures¶
Not every pod failure is CrashLoopBackOff. Knowing the difference saves diagnostic time:
| Status | Meaning | Key Difference |
|---|---|---|
| CrashLoopBackOff | Container starts then crashes | Container ran — check logs |
| ImagePullBackOff | Cannot pull the container image | Image name/tag wrong, registry auth failed, or registry unreachable |
| CreateContainerConfigError | Cannot configure the container | Missing ConfigMap or Secret referenced by the pod spec |
| RunContainerError | Cannot start the container process | Runtime failure — bad seccomp profile, missing device, cgroup issue |
| ErrImageNeverPull | Image not present and `imagePullPolicy: Never` | Local image missing on the node |
| Pending | Pod not scheduled | No node with enough resources, or affinity/taint mismatch |
The critical distinction: in CrashLoopBackOff, the container did start and run. In the others, it never got that far. This tells you where to look — application code and config vs infrastructure and scheduling.
# Quick triage: what phase is the pod stuck in?
$ kubectl get pod myapp-abc123 -o jsonpath='{.status.phase}'
Running # CrashLoopBackOff pods are technically still "Running" phase
$ kubectl get pod myapp-abc123 -o jsonpath='{.status.containerStatuses[0].state}'
{"waiting":{"message":"back-off 5m0s restarting failed container=myapp pod=myapp-abc123","reason":"CrashLoopBackOff"}}
Wiki Navigation¶
Related Content¶
- Crashloopbackoff Flashcards (CLI) (flashcard_deck, L1) — CrashLoopBackOff (alias)