CrashLoopBackOff Footguns¶
Mistakes that turn a simple container crash into an extended outage.
1. Forgetting --previous when reading logs¶
You run kubectl logs crashing-pod and get no output or a partial startup message. The current container just started and has not logged anything yet. The crash output is in the previous container.
What happens: You waste time chasing the wrong lead or assume the app produces no logs.
Why: Without --previous, you see the current (possibly just-spawned) container's logs, not the one that crashed.
How to avoid: Always use kubectl logs <pod> --previous for CrashLoopBackOff pods. Make it muscle memory.
Debug clue: If kubectl logs --previous returns "previous terminated container not found," the pod has restarted enough times that the previous container's logs were garbage collected. Use a log aggregator (Loki, CloudWatch) to catch crash logs before they vanish; the kubelet retains logs only for the most recent previous container, not for every historical crash.
2. Using the same endpoint for liveness and readiness¶
Your /health endpoint checks database connectivity. The database goes down for 30 seconds. Every pod fails its liveness probe and gets restarted. All pods restart simultaneously and thundering-herd the database.
What happens: A transient dependency failure cascades into a full application outage.
Why: Liveness probes trigger container restarts. If liveness checks dependencies, a dependency blip kills your entire fleet.
How to avoid: Liveness checks only "is the process alive" — a trivial 200 response. Readiness checks dependencies and controls traffic routing. Never share the same endpoint.
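A minimal sketch of the split, assuming the app serves a trivial /livez endpoint and a dependency-aware /readyz endpoint (both endpoint names and the port are illustrative):

```yaml
# Liveness: process-alive check only. Never touches dependencies,
# so a database blip cannot trigger a restart.
livenessProbe:
  httpGet:
    path: /livez        # returns 200 as long as the process is up
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
# Readiness: checks dependencies (DB, caches). Failure only removes
# the pod from Service endpoints; the container keeps running.
readinessProbe:
  httpGet:
    path: /readyz       # returns 503 while the database is unreachable
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

With this split, a 30-second database outage means pods fail readiness and stop receiving traffic, then rejoin automatically once the database returns: no restarts, no thundering herd.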
3. No startup probe on slow-starting apps¶
Your Java app takes 45 seconds to initialize. The liveness probe starts checking at 10 seconds with a 30-second failure threshold. The app gets killed at 40 seconds, restarts, gets killed again. Permanent CrashLoopBackOff.
What happens: The app never finishes booting because liveness kills it mid-startup.
Why: Without a startup probe, liveness begins immediately. Slow apps cannot pass in time.
How to avoid: Add a startup probe with a generous failureThreshold * periodSeconds budget (e.g., 30 x 10 = 300 seconds). Liveness and readiness are suppressed until startup passes.
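A sketch of the startup-probe pattern for the 45-second Java app above (endpoint name and port are illustrative):

```yaml
# Startup probe gets a generous budget: 30 failures x 10s = 300 seconds.
# Liveness and readiness are suppressed until this probe succeeds once.
startupProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # strict again once startup has passed
```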
4. Setting memory limits too low¶
You set limits.memory: 128Mi for a Node.js app. It starts, loads a few requests, allocates memory for a large payload, and gets OOMKilled (exit code 137). Kubernetes restarts it. It loads, gets killed, restarts.
What happens: Persistent CrashLoopBackOff with OOMKilled reason, every few minutes.
Why: The memory limit is a hard ceiling enforced by the kernel. No negotiation.
How to avoid: Profile actual memory usage under realistic load with kubectl top pod before setting limits. Set limits at 1.5-2x observed peak usage.
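For example, if kubectl top pod shows a peak of roughly 180Mi under realistic load (the figures below are illustrative), the limit lands at about 2x that peak:

```yaml
# Observed peak under realistic load: ~180Mi.
# Request sits near steady-state usage; limit at ~2x observed peak.
resources:
  requests:
    memory: "192Mi"
  limits:
    memory: "384Mi"   # hard ceiling; exceeding it means OOMKill (exit code 137)
```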
5. Wrong image tag deployed¶
You push myapp:v2.3.1 but the Dockerfile changed the entrypoint path from /app/server to /usr/local/bin/server. The old deployment spec still references the old path via a command override. Exit code 127.
What happens: Container starts and immediately exits with "command not found." CrashLoopBackOff within seconds.
Why: The image changed but the pod spec was not updated to match. Or the tag points to a completely different image than expected.
How to avoid: Pin images by digest in production. Use imagePullPolicy: Always with mutable tags only in dev. Validate the entrypoint with docker run --rm myapp:v2.3.1 --help before deploying.
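A digest-pinned spec might look like the fragment below. The registry name and digest are placeholders; the real digest can be read from the image's RepoDigests field (e.g., docker inspect --format='{{index .RepoDigests 0}}' myapp:v2.3.1):

```yaml
# Pinning by digest means the running image cannot change underneath
# a mutable tag like v2.3.1.
containers:
  - name: myapp
    image: registry.example.com/myapp@sha256:<digest-from-docker-inspect>
```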
6. Secret exists but key name is wrong¶
The secret db-credentials exists but has key database-url instead of url. The pod spec references key url with optional: true. The env var is never set, the app reads an empty value, and it crashes on startup.
What happens: No CreateContainerConfigError (the secret exists and the missing key is marked optional), but the app gets an empty string and crashes. Hard to spot without reading the describe output carefully.
Why: With optional: true, Kubernetes silently skips a missing secret key instead of failing pod creation. With the default optional: false, the missing key surfaces as CreateContainerConfigError, which is far easier to diagnose.
How to avoid: Check kubectl get secret db-credentials -o jsonpath='{.data}' and verify the key names match the pod spec. Leave optional at its default (false) so a missing key fails loudly instead of silently producing an empty value.
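The scenario above, wired up correctly, might look like this (DATABASE_URL is an assumed variable name):

```yaml
env:
  - name: DATABASE_URL            # illustrative variable name
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: database-url         # must exactly match a key shown by
                                  # kubectl get secret db-credentials -o jsonpath='{.data}'
```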
7. Ignoring the backoff timer¶
You fix the root cause (e.g., create the missing secret) but the pod is in a 5-minute backoff and will not restart for another 4 minutes. You wait, confused, thinking the fix did not work.
What happens: Wasted minutes staring at the pod. The fix is correct but the backoff timer has not expired.
Why: CrashLoopBackOff exponential backoff caps at 5 minutes. The kubelet will not retry until the timer expires.
How to avoid: After fixing the root cause, delete the pod to force an immediate restart: kubectl delete pod <name>. The ReplicaSet or Deployment will create a fresh one with no backoff.
Under the hood: The backoff sequence is 10s, 20s, 40s, 80s, 160s, capping at 300s (5 minutes). The timer resets to 10s only after the container runs successfully for 10 minutes (the backoff reset window). If your app crashes at the 9-minute mark every time, the backoff never fully resets.
8. Image runs as root when the spec requires non-root¶
Your pod spec sets securityContext.runAsNonRoot: true but the container image runs as root. The container fails to start with a security context error.
What happens: The container is rejected by the kubelet, not by the app. Logs may be empty.
Why: The security context mismatch prevents the container from even starting its entrypoint.
How to avoid: Ensure the Dockerfile has USER <non-root-user> and the pod spec security context matches. Check kubectl describe pod for security-related error messages.
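A sketch of keeping the two sides in agreement, assuming the image ships a non-root user with UID 1000 (the UID is illustrative):

```yaml
# Pod-spec side. Stating the UID explicitly lets the kubelet verify
# non-root even when the Dockerfile declares a named (non-numeric) user.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000   # must match the USER declared in the Dockerfile
```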
9. Not checking all containers in a multi-container pod¶
Your sidecar container is the one crashing, not the main app. You spend 30 minutes debugging the app container's logs and find nothing wrong.
What happens: Wasted investigation time looking at the wrong container.
Why: kubectl logs <pod> defaults to the first container (or to the one named in the kubectl.kubernetes.io/default-container annotation), printing only a brief "Defaulted container" notice. CrashLoopBackOff shows at the pod level without indicating which container is failing.
How to avoid: Always check kubectl describe pod to see which container has the restart count. Then kubectl logs <pod> -c <container-name> --previous.
10. Deploying without resource requests¶
No resources.requests set. The pod gets scheduled to an overcommitted node. Under memory pressure, the kernel's OOM killer targets your BestEffort pod first. Random restarts that look nondeterministic.
What happens: Intermittent CrashLoopBackOff that only happens under cluster-wide memory pressure.
Why: Without requests, the pod gets BestEffort QoS and is first in line for eviction.
How to avoid: Always set both requests and limits. Requests determine scheduling and QoS class. Even a rough estimate is better than nothing.
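A rough-estimate starting point might look like this (the figures are illustrative; refine them with kubectl top pod):

```yaml
# Setting requests moves the pod out of BestEffort QoS, so it is no
# longer the OOM killer's first target under node memory pressure.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"   # requests == limits for all resources would give
                      # Guaranteed QoS instead of Burstable
```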