Incident Replay: CrashLoopBackOff with No Logs¶
Setup¶
- System context: Production microservice deployment in Kubernetes. A new version was deployed and pods are in CrashLoopBackOff. The container produces no logs before crashing.
- Time: Monday 09:45 UTC
- Your role: On-call SRE / application platform engineer
Round 1: Alert Fires¶
[Pressure cue: "Deploy of payment-service v2.4.1 failed. All 3 replicas in CrashLoopBackOff. Previous version was auto-rolled-back but the deployment is blocking the pipeline. Team needs the fix."]
What you see:
kubectl get pods shows 3 pods in CrashLoopBackOff with 5 restarts. kubectl logs payment-service-xxxxx returns empty — no output at all. kubectl logs --previous payment-service-xxxxx also returns empty.
Choose your action:
- A) Check the pod events with kubectl describe pod
- B) Exec into the container to debug interactively
- C) Check the container image for the correct entrypoint
- D) Increase the restart backoff limit and wait for logs to appear
If you chose A (recommended):¶
[Result: kubectl describe pod shows events: "Started container payment-service" followed immediately by "Back-off restarting failed container." Last state: Terminated, Exit Code: 137 (SIGKILL), Reason: OOMKilled. The container is being OOM-killed before it can produce any output. Proceed to Round 2.]
If you chose B:¶
[Result: You cannot exec into a CrashLoopBackOff container — it exits before you can attach, so there is nothing to inspect interactively. Need a different approach.]
If you chose C:¶
[Result: Image entrypoint is correct (java -jar app.jar). The binary exists and is executable. Not an entrypoint issue.]
If you chose D:¶
[Result: More restarts produce the same result — no logs, immediate crash. Waiting does not help.]
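When a container dies before it can log anything, the pod status itself is the only data source, which is why option A wins this round. The same triage can be scripted; the sketch below parses a hand-written sample of `kubectl get pod -o json` output (the field names follow the Kubernetes pod status schema; the sample values mirror this incident and are illustrative, not captured from a real cluster):

```python
import json

# Illustrative excerpt of: kubectl get pod payment-service-xxxxx -o json
# Only the fields relevant to crash triage are included.
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "payment-service",
        "restartCount": 5,
        "lastState": {
          "terminated": {"exitCode": 137, "reason": "OOMKilled"}
        }
      }
    ]
  }
}
""")

def crash_reason(pod):
    """Return (container, exitCode, reason) for each container's last termination."""
    return [
        (cs["name"], term["exitCode"], term["reason"])
        for cs in pod["status"]["containerStatuses"]
        if (term := cs.get("lastState", {}).get("terminated"))
    ]

print(crash_reason(sample))  # → [('payment-service', 137, 'OOMKilled')]
```

Exit code 137 is 128 + 9 (SIGKILL); combined with reason OOMKilled it points straight at the memory limit, with no application logs required.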
Round 2: First Triage Data¶
[Pressure cue: "Payment processing is on the old version. The new version has a critical bug fix the business needs."]
What you see: The pod is OOMKilled with exit code 137. Container memory limit is 512Mi. The v2.4.1 release notes mention "added new caching layer for improved performance." The Java heap is not explicitly set, defaulting to container-aware JVM ergonomics.
Choose your action:
- A) Increase the container memory limit to 1Gi
- B) Set explicit JVM heap flags: -Xmx384m -Xms256m
- C) Remove the memory limit to let the container use as much as it needs
- D) Check what changed in v2.4.1 that increased memory usage
If you chose D (recommended):¶
[Result: The diff between v2.4.0 and v2.4.1 shows a new in-memory cache (Caffeine) with no max-size configured. It loads the entire product catalog (~800MB) into memory on startup. The JVM tries to allocate 800MB + overhead, exceeding the 512Mi limit. Proceed to Round 3.]
If you chose A:¶
[Result: 1Gi might work, but you would only be masking the symptom. If the cache grows with the catalog, it will OOM again at the new limit. The cache config is the real fix.]
If you chose B:¶
[Result: Setting a 384m heap does not help — the cache needs ~800MB. The heap would be too small and the app would die with a java.lang.OutOfMemoryError inside the JVM instead of a kernel OOM kill. Still a crash, just a different one.]
If you chose C:¶
[Result: Removing memory limits is dangerous — one pod can starve the entire node. Never do this in production.]
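The root cause found in option D — a cache with no size bound — is a general failure class, not a Caffeine quirk. As an illustration (a Python sketch; the actual fix is Caffeine's `maximumSize`, applied in Round 3), an unbounded memoization cache grows with every distinct key, while a bounded one evicts:

```python
from functools import lru_cache

# Illustration of the bug class. maxsize=None mirrors the unbounded v2.4.1
# cache; maxsize=5000 mirrors the maximumSize(5000) bound chosen in Round 3.
@lru_cache(maxsize=None)
def load_product_unbounded(pid):
    return {"id": pid}

@lru_cache(maxsize=5000)
def load_product_bounded(pid):
    return {"id": pid}

# Simulate a catalog scan touching 10,000 distinct products.
for pid in range(10_000):
    load_product_unbounded(pid)
    load_product_bounded(pid)

print(load_product_unbounded.cache_info().currsize)  # → 10000
print(load_product_bounded.cache_info().currsize)    # → 5000
```

The unbounded cache holds every entry it has ever seen; memory use is a function of data size, not a design decision — exactly how an 800MB catalog ends up inside a 512Mi container.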
Round 3: Root Cause Identification¶
[Pressure cue: "Found the unbounded cache. Fix it."]
What you see:
Root cause: New Caffeine cache has no maximumSize or maximumWeight configured. It attempts to load the full product catalog into memory on startup. The container's 512Mi limit is exceeded before the JVM even finishes startup, hence no logs.
Choose your action:
- A) Add maximumSize(10000) to the cache config and redeploy
- B) Increase memory limit to 2Gi and add cache bounds
- C) Add cache bounds and increase memory limit to 768Mi for headroom
- D) Revert to v2.4.0 while the cache issue is fixed
If you chose C (recommended):¶
[Result: Cache configured with maximumSize(5000) (estimated ~200MB). Memory limit increased to 768Mi for headroom. JVM flags set to -Xmx512m. Deploy succeeds — pods Running, no OOM. Proceed to Round 4.]
If you chose A:¶
[Result: Cache bounds help but 512Mi may still be tight with the bounded cache + JVM overhead. Slight risk.]
If you chose B:¶
[Result: 2Gi is excessive for this service. Wastes cluster resources.]
If you chose D:¶
[Result: Safe but the bug fix in v2.4.1 is needed by the business. A config fix is faster than a rollback + new release.]
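The sizing behind option C can be sanity-checked with back-of-envelope arithmetic. The off-heap overhead figure below (metaspace, thread stacks, code cache, GC structures) is an assumed estimate, not a measured value:

```python
# Back-of-envelope check of the option-C sizing.
OLD_LIMIT_MIB = 512
NEW_LIMIT_MIB = 768
UNBOUNDED_CACHE_MIB = 800   # full catalog, as found in Round 2
HEAP_MIB = 512              # -Xmx512m; the ~200 MiB bounded cache lives in heap
OFFHEAP_OVERHEAD_MIB = 200  # assumed JVM native overhead, not measured

def fits(limit_mib, needed_mib, headroom_pct=0.05):
    """True when the working set fits under the limit with headroom to spare."""
    return needed_mib <= limit_mib * (1 - headroom_pct)

# v2.4.1 as shipped: the cache alone exceeds the old limit -> kernel OOM kill.
print(fits(OLD_LIMIT_MIB, UNBOUNDED_CACHE_MIB))              # → False
# Option C: bounded cache inside a 512 MiB heap plus overhead, under 768Mi.
print(fits(NEW_LIMIT_MIB, HEAP_MIB + OFFHEAP_OVERHEAD_MIB))  # → True
```

This is also why option A carries "slight risk": 512 MiB of heap plus the same overhead would not clear the original 512Mi limit.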
Round 4: Remediation¶
[Pressure cue: "Payment service v2.4.1 is running. Verify and close."]
Actions:
1. Verify pods are stable: kubectl get pods — Running, 0 restarts for 10+ minutes
2. Verify memory usage: kubectl top pod — well within limits
3. Verify payment processing works end-to-end
4. Add a CI check that rejects unbounded caches in code review
5. Add memory usage alerting at 80% of container limit
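Steps 1 and 2 above can be scripted for the close-out. The sketch below checks readiness and restart counts against a hand-written sample of `kubectl get pods -o json` output (field names follow the Kubernetes pod status schema; pod names and values are illustrative):

```python
import json

# Illustrative excerpt of: kubectl get pods -l app=payment-service -o json
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "payment-service-a"},
     "status": {"containerStatuses": [{"ready": true, "restartCount": 0}]}},
    {"metadata": {"name": "payment-service-b"},
     "status": {"containerStatuses": [{"ready": true, "restartCount": 0}]}}
  ]
}
""")

def all_stable(pod_list):
    """True when every container in every pod is ready with zero restarts."""
    return all(
        cs["ready"] and cs["restartCount"] == 0
        for pod in pod_list["items"]
        for cs in pod["status"]["containerStatuses"]
    )

print(all_stable(sample))  # → True
```

Run the same check again after the 10-minute soak; a nonzero restart count anywhere means the incident is not closed.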
Damage Report¶
- Total downtime: 0 (auto-rollback kept old version running)
- Blast radius: Deployment pipeline blocked for 2 hours; new bug fix delayed
- Optimal resolution time: 20 minutes (describe pod -> identify OOM -> find cache -> fix config -> redeploy)
- If every wrong choice was made: 4+ hours including debugging without logs and removing memory limits
Cross-References¶
- Primer: CrashLoopBackOff
- Primer: OOMKilled
- Primer: Kubernetes Ops
- Footguns: Kubernetes Ops