# Anti-Primer: CrashLoopBackOff
Everything that can go wrong, will — and in this story, it does.
## The Setup
A developer deploys a new version of the authentication service at 5 PM Friday. The pods immediately go into CrashLoopBackOff. The developer panics and starts making rapid changes to 'fix' it without understanding the root cause.
## The Timeline
### Hour 0: Deploying Without Checking Logs
The engineer sees CrashLoopBackOff and immediately rolls back without reading the crash logs. The deadline was looming, and rolling back seemed like the fastest path forward. But the rollback masks the real issue: a missing environment variable that will break again on the next deploy.

Footgun #1: Deploying Without Checking Logs — rolling back on sight of CrashLoopBackOff without reading the crash logs, so the rollback masks the real issue: a missing environment variable that will break again on the next deploy.
Nobody notices yet. The engineer moves on to the next task.
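The primer's fix is a one-liner. A minimal sketch, assuming the service runs as a Deployment named `auth-service` with an `app=auth-service` label (hypothetical names):

```shell
# Logs from the *previous* (crashed) container, not the current restart:
kubectl logs deploy/auth-service --previous

# If the container dies too fast to exec into, the pod's events often name
# the missing environment variable or failed mount directly:
kubectl describe pod -l app=auth-service
```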
### Hour 1: Misreading the Backoff Timer
The engineer thinks the pod is 'stuck' and keeps deleting it. Under time pressure, the team chose speed over caution. But each delete restarts the backoff timer from scratch; the pod never runs long enough to emit the useful log output.

Footgun #2: Misreading the Backoff Timer — repeatedly deleting a crash-looping pod resets the backoff from scratch, so the pod never runs long enough to emit the useful log output.
The first mistake is still invisible, making the next shortcut feel justified.
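What the deletes were fighting can be worked out by hand: the kubelet doubles the restart delay from 10 seconds up to a 5-minute cap, and only resets it after the container runs cleanly for a sustained period. A sketch of the delay schedule:

```shell
# CrashLoopBackOff delay doubles per restart: 10s, 20s, 40s, ... capped at 300s.
# Deleting the pod resets this schedule to the beginning.
delay=10
for restart in 1 2 3 4 5 6; do
  echo "restart ${restart}: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

Deleting the pod every few minutes therefore trades one stable, inspectable backed-off pod for an endless series of 10-second crash windows.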
### Hour 2: Wrong Image Tag
The engineer pushed code to main but forgot to build and push the container image. Nobody pushed back, because the shortcut looked harmless in the moment. But the pod pulls the old image and crashes, because the new ConfigMap expects the new code.

Footgun #3: Wrong Image Tag — pushing code to main without building and pushing the container image, so the pod pulls the old image and crashes against a ConfigMap that expects the new code.
Pressure is mounting. The team is behind schedule and cutting more corners.
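The atomic build-push-deploy sequence the postmortem calls for fits in a few lines. A hedged sketch, assuming a registry at `registry.example.com` and a Deployment and container both named `auth-service` (hypothetical names):

```shell
# Tag the image with the commit being deployed, so image and code cannot drift:
TAG=$(git rev-parse --short HEAD)
docker build -t registry.example.com/auth-service:"$TAG" .
docker push registry.example.com/auth-service:"$TAG"
kubectl set image deployment/auth-service auth-service=registry.example.com/auth-service:"$TAG"

# Verify what actually got pulled (imageID includes the registry digest):
kubectl get pods -l app=auth-service \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}'
```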
### Hour 3: OOM Without Limits
The app has a memory leak and no memory limits set, so the pod is OOMKilled by the node's kernel. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result is CrashLoopBackOff with exit code 137, and the engineer debugs the wrong thing (app logic instead of memory).

Footgun #4: OOM Without Limits — a memory leak with no memory limits set, so the kernel OOMKills the pod; CrashLoopBackOff with exit code 137 sends the engineer off debugging app logic instead of memory.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
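Exit code 137 is not an application error code: it is 128 + 9, i.e. the process died from SIGKILL, the signal the kernel's OOM killer sends. The signature can be reproduced locally:

```shell
# 137 = 128 + 9: the process was SIGKILLed, as an OOMKilled container is.
sh -c 'kill -9 $$'
echo "exit code: $?"   # prints "exit code: 137"
```

In a cluster, `kubectl describe pod` typically surfaces the same story in the container's last state (`Reason: OOMKilled`, `Exit Code: 137`), which is the cue to look at memory, not app logic.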
## The Postmortem
### Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Deploying Without Checking Logs | Rollback masks the real issue: a missing environment variable that will break again on next deploy | Primer: Always check `kubectl logs --previous` before taking any action |
| 2 | Misreading the Backoff Timer | Each delete restarts the backoff from scratch; the pod never runs long enough to emit the useful log output | Primer: Let the backoff complete; use the `--previous` flag to see logs from the last crash |
| 3 | Wrong Image Tag | Pod pulls the old image; crashes because the new ConfigMap expects the new code | Primer: CI pipeline should build, push, and deploy as an atomic sequence |
| 4 | OOM Without Limits | CrashLoopBackOff with exit code 137; engineer debugs the wrong thing (app logic instead of memory) | Primer: Set memory limits; check exit codes (137 = OOMKilled, 1 = app error) |
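The fix in row 4 can be applied without editing manifests. A minimal sketch, assuming a Deployment named `auth-service` (hypothetical) and illustrative sizes:

```shell
# Requests guide scheduling; limits make the OOM kill deterministic and
# attributable to this container rather than to node-level memory pressure.
kubectl set resources deployment/auth-service \
  --requests=memory=256Mi --limits=memory=512Mi

# Confirm the change landed:
kubectl get deployment auth-service \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```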
### Damage Report
- Downtime: 2-4 hours of pod-level or cluster-wide disruption
- Data loss: Risk of volume data loss if StatefulSets were affected
- Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
- Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
- Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification
## What the Primer Teaches
- Footgun #1: If the engineer had read the primer's section on deploying without checking logs, they would have learned: always check `kubectl logs --previous` before taking any action.
- Footgun #2: If the engineer had read the primer's section on misreading the backoff timer, they would have learned: let the backoff complete; use the `--previous` flag to see logs from the last crash.
- Footgun #3: If the engineer had read the primer's section on wrong image tag, they would have learned: the CI pipeline should build, push, and deploy as an atomic sequence.
- Footgun #4: If the engineer had read the primer's section on OOM without limits, they would have learned: set memory limits; check exit codes (137 = OOMKilled, 1 = app error).
## Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice