Anti-Primer: Kubernetes Debugging Playbook

Everything that can go wrong, will — and in this story, it does.

The Setup

A platform engineer is debugging a failing deployment at 3 AM during an on-call shift. The pod is stuck in CrashLoopBackOff and the engineer starts making changes without following the team's debugging playbook.

The Timeline

Hour 0: Deleting the Crashing Pod First

The engineer immediately deletes the crashing pod instead of capturing its logs and describe output. The deadline was looming, and this seemed like the fastest path forward. The result: all diagnostic evidence is lost; the replacement pod crashes identically, but now there is no history to compare against.

Footgun #1: Deleting the Crashing Pod First — deleting the crashing pod before capturing logs and describe output destroys all diagnostic evidence; the new pod crashes identically, but there is no history to compare.
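The evidence-capture routine the primer prescribes can be sketched in a few kubectl commands. Pod and namespace names below are illustrative, not from the incident:

```shell
# Capture evidence BEFORE deleting anything (names are illustrative)
POD=payments-api-7d9f8-x2k4v   # the crashing pod
NS=production                  # its namespace

# Logs from the current and the previous (crashed) container instance
kubectl logs "$POD" -n "$NS" > crash-current.log
kubectl logs "$POD" -n "$NS" --previous > crash-previous.log

# Full pod state: restart counts, last exit code, recent events
kubectl describe pod "$POD" -n "$NS" > crash-describe.txt

# Only once the evidence is saved is it safe to delete the pod
kubectl delete pod "$POD" -n "$NS"
```

The --previous flag is the key detail: it retrieves logs from the container instance that just crashed, which is exactly the history that deleting the pod destroys.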

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Exec Into Wrong Pod

The engineer execs into the wrong replica and makes config changes that affect a healthy pod serving live traffic. Under time pressure, the team chose speed over caution. The result: the healthy pod starts misbehaving; now two pods are broken instead of one.

Footgun #2: Exec Into Wrong Pod — exec'ing into the wrong replica and changing config affects healthy traffic; the healthy pod starts misbehaving, and two pods are broken instead of one.
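A minimal sketch of the verification step the primer calls for: identify the failing pod by label before exec'ing anywhere. The label selector and names are illustrative:

```shell
# List only the pods for this workload and eyeball STATUS/RESTARTS
# (label selector and namespace are illustrative)
kubectl get pods -n production -l app=payments-api -o wide

# Pick the pod in CrashLoopBackOff (or with climbing RESTARTS),
# then exec with the pod name AND namespace spelled out explicitly
kubectl exec -it payments-api-7d9f8-x2k4v -n production -- /bin/sh
```

Listing by label first turns "exec into a pod" into "exec into the pod I just confirmed is the broken one" — a few seconds of checking against hours of collateral damage.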

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Ignoring Events

The engineer looks only at pod logs and misses the namespace events showing a failing init container. Nobody pushed back because the shortcut looked harmless in the moment. The result: two hours spent debugging the main container when the init container is the actual failure point.

Footgun #3: Ignoring Events — looking only at pod logs misses the namespace events that show a failing init container, costing two hours of debugging the main container when the init container is the actual failure point.
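The primer's first-step check, plus the follow-up that would have found the init container, might look like this (pod, namespace, and init-container names are illustrative):

```shell
# Check namespace events first -- they surface init-container failures
# that the main container's logs will never show
kubectl get events -n production --sort-by=.lastTimestamp

# List the init containers on the suspect pod...
kubectl get pod payments-api-7d9f8-x2k4v -n production \
  -o jsonpath='{.spec.initContainers[*].name}'

# ...then read the failing init container's logs directly with -c
kubectl logs payments-api-7d9f8-x2k4v -n production -c init-migrate
```

Pod logs default to the main container; the -c flag is what lets you see inside an init container at all.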

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Applying Fixes to the Wrong Namespace

The engineer applies a ConfigMap fix to the default namespace instead of the production namespace. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: the fix has no effect; the engineer concludes the fix itself is wrong and tries increasingly drastic changes.

Footgun #4: Applying Fixes to the Wrong Namespace — applying a ConfigMap fix to the default namespace instead of production means the fix has no effect; the engineer concludes the fix is wrong and tries increasingly drastic changes.
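A hedged sketch of the namespace discipline the primer recommends. The manifest filename is illustrative; kubens is the optional third-party context switcher the primer mentions:

```shell
# Always name the namespace explicitly; never rely on the current context
kubectl apply -f configmap-fix.yaml -n production

# Alternatively, pin the context's default namespace with kubens...
kubens production

# ...and confirm which namespace the current context points at
# before applying anything
kubectl config view --minify -o jsonpath='{..namespace}'
```

Either habit works; the failure mode is mixing them — assuming the context is pointed at production while it silently defaults to default.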

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Deleting the Crashing Pod First | All diagnostic evidence is lost; the new pod crashes identically but now there is no history to compare | Primer: Always capture logs and events before taking any remediation action |
| 2 | Exec Into Wrong Pod | The healthy pod starts misbehaving; now two pods are broken instead of one | Primer: Verify pod name and namespace before exec; use labels to identify the exact failing pod |
| 3 | Ignoring Events | Spends 2 hours debugging the main container when the init container is the actual failure point | Primer: Always check kubectl get events --sort-by=.lastTimestamp as a first step |
| 4 | Applying Fixes to the Wrong Namespace | Fix has no effect; engineer concludes the fix is wrong and tries increasingly drastic changes | Primer: Always specify --namespace explicitly; use kubectx/kubens for safety |

Damage Report

  • Downtime: 2-4 hours of pod-level or cluster-wide disruption
  • Data loss: Risk of volume data loss if StatefulSets were affected
  • Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
  • Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
  • Reputation and team cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer, section on deleting the crashing pod first, they would have learned: Always capture logs and events before taking any remediation action.
  • Footgun #2: If the engineer had read the primer, section on exec into wrong pod, they would have learned: Verify pod name and namespace before exec; use labels to identify the exact failing pod.
  • Footgun #3: If the engineer had read the primer, section on ignoring events, they would have learned: Always check kubectl get events --sort-by=.lastTimestamp as a first step.
  • Footgun #4: If the engineer had read the primer, section on applying fixes to the wrong namespace, they would have learned: Always specify --namespace explicitly; use kubectx/kubens for safety.

Cross-References