# Anti-Primer: Kubernetes Node Lifecycle
Everything that can go wrong, will — and in this story, it does.
## The Setup
A cluster admin is performing a rolling node upgrade to patch a kernel CVE. The cluster runs 50 nodes with a mix of stateful and stateless workloads. The admin starts draining nodes at 2 PM on a Tuesday.
## The Timeline

### Hour 0: Draining Without PDBs

The admin runs `kubectl drain` on nodes hosting the only replicas of a stateful service, with no PodDisruptionBudgets in place. The deadline was looming, and this seemed like the fastest path forward. The result: all pods for the payment database proxy are evicted simultaneously, and transactions fail for 5 minutes.

Footgun #1: Draining Without PDBs — running `kubectl drain` with no PDBs protecting the only replicas of a stateful service evicts every pod for the payment database proxy at once; transactions fail for 5 minutes.
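A minimal PodDisruptionBudget sketch that would have blocked the simultaneous eviction. The `app: payment-db-proxy` label and the resource name are assumptions for illustration, not details from the incident:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-db-proxy-pdb   # hypothetical name
spec:
  minAvailable: 1              # drain's eviction API refuses to go below this
  selector:
    matchLabels:
      app: payment-db-proxy    # hypothetical label; match your actual pods
```

With this applied, `kubectl drain` evicts the protected pods one at a time and waits for replacements to become ready, instead of taking every replica down at once.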
Nobody notices yet. The engineer moves on to the next task.
### Hour 1: Cordoning But Not Draining

The admin cordons a node, assuming the pods will be gracefully migrated off it; in fact, cordoning only marks the node unschedulable, and the pods keep running. Under time pressure, the team chose speed over caution. The result: when the node is shut down for patching, the pods are killed abruptly, with no graceful shutdown.

Footgun #2: Cordoning But Not Draining — cordoning a node prevents new scheduling but evicts nothing; when the node is shut down for patching, its pods are killed abruptly without graceful shutdown.
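The difference in one pair of commands (node name hypothetical):

```shell
# Cordon: marks the node unschedulable. Every pod already on it keeps running.
kubectl cordon node-17

# Drain: cordons AND evicts. Pods are rescheduled elsewhere (PDBs permitting)
# before the node is shut down.
kubectl drain node-17 --ignore-daemonsets --delete-emptydir-data
```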
The first mistake is still invisible, making the next shortcut feel justified.
### Hour 2: Ignoring DaemonSet Pods

The admin forgets that draining a node running DaemonSet pods (logging, monitoring) requires the `--ignore-daemonsets` flag. Nobody pushed back because the shortcut looked harmless in the moment. The result: the drain errors out on the DaemonSet-managed pods; the engineer gives up and reboots the node without draining it.

Footgun #3: Ignoring DaemonSet Pods — forgetting that `kubectl drain` needs `--ignore-daemonsets` when DaemonSet pods are present; the drain fails, and the engineer reboots the node without draining it.
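What the correct invocation looks like, and what each flag concedes (node name hypothetical):

```shell
# --ignore-daemonsets: leave DaemonSet pods (logging, monitoring) in place.
#   They cannot be evicted anyway: their controller recreates them immediately,
#   and they return on their own when the node rejoins.
# --delete-emptydir-data: accept that emptyDir volume contents on this node
#   are lost when their pods are evicted.
kubectl drain node-17 --ignore-daemonsets --delete-emptydir-data
```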
Pressure is mounting. The team is behind schedule and cutting more corners.
### Hour 3: No Pod Grace Period Respect

The admin sets `--grace-period=0` on the drain to speed up the maintenance window. The team had gotten away with similar shortcuts before, so nobody raised a flag. The result: in-flight requests are dropped, database connections are not closed cleanly, and data is corrupted.

Footgun #4: No Pod Grace Period Respect — `--grace-period=0` overrides every pod's `terminationGracePeriodSeconds`, so in-flight requests are dropped, database connections are not closed cleanly, and data is corrupted.
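The damage is easy to reproduce outside Kubernetes. The sketch below (plain bash, hypothetical worker) shows what a grace period buys: the kubelet normally sends SIGTERM, waits up to the pod's `terminationGracePeriodSeconds`, and only then sends SIGKILL; `--grace-period=0` skips straight to the kill, so the cleanup handler never runs.

```shell
# A worker that drains in-flight work when asked to stop politely (SIGTERM).
graceful_worker() {
  trap 'echo "draining in-flight requests"; echo "clean shutdown"; exit 0' TERM
  echo "serving"
  while true; do sleep 0.2; done   # stand-in for request handling
}

graceful_worker &
pid=$!
sleep 0.5

# SIGTERM (what a normal drain sends): the trap runs and work is drained.
# SIGKILL (what --grace-period=0 amounts to) would skip the trap entirely.
kill -TERM "$pid"
wait "$pid"
status=$?
echo "worker exited with status $status"
```

A process killed with SIGKILL instead would exit non-zero with its connections still open — the shell-level analogue of the corrupted database writes above.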
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
## The Postmortem

### Root Cause Chain
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | Draining Without PDBs | All pods for the payment database proxy are evicted simultaneously; transactions fail for 5 minutes | Primer: Create PodDisruptionBudgets before any drain operation |
| 2 | Cordoning But Not Draining | When the node is shut down for patching, pods are killed abruptly without graceful shutdown | Primer: Cordon then drain; cordon alone only prevents new scheduling |
| 3 | Ignoring DaemonSet Pods | Drain errors out on DaemonSet-managed pods; engineer reboots the node without draining it | Primer: Use `--ignore-daemonsets --delete-emptydir-data` flags for node drain |
| 4 | No Pod Grace Period Respect | In-flight requests are dropped; database connections are not closed cleanly; data corruption | Primer: Respect pod `terminationGracePeriodSeconds`; plan maintenance windows accordingly |
### Damage Report
- Downtime: 2-4 hours of pod-level or cluster-wide disruption
- Data loss: Risk of volume data loss if StatefulSets were affected
- Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
- Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
- Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification
## What the Primer Teaches
- Footgun #1: If the engineer had read the primer's section on draining without PDBs, they would have learned: create PodDisruptionBudgets before any drain operation.
- Footgun #2: If the engineer had read the primer's section on cordoning but not draining, they would have learned: cordon then drain; cordon alone only prevents new scheduling.
- Footgun #3: If the engineer had read the primer's section on ignoring DaemonSet pods, they would have learned: use the `--ignore-daemonsets --delete-emptydir-data` flags for node drain.
- Footgun #4: If the engineer had read the primer's section on pod grace periods, they would have learned: respect pod `terminationGracePeriodSeconds` and plan maintenance windows accordingly.
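Taken together, the four lessons fold into one per-node procedure. A hedged sketch, not a definitive runbook — the node name is an argument and the `--timeout` value is an assumption to tune to your workloads:

```shell
#!/usr/bin/env bash
set -euo pipefail
node="$1"

# Lesson 1: confirm PDBs exist and have headroom before touching anything.
kubectl get pdb --all-namespaces

# Lesson 2: cordon, then drain — cordon alone migrates nothing.
kubectl cordon "$node"

# Lessons 3 and 4: pass the DaemonSet/emptyDir flags, keep the default
# grace period so terminationGracePeriodSeconds is honored, and fail
# loudly instead of waiting forever.
kubectl drain "$node" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m

# Patch the kernel and reboot out of band, then put the node back to work.
kubectl uncordon "$node"
```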
## Cross-References
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice