Kubernetes Node Lifecycle Footguns¶
Mistakes that cause stuck drains, workload outages, and failed maintenance windows.
1. Draining a node without checking PodDisruptionBudgets first¶
You run kubectl drain node01 --ignore-daemonsets --delete-emptydir-data. The drain hangs for 45 minutes. A PDB says minAvailable: 1 and the deployment has exactly 1 replica. The drain cannot evict the pod without violating the PDB. Your maintenance window is blown.
Why people do it: kubectl drain feels like a simple operation. PDBs are invisible until they block you. Nobody checks PDBs before draining.
Fix: Before draining, check PDBs: kubectl get pdb --all-namespaces. For each PDB, verify the deployment has enough replicas that eviction is allowed. If a PDB blocks drain, scale up the deployment first, then drain, then scale back down.
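The pre-drain check can be scripted. A minimal sketch, assuming kubectl and jq are installed and the current context points at the target cluster, that lists every PDB currently allowing zero disruptions:

```shell
#!/usr/bin/env bash
# Sketch: flag PDBs that would block a drain (disruptionsAllowed == 0).
# Assumes kubectl and jq are available and the current kubeconfig
# context points at the target cluster.
blocked_pdbs() {
  kubectl get pdb -A -o json | jq -r '
    .items[]
    | select(.status.disruptionsAllowed == 0)
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Only query the cluster when explicitly asked.
if [ "${1:-}" = "--check" ]; then blocked_pdbs; fi
```

Run it with --check before the maintenance window; any PDB it prints will stall the drain until the corresponding workload is scaled up.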
2. Using kubectl drain --force to bypass stuck evictions¶
The drain is stuck. You add --force. This lets drain delete pods that have no controller -- standalone (bare) pods and anything without an owning ReplicaSet, Job, DaemonSet, or StatefulSet. Those pods are gone permanently -- no controller recreates them. StatefulSet pods with local data lose that data.
Why people do it: The drain is stuck, the maintenance window is closing, and --force is right there in the help text.
Fix: Identify what is blocking the drain: kubectl get pods --field-selector spec.nodeName=<node> -A. Fix the blocker -- scale up a deployment, adjust the PDB, or cordon and manually migrate specific pods. --force is a last resort, not a default.
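Finding the pods --force would destroy can also be scripted. A sketch, assuming kubectl and jq, that lists pods on the node with no ownerReferences -- the ones no controller will recreate:

```shell
#!/usr/bin/env bash
# Sketch: list pods on a node that --force would delete permanently
# (no controller ownerReference), so they can be migrated deliberately.
# Assumes kubectl and jq; pass the node name as the first argument.
unmanaged_pods() {
  local node="$1"
  kubectl get pods -A --field-selector "spec.nodeName=${node}" -o json \
    | jq -r '.items[]
        | select((.metadata.ownerReferences // []) | length == 0)
        | "\(.metadata.namespace)/\(.metadata.name)"'
}

if [ -n "${1:-}" ]; then unmanaged_pods "$1"; fi
```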
3. Cordoning a node but forgetting to drain it¶
You run kubectl cordon node01 before maintenance. You think the node is safe to reboot. But cordon only prevents new pods from scheduling -- existing pods keep running. You reboot the node. All running pods die ungracefully. Services with single replicas on that node go down.
Why people do it: "Cordon" sounds like it prepares the node for maintenance. The distinction between "stop scheduling" (cordon) and "move workloads off" (drain) is not intuitive.
Fix: Cordon then drain, always in sequence: kubectl cordon node01 && kubectl drain node01 --ignore-daemonsets --delete-emptydir-data. Never reboot a cordoned-but-undrained node.
Gotcha: kubectl drain implicitly cordons the node as its first step, so kubectl drain node01 --ignore-daemonsets --delete-emptydir-data alone is sufficient -- the explicit cordon before drain is redundant but harmless. The danger is the reverse: cordoning WITHOUT draining, then assuming the node is safe to reboot.
4. Not monitoring kubelet certificate expiration¶
Kubelet client certificates expire. When they do, the node stops communicating with the API server. The node goes NotReady. Pods get evicted after the eviction timeout (default 5 minutes). You see the symptom (NotReady) but not the cause (expired cert) and start debugging network, DNS, or kubelet crashes.
Why people do it: Certificate rotation is supposed to be automatic. It usually is. When it fails (clock skew, RBAC issue, CSR approval backlog), the failure mode is silent until it is too late.
Fix: Monitor certificate expiration: kubeadm certs check-expiration or check the kubelet serving cert directly. Alert when certs are within 30 days of expiry. Verify auto-rotation is working: kubectl get csr should show recently approved CSRs.
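The on-node check can be scripted with openssl. A sketch, assuming the default kubeadm certificate path (adjust the path for your distro or provisioner):

```shell
#!/usr/bin/env bash
# Sketch: warn when the kubelet client cert expires within N days.
# Assumes openssl on the node and the default kubeadm cert location;
# /var/lib/kubelet/pki/kubelet-client-current.pem may differ elsewhere.
CERT="${CERT:-/var/lib/kubelet/pki/kubelet-client-current.pem}"
WARN_DAYS="${WARN_DAYS:-30}"

cert_ok() {
  # openssl -checkend takes seconds: exit 0 if still valid at that point.
  openssl x509 -in "$1" -noout -checkend $(( $2 * 86400 ))
}

if [ -r "$CERT" ]; then
  if cert_ok "$CERT" "$WARN_DAYS"; then
    echo "kubelet cert valid for at least ${WARN_DAYS} more days"
  else
    echo "WARNING: kubelet cert expires within ${WARN_DAYS} days" >&2
  fi
fi
```

Wire the WARNING line into whatever alerting your nodes already report through.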
5. Rebooting nodes in parallel during OS patching¶
You have 6 nodes. You push a kernel update and reboot all 6 at once (or 3 at once in a "rolling" fashion that is too aggressive). Pods are evicted simultaneously. Anti-affinity rules cannot be satisfied because not enough nodes are available. PDBs are violated. Services go down.
Why people do it: Patching is urgent (CVE). Doing nodes one at a time takes hours. "Rolling reboot" scripts often have insufficient wait-between-nodes.
Fix: Reboot one node at a time. After each node comes back, wait for it to reach Ready state and for all pods to reschedule before proceeding to the next: kubectl wait --for=condition=Ready node/<node> --timeout=300s. Use node upgrade controllers (kured, system-upgrade-controller) that respect PDBs.
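The one-node-at-a-time discipline can be sketched as a strictly serial loop. Assumptions: passwordless SSH to each node, kubectl pointed at the cluster, node names passed as arguments, and a fixed pause that you should tune to your hardware's reboot time:

```shell
#!/usr/bin/env bash
# Sketch: serial reboot loop. The next node is touched only after the
# previous one is Ready and uncordoned. SSH access and the 60s pause
# are assumptions; adjust both for your environment.
reboot_one() {
  local node="$1"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=600s
  ssh "$node" sudo reboot || true   # the connection drops as the node goes down
  sleep 60                          # give the node time to actually go NotReady
  kubectl wait --for=condition=Ready "node/${node}" --timeout=600s
  kubectl uncordon "$node"
}

for node in "$@"; do
  reboot_one "$node"
done
```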
6. Ignoring the node NotReady grace period and eviction timeout¶
A node drops NotReady for a few minutes during a transient network issue. The eviction delay expires, pods are evicted and rescheduled, and then the node comes back. The cluster heals, but now you have double-scheduled pods, confused service discovery, and split-brain stateful workloads. The default eviction delay (5 minutes) was too aggressive for your environment.
Why people do it: The default eviction timeout is a reasonable starting point. But in environments with flaky networks or slow health checks, the default causes unnecessary churn.
Fix: On current clusters eviction is taint-based, and the controller manager's --pod-eviction-timeout flag no longer has any effect: the 5-minute delay comes from the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable NoExecute tolerations (tolerationSeconds: 300) that the API server adds to pods by default. Tune it per pod instead: set a longer tolerationSeconds (say, 600) on stateful workloads and anything sensitive to churn in environments with occasional network blips.
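One way to set such tolerations is a JSON patch on the workload. A sketch, with a hypothetical Deployment named my-app and a 10-minute tolerance (the taint keys and 300s default are real; the name and 600s value are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: raise the taint-based eviction delay for one Deployment's
# pods from the 300s default to 600s. "my-app" is hypothetical.
patch_tolerations() {
  kubectl patch deployment "$1" --type=json -p '[
    {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
      {"key": "node.kubernetes.io/not-ready",   "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 600},
      {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 600}
    ]}
  ]'
}

if [ -n "${1:-}" ]; then patch_tolerations "$1"; fi
```

Note this replaces any tolerations already on the pod template; merge by hand if the workload carries others.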
7. Upgrading kubelet without draining the node first¶
You upgrade the kubelet binary on a live node. The kubelet restarts. During the restart window (30-90 seconds), the node reports NotReady. Pods are not gracefully evicted -- they just stop getting health checks. If the kubelet upgrade changes the container runtime interface, running containers may be killed.
Why people do it: "It's just a binary upgrade, it'll restart in seconds." The kubelet usually does come back quickly. But the version skew policy (kubelet may lag the API server, never lead it) and runtime compatibility are not checked pre-upgrade.
Fix: Always drain before upgrading kubelet: cordon, drain, upgrade kubelet, restart kubelet, uncordon. Verify the new kubelet version is within the supported skew: kubectl get nodes should show the correct version. One node at a time.
Remember: Kubernetes version skew policy: kubelet can be up to 2 minor versions older than the API server (3 from Kubernetes 1.28 onward) -- e.g., API server 1.28, kubelet 1.26. But kubelet can NEVER be newer than the API server. The upgrade order is always: API server first, then kubelets. Downgrading kubelet below the supported skew causes silent failures -- nodes appear Ready but pod scheduling and eviction behave unpredictably.
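The drain, upgrade, verify, uncordon sequence can be sketched per node. Assumptions: Debian-style nodes reached over SSH with kubelet installed from the apt repo (package and service names vary by distro):

```shell
#!/usr/bin/env bash
# Sketch: upgrade kubelet on one node, drain-first. Assumes SSH access
# and apt packaging; adapt the install line for rpm-based distros.
upgrade_kubelet() {
  local node="$1" version="$2"   # e.g. upgrade_kubelet node01 1.28.4-1.1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  ssh "$node" "sudo apt-get install -y kubelet=${version} \
    && sudo systemctl restart kubelet"
  kubectl wait --for=condition=Ready "node/${node}" --timeout=300s
  # confirm the reported version is the one we installed
  kubectl get node "$node" -o jsonpath='{.status.nodeInfo.kubeletVersion}{"\n"}'
  kubectl uncordon "$node"
}

if [ $# -ge 2 ]; then upgrade_kubelet "$1" "$2"; fi
```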
8. Scaling the cluster down without checking resource reservations¶
You remove 2 nodes from a 6-node cluster to save costs. The remaining 4 nodes do not have enough allocatable CPU/memory for all pods. Pods go Pending with Insufficient cpu or Insufficient memory. The cluster autoscaler (if configured) adds nodes back, but now you are in a scaling loop.
Why people do it: The nodes look underutilized in monitoring (30% CPU average). But resource requests (not actual usage) determine scheduling. Requests may reserve 80% of capacity even if actual usage is 30%.
Fix: Before removing nodes, check scheduled vs allocatable: kubectl describe nodes | grep -A5 "Allocated resources". Sum all pod requests across the cluster and verify the remaining nodes have enough headroom. Account for daemonsets, system pods, and scheduling constraints (affinity, taints).
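The requests-vs-allocatable math can be scripted. A sketch, assuming kubectl and jq; the arithmetic lives in headroom_m so it can be checked independently of a cluster:

```shell
#!/usr/bin/env bash
# Sketch: compare total CPU requests against total allocatable CPU,
# both in millicores. Assumes kubectl and jq.
# headroom_m TOTAL_ALLOCATABLE_MILLICPU TOTAL_REQUESTED_MILLICPU
headroom_m() { echo $(( $1 - $2 )); }

cluster_headroom() {
  local alloc req
  # allocatable comes as "4" (cores) or "3920m"; normalize to millicores
  alloc=$(kubectl get nodes -o json | jq '
    [.items[].status.allocatable.cpu
     | if endswith("m") then rtrimstr("m") | tonumber
       else tonumber * 1000 end] | add')
  req=$(kubectl get pods -A -o json | jq '
    [.items[].spec.containers[].resources.requests.cpu // "0"
     | if endswith("m") then rtrimstr("m") | tonumber
       else tonumber * 1000 end] | add')
  echo "cluster CPU headroom: $(headroom_m "$alloc" "$req")m"
}
```

Subtract the allocatable CPU of the nodes you plan to remove from the headroom figure; if the result goes negative, pods will go Pending. Repeat the same query for memory.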
9. Adding nodes to a cluster without matching taints and labels¶
You add new nodes to handle increased load. The new nodes come up clean -- no taints, no labels. Pods that require specific labels (node selectors, affinity rules) cannot schedule on the new nodes. Pods that should NOT run on the new nodes (because of missing taints) flood them. You end up with unbalanced, misconfigured scheduling.
Why people do it: Node provisioning automation sets up kubelet but does not replicate the taint/label configuration from existing nodes. The node joins the cluster and "looks healthy."
Fix: Automate taint and label application as part of node bootstrap. Use a node-labeling daemonset or cloud-init/userdata script. Verify with kubectl get nodes --show-labels and kubectl describe node <new-node> | grep Taints immediately after join.
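A bootstrap step along these lines works; the pool label and taint key below are hypothetical placeholders -- take the real values from the same source of truth your existing nodes use:

```shell
#!/usr/bin/env bash
# Sketch: apply the labels/taints a new node needs, then verify.
# "node-pool=general" and "dedicated=batch" are placeholder values.
label_and_taint() {
  local node="$1"
  kubectl label node "$node" node-pool=general --overwrite
  kubectl taint node "$node" dedicated=batch:NoSchedule --overwrite
  # verify the node matches its peers before letting workloads land
  kubectl get node "$node" --show-labels
  kubectl describe node "$node" | grep -i taints
}

if [ -n "${1:-}" ]; then label_and_taint "$1"; fi
```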
10. Deleting a node object from the API server before cleaning up pods¶
You run kubectl delete node <node> to remove a dead node. The node object disappears from the API. But pods that were Terminating on that node can stay that way forever -- the kubelet that would finalize the termination no longer exists. ReplicaSets schedule replacements regardless, but StatefulSets will not, and the ghost pods keep consuming resource quota.
Why people do it: The node is truly dead (hardware failure, terminated VM). Deleting the node object feels like cleanup.
Fix: Force-delete the stuck pods first, one namespace at a time (piping -A output straight to xargs drops the namespace): kubectl get pods -n <ns> --field-selector spec.nodeName=<node> -o name | xargs kubectl delete -n <ns> --grace-period=0 --force. Then delete the node object. Or run kubectl drain <node> --force --timeout=60s first, then delete the node object.
Debug clue: Ghost pods in Terminating state from deleted nodes consume ResourceQuota but are never cleaned up. Check for them with kubectl get pods -A | grep Terminating. If the node object is already deleted, the only cleanup is kubectl delete pod <name> --grace-period=0 --force -n <ns>. StatefulSets are especially affected -- they will not create a replacement pod until the old one is fully terminated.
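Finding and clearing ghost pods can be scripted. A sketch, assuming kubectl and jq: it lists pods whose spec.nodeName no longer matches any live node, then force-deletes them namespace by namespace:

```shell
#!/usr/bin/env bash
# Sketch: find pods bound to node names that no longer exist in the
# API, then force-delete them with the namespace preserved.
ghost_pods() {
  local live
  live=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
  kubectl get pods -A -o json | jq -r --arg live "$live" '
    .items[]
    | .spec.nodeName as $n
    | select($n != null)
    | select((($live | split(" ")) | index($n)) | not)
    | "\(.metadata.namespace) \(.metadata.name)"'
}

cleanup_ghosts() {
  ghost_pods | while read -r ns name; do
    kubectl delete pod "$name" -n "$ns" --grace-period=0 --force
  done
}

if [ "${1:-}" = "--list" ]; then ghost_pods; fi
if [ "${1:-}" = "--apply" ]; then cleanup_ghosts; fi
```

Run with --list first and eyeball the output; --force on the wrong pod is exactly footgun #2 again.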