Node Maintenance Footguns¶
Mistakes that cause dropped traffic, stuck drains, or cluster-wide disruption during routine node operations.
1. Draining without checking PDB headroom¶
You drain a node without checking PodDisruptionBudgets. The drain hangs indefinitely because a PDB requires minAvailable: 2 and only 2 replicas exist — evicting one would violate the budget.
What happens: The drain blocks forever. Your maintenance window passes with nothing accomplished.
Why: Drains respect PDBs. If evicting a pod would violate a PDB, the drain waits until the budget allows it.
How to avoid: Check kubectl get pdb -A before draining. If ALLOWED DISRUPTIONS is 0, either scale up the workload first or coordinate with the application team.
Debug clue: When a drain hangs, kubectl get events -A --field-selector reason=EvictionFailed shows which PDB is blocking. The event message includes the PDB name and the current/required counts. If ALLOWED DISRUPTIONS shows 0 for a PDB, the formula is currentHealthy - minAvailable (or replicas - maxUnavailable). Scale up the deployment to create headroom before draining.
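For example, a quick pre-drain check might look like this (the deployment and namespace names are placeholders):
# Show every PDB and how many disruptions each can currently tolerate
kubectl get pdb -A
# Zero headroom? Scale the workload up first to create room for one eviction
kubectl scale deployment my-app -n my-ns --replicas=3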
2. Forgetting --ignore-daemonsets¶
You run kubectl drain worker-03 without --ignore-daemonsets. The drain immediately errors out because DaemonSet pods cannot be evicted (they are managed per-node).
What happens: Drain fails with an error about DaemonSet-managed pods.
Why: DaemonSet pods are supposed to run on every node. Evicting them would violate their contract.
How to avoid: Always include --ignore-daemonsets. DaemonSet pods will be terminated when the node shuts down and recreated when it comes back.
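A typical drain invocation (worker-03 is a placeholder node name):
# DaemonSet pods are skipped rather than evicted; emptyDir data is discarded
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data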
3. Draining multiple nodes simultaneously¶
You drain 3 out of 5 worker nodes at once to save time. All pods get squeezed onto 2 nodes. Those nodes run out of resources. Pods go Pending. Services are degraded.
What happens: Cluster capacity drops below what workloads need. Cascading scheduling failures.
Why: Draining removes all non-DaemonSet pods. If remaining nodes cannot absorb the evicted pods, workloads are disrupted.
How to avoid: Drain one node at a time. Wait for pods to reschedule and pass health checks before moving to the next node. Automate with a rolling script that verifies between each node.
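A minimal rolling sketch, assuming each node is maintained and uncordoned inside the loop (the node list and timeouts are placeholders for your environment):
#!/usr/bin/env bash
set -euo pipefail
NODES="worker-01 worker-02 worker-03"
for node in $NODES; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # ... perform the actual maintenance on $node here ...
  kubectl uncordon "$node"
  # Don't move on while any pod is still waiting to be scheduled
  while [ -n "$(kubectl get pods -A --field-selector=status.phase=Pending -o name)" ]; do
    sleep 10
  done
done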
4. Not using --timeout on drain¶
You drain a node and walk away. A pod with a long terminationGracePeriodSeconds (600 seconds) or a stuck finalizer blocks the drain for 10+ minutes. You come back to find the maintenance incomplete.
What happens: Drain hangs indefinitely on a single stuck pod.
Why: Without --timeout, drain waits forever for all pods to terminate gracefully.
How to avoid: Always set --timeout=300s (or a value appropriate for your workloads). If the drain times out, investigate the stuck pod rather than forcing it blindly.
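For example (worker-03 is a placeholder node name):
# Cap the drain at five minutes, then investigate instead of forcing
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --timeout=300s
# If it times out, see which pods are still on the node and why
kubectl get pods -A --field-selector spec.nodeName=worker-03 -o wide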
Gotcha: Pods with finalizers can also block drains indefinitely. Even with --timeout, a pod whose finalizer controller is broken will stay in Terminating state. Check with kubectl get pod <pod> -o jsonpath='{.metadata.finalizers}'. If the finalizer controller is dead, you may need to remove the finalizer manually with kubectl patch, but understand what the finalizer was protecting first.
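A sketch of those finalizer checks (pod and namespace names are placeholders; only patch after confirming the controller is truly gone):
# Inspect which finalizers are still attached to the stuck pod
kubectl get pod stuck-pod -n my-ns -o jsonpath='{.metadata.finalizers}'
# Last resort: clear the finalizer list so the pod can actually terminate
kubectl patch pod stuck-pod -n my-ns --type=merge -p '{"metadata":{"finalizers":null}}'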
5. Using --force without understanding what it does¶
You add --force to get past a drain failure. --force allows eviction of pods not managed by any controller (ReplicaSet, StatefulSet, Job, DaemonSet, or ReplicationController), i.e. standalone pods. Those pods are permanently deleted; no controller recreates them.
What happens: Standalone pods are killed and never come back. If they were running critical one-off tasks, that work is lost.
Why: --force allows eviction of unmanaged pods, which are not recreated by any controller.
How to avoid: Check what standalone pods exist before using --force. If they are important, migrate them manually first. Or ensure all workloads are managed by controllers.
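One way to spot standalone pods on the node before reaching for --force (assumes jq is available; worker-03 is a placeholder):
# Pods with no ownerReferences have no controller and will not be recreated
kubectl get pods -A --field-selector spec.nodeName=worker-03 -o json \
  | jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.metadata.name)"'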
6. Skipping the dry-run¶
You drain a node in production without a dry run. The drain evicts a critical single-replica pod that has no PDB, causing a brief outage.
What happens: Unexpected eviction of pods you did not know were on that node.
Why: You did not check what would be evicted before evicting it.
How to avoid: Always run kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=client first. Review the list of pods that would be evicted.
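To go one step further than the dry run, one rough heuristic (assuming jq is available; worker-03 is a placeholder) is to list single-replica Deployments, since those are the workloads a drain can take down with no surviving replica:
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --dry-run=client
# Rough cross-check: single-replica Deployments have no headroom to lose a pod
kubectl get deploy -A -o json \
  | jq -r '.items[] | select(.spec.replicas == 1) | "\(.metadata.namespace)/\(.metadata.name)"'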
7. Forgetting to uncordon after maintenance¶
You cordon and drain a node, perform maintenance, and forget to uncordon. The node stays SchedulingDisabled for days. Cluster capacity is reduced. Other nodes get overloaded. Nobody notices until scheduling pressure causes Pending pods.
What happens: Wasted cluster capacity. Increased pressure on remaining nodes.
Why: Cordoning is a manual state that persists until explicitly reversed.
How to avoid: Include uncordon in your maintenance script. Set a reminder. Monitor for nodes in SchedulingDisabled state.
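A quick way to bring the node back and to spot any that were forgotten (worker-03 is a placeholder):
# Re-enable scheduling after maintenance
kubectl uncordon worker-03
# Periodic check (or alert) for nodes someone forgot to uncordon
kubectl get nodes | grep SchedulingDisabled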
8. Not waiting for pods to reschedule between nodes¶
You drain node A, immediately drain node B. Pods evicted from A are still starting up on other nodes. Some have not passed readiness checks. Now pods from B are also being evicted. Total available capacity drops dangerously.
What happens: Service disruption from too many pods in transition simultaneously.
Why: Pod rescheduling takes time — image pulls, startup probes, readiness checks. Draining too fast outpaces the cluster's ability to absorb.
How to avoid: After uncordoning each node, wait for all pods to be Running and Ready before proceeding to the next. A 60-second pause between nodes is a reasonable minimum.
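One way to gate on recovery between nodes (assumes jq is available; an empty result means every Deployment has its full replica count back):
# Pause, then list any Deployment that has not recovered all replicas yet
sleep 60
kubectl get deploy -A -o json \
  | jq -r '.items[] | select(.status.availableReplicas != .spec.replicas) | "\(.metadata.namespace)/\(.metadata.name)"'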
9. Control plane maintenance without checking etcd quorum¶
You drain a control plane node in a 3-node cluster without checking etcd health. A second etcd member happens to be unhealthy. Draining the node takes etcd below quorum. The API server becomes unavailable.
What happens: Complete cluster control plane outage.
Why: etcd requires a majority of members (2 of 3) for quorum. If one was already down, draining another loses quorum.
How to avoid: Always check etcdctl endpoint health --cluster before maintaining a control plane node. Never drain more than one control plane node at a time.
Remember the etcd quorum math: 3 members need 2 for quorum (tolerates 1 loss), 5 need 3 (tolerates 2). In a 3-member cluster, 1 already-unhealthy member plus 1 drained member means no quorum. Always verify the health of ALL members before draining ANY control plane node: etcdctl endpoint status --cluster -w table.
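A concrete pre-drain check, assuming a kubeadm-style stacked etcd with certificates under /etc/kubernetes/pki/etcd (adjust the endpoint and paths for your setup):
# Run on a control plane node; every endpoint should report healthy
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster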