Node Maintenance Footguns

Mistakes that cause dropped traffic, stuck drains, or cluster-wide disruption during routine node operations.


1. Draining without checking PDB headroom

You drain a node without checking PodDisruptionBudgets. The drain hangs indefinitely because a PDB requires minAvailable: 2 and only 2 replicas exist — evicting one would violate the budget.

What happens: The drain blocks forever. Your maintenance window passes with nothing accomplished.

Why: Drains respect PDBs. If evicting a pod would violate a PDB, the drain waits until the budget allows it.

How to avoid: Check kubectl get pdb -A before draining. If ALLOWED DISRUPTIONS is 0, either scale up the workload first or coordinate with the application team.

Debug clue: When a drain hangs, the drain output itself reports eviction errors like "cannot evict pod as it would violate the pod's disruption budget" and retries, naming the blocked pod. kubectl describe pdb <name> shows the current and required counts. If ALLOWED DISRUPTIONS shows 0 for a PDB, the math is allowed = currentHealthy - desiredHealthy, where desiredHealthy is minAvailable (or replicas - maxUnavailable). Scale up the deployment to create headroom before draining.
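A pre-drain check along these lines can catch the zero-headroom case before the maintenance window starts. This is a minimal sketch, assuming kubectl access; the jsonpath filter keys off each PDB's status.disruptionsAllowed field, and the flags on the drain itself are the ones discussed throughout this list.

```shell
#!/usr/bin/env bash
# Sketch: refuse to drain while any PDB in the cluster has zero disruption
# headroom. For large clusters, scope the check to pods on the target node.
set -euo pipefail

pdbs_without_headroom() {
  # PDBs whose status.disruptionsAllowed is currently 0
  kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

drain_if_safe() {
  local node="$1" blocked
  blocked="$(pdbs_without_headroom)"
  if [ -n "$blocked" ]; then
    echo "Refusing to drain $node; these PDBs allow zero disruptions:" >&2
    echo "$blocked" >&2
    return 1
  fi
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

if [ "$#" -ge 1 ]; then drain_if_safe "$1"; fi
```

Scaling up the affected deployment by one replica is usually the fastest way to turn a 0 into a 1 here.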


2. Forgetting --ignore-daemonsets

You run kubectl drain worker-03 without --ignore-daemonsets. The drain immediately errors out because DaemonSet pods cannot be evicted (they are managed per-node).

What happens: Drain fails with an error about DaemonSet-managed pods.

Why: DaemonSet pods are supposed to run on every node. Evicting them would violate their contract.

How to avoid: Always include --ignore-daemonsets. DaemonSet pods will be terminated when the node shuts down and recreated when it comes back.
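A typical drain invocation, wrapped as a helper so the flags travel together (the node name worker-03 used elsewhere in this list is a placeholder):

```shell
drain_node() {
  # --ignore-daemonsets: skip DaemonSet pods, which cannot be evicted;
  # --delete-emptydir-data: allow eviction of pods using emptyDir volumes;
  # --timeout: do not wait forever on a stuck pod (see footgun 4).
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# drain_node worker-03
```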


3. Draining multiple nodes simultaneously

You drain 3 out of 5 worker nodes at once to save time. All pods get squeezed onto 2 nodes. Those nodes run out of resources. Pods go Pending. Services are degraded.

What happens: Cluster capacity drops below what workloads need. Cascading scheduling failures.

Why: Draining removes all non-DaemonSet pods. If remaining nodes cannot absorb the evicted pods, workloads are disrupted.

How to avoid: Drain one node at a time. Wait for pods to reschedule and pass health checks before moving to the next node. Automate with a rolling script that verifies between each node.
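A rolling script along these lines enforces the one-at-a-time rule. It is a sketch under simplifying assumptions: the "settled" check here only waits until no pods are Pending, which is a proxy for the readiness verification your workloads actually need.

```shell
#!/usr/bin/env bash
# Sketch: rolling node maintenance, one node at a time, waiting for the
# cluster to settle between nodes.
set -euo pipefail

wait_until_settled() {
  # Simplified check: block while any pod is still Pending.
  until [ -z "$(kubectl get pods -A --field-selector=status.phase=Pending -o name)" ]; do
    sleep 10
  done
}

rolling_drain() {
  local node
  for node in "$@"; do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
    # ... perform maintenance on $node here ...
    kubectl uncordon "$node"
    wait_until_settled
    sleep 60   # minimum pause before touching the next node
  done
}

if [ "$#" -ge 1 ]; then rolling_drain "$@"; fi
```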


4. Not using --timeout on drain

You drain a node and walk away. A pod with a long terminationGracePeriodSeconds (600 seconds) or a stuck finalizer blocks the drain for 10+ minutes. You come back to find the maintenance incomplete.

What happens: Drain hangs indefinitely on a single stuck pod.

Why: Without --timeout, drain waits forever for all pods to terminate gracefully.

How to avoid: Always set --timeout=300s (or appropriate for your workloads). If the drain times out, investigate the stuck pod rather than forcing it blindly.

Gotcha: Pods with finalizers can also block drains indefinitely. Even with --timeout, a pod whose finalizer controller is broken will stay in Terminating state. Check with kubectl get pod <pod> -o jsonpath='{.metadata.finalizers}'. If the finalizer controller is dead, you may need to remove the finalizer manually with kubectl patch — but understand what the finalizer was protecting first.
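The finalizer inspection above can be sketched as two helpers; the pod and namespace are placeholders. The patch sets metadata.finalizers to null, which a JSON merge patch treats as deleting the field.

```shell
# Sketch: inspect a stuck pod's finalizers, and clear them only as a last
# resort. Removing a finalizer skips whatever cleanup it was guarding.
show_finalizers() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.metadata.finalizers}'
}

clear_finalizers() {
  # Only after confirming the finalizer's controller is dead.
  kubectl patch pod "$2" -n "$1" --type=merge -p '{"metadata":{"finalizers":null}}'
}

# show_finalizers default stuck-pod
```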


5. Using --force without understanding what it does

You add --force to get past a drain failure. --force lets drain delete pods that have no controller (no ReplicaSet, StatefulSet, Job, DaemonSet, or ReplicationController owner), which drain otherwise refuses to touch. Those standalone pods are permanently deleted; no controller recreates them.

What happens: Standalone pods are killed and never come back. If they were running critical one-off tasks, that work is lost.

Why: --force allows eviction of unmanaged pods, which are not recreated by any controller.

How to avoid: Check what standalone pods exist before using --force. If they are important, migrate them manually first. Or ensure all workloads are managed by controllers.
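Checking for standalone pods can be sketched like this; it assumes jq is installed, and the node name is a placeholder. Pods with no ownerReferences are exactly the ones --force would delete permanently.

```shell
# Sketch: list controller-less ("naked") pods on a node before reaching
# for --force.
standalone_pods_on() {
  kubectl get pods -A --field-selector "spec.nodeName=$1" -o json \
    | jq -r '.items[]
             | select(.metadata.ownerReferences == null)
             | "\(.metadata.namespace)/\(.metadata.name)"'
}

# standalone_pods_on worker-03   # empty output means --force is safe(r)
```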


6. Skipping the dry-run

You drain a node in production without a dry run. The drain evicts a critical single-replica pod that has no PDB, causing a brief outage.

What happens: Unexpected eviction of pods you did not know were on that node.

Why: You did not check what would be evicted before evicting it.

How to avoid: Always run kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=client first. Review the list of pods that would be evicted.


7. Forgetting to uncordon after maintenance

You cordon and drain a node, perform maintenance, and forget to uncordon. The node stays SchedulingDisabled for days. Cluster capacity is reduced. Other nodes get overloaded. Nobody notices until scheduling pressure causes Pending pods.

What happens: Wasted cluster capacity. Increased pressure on remaining nodes.

Why: Cordoning is a manual state that persists until explicitly reversed.

How to avoid: Include uncordon in your maintenance script. Set a reminder. Monitor for nodes in SchedulingDisabled state.
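Monitoring for forgotten cordons is cheap: cordoning sets spec.unschedulable on the node object, which a jsonpath filter can report. A sketch suitable for a cron job or dashboard check:

```shell
# Sketch: report nodes still cordoned (SchedulingDisabled).
cordoned_nodes() {
  kubectl get nodes -o jsonpath='{range .items[?(@.spec.unschedulable==true)]}{.metadata.name}{"\n"}{end}'
}

# cordoned_nodes   # empty output means nothing was left cordoned
```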


8. Not waiting for pods to reschedule between nodes

You drain node A, immediately drain node B. Pods evicted from A are still starting up on other nodes. Some have not passed readiness checks. Now pods from B are also being evicted. Total available capacity drops dangerously.

What happens: Service disruption from too many pods in transition simultaneously.

Why: Pod rescheduling takes time — image pulls, startup probes, readiness checks. Draining too fast outpaces the cluster's ability to absorb.

How to avoid: After each drain-and-uncordon cycle, wait for all pods to be Running and Ready before proceeding to the next node. A 60-second pause between nodes is a reasonable minimum.
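The settle step can lean on kubectl wait rather than hand-rolled polling. A sketch, with the namespace as a placeholder; note that kubectl wait on --all pods will also consider completed pods, so scope it to the namespaces that matter.

```shell
# Sketch: block until pods in a namespace report Ready, then pause before
# the next node.
settle() {
  kubectl wait --for=condition=Ready pods --all -n "$1" --timeout=600s
  sleep 60   # minimum pause between nodes
}

# settle production
```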


9. Control plane maintenance without checking etcd quorum

You drain a control plane node in a 3-node cluster without checking etcd health. A second etcd member happens to be unhealthy. Draining the node takes etcd below quorum. The API server becomes unavailable.

What happens: Complete cluster control plane outage.

Why: etcd requires a majority of members (2 of 3) for quorum. If one was already down, draining another loses quorum.

How to avoid: Always check etcdctl endpoint health --cluster before maintaining a control plane node. Never drain more than one control plane node at a time.

Remember the etcd quorum math: 3 members need 2 for quorum (tolerates 1 loss); 5 need 3 (tolerates 2). A cluster with 1 unhealthy member plus 1 drained member has no quorum. Always verify the health of ALL members before draining ANY control plane node: etcdctl endpoint status --cluster -w table.
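The quorum arithmetic and the preflight check can be sketched together. The etcdctl endpoint and certificate flags vary by installation and are omitted here for brevity.

```shell
# Sketch: quorum math plus a pre-drain etcd health check.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_losses() { echo $(( $1 - ($1 / 2 + 1) )); }

etcd_preflight() {
  ETCDCTL_API=3 etcdctl endpoint health --cluster \
    && ETCDCTL_API=3 etcdctl endpoint status --cluster -w table
}

# quorum 3 -> 2, tolerated_losses 3 -> 1, tolerated_losses 5 -> 2
```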