
Node Maintenance - Primer

Why This Matters

Kubernetes nodes need patching, upgrading, and occasional hardware replacement. Do it wrong and you drop production traffic, violate pod disruption budgets, or leave the cluster in a degraded state with unschedulable workloads. Node maintenance is one of the most common operational tasks in Kubernetes — and one of the most common sources of avoidable outages when done carelessly.

Core Concepts

1. Cordon, Drain, Uncordon — The Maintenance Lifecycle

Every node maintenance operation follows the same lifecycle (cordon, drain, maintain, uncordon):

# Step 1: Cordon — mark node as unschedulable (no new pods land here)
kubectl cordon worker-03

# Step 2: Drain — evict all pods gracefully
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --grace-period=120

# Step 3: Perform maintenance (OS patch, kubelet upgrade, hardware swap)

# Step 4: Uncordon — mark node as schedulable again
kubectl uncordon worker-03

Check node status at each step:

kubectl get nodes
# worker-03   Ready,SchedulingDisabled   <none>   45d   v1.28.3

SchedulingDisabled in the STATUS column means the node is cordoned: cordoning sets spec.unschedulable, which adds the node.kubernetes.io/unschedulable:NoSchedule taint. Existing pods keep running until drained.

Remember: The node maintenance mantra: "CDC" — Cordon, Drain, unCordon. Always in this order. Cordoning first prevents new pods from landing while you prepare the drain. Draining second evicts existing pods gracefully. Uncordoning last returns the node to service. (kubectl drain cordons the node automatically if you skip step 1, but an explicit cordon up front gives you a checkpoint to review what is running on the node before anything is evicted, and signals the maintenance window to other operators.)
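The lifecycle above can be sketched as a single wrapper function. This is an illustrative sketch, not a production tool: maintain_node is a hypothetical name, and KUBECTL defaults to echo so the sketch can be dry-run without a cluster.

```shell
# Illustrative sketch of the lifecycle as one function. KUBECTL defaults
# to "echo kubectl" so this can be dry-run without a cluster; set
# KUBECTL=kubectl to execute for real.
KUBECTL="${KUBECTL:-echo kubectl}"

maintain_node() {
    local node=$1
    $KUBECTL cordon "$node"                          # 1. no new pods land here
    $KUBECTL drain "$node" --ignore-daemonsets \
        --delete-emptydir-data --grace-period=120    # 2. evict existing pods gracefully
    # 3. ...perform OS patch / kubelet upgrade / hardware swap here...
    $KUBECTL uncordon "$node"                        # 4. return node to service
}

maintain_node worker-03
```

Running it with the echo default prints the exact kubectl commands in order, which doubles as a review step before a real run.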

2. Drain Flags That Matter

# Basic drain (will fail if there are pods not managed by a controller)
kubectl drain worker-03

# Production drain — handle DaemonSets, emptyDir, and local data
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=120 \
  --timeout=300s \
  --force

# Dry run first — see what would be evicted
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --dry-run=client

Flag                     Purpose
--ignore-daemonsets      Skip DaemonSet pods (they run on every node by design)
--delete-emptydir-data   Allow eviction of pods using emptyDir volumes (data will be lost)
--grace-period=N         Seconds to wait for graceful pod shutdown
--timeout=N              Abort the drain if it takes longer than N seconds
--force                  Evict pods not managed by a ReplicaSet/Job/DaemonSet/StatefulSet
--pod-selector           Only drain pods matching a label selector

3. PodDisruptionBudgets (PDBs)

PDBs tell Kubernetes how many pods of a given set must remain available during voluntary disruptions (like drains):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2          # At least 2 pods must stay running
  selector:
    matchLabels:
      app: api-server

# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  maxUnavailable: 1        # At most 1 pod can be down at a time
  selector:
    matchLabels:
      app: background-worker

# Check PDB status before draining
kubectl get pdb -A
# NAME       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb    2               N/A               1                     30d

# If ALLOWED DISRUPTIONS is 0, the drain will block until a pod becomes available

A drain that violates a PDB will hang (not fail) until the budget allows eviction. This is by design — it prevents you from accidentally taking down too many replicas.
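Because a blocked drain hangs silently, it can help to wait on the budget explicitly before draining. A minimal sketch, assuming kubectl access to the cluster; wait_for_headroom is a hypothetical helper name, and .status.disruptionsAllowed is the live field the eviction API consults.

```shell
# Sketch: block until a named PDB has eviction headroom before draining.
# wait_for_headroom is a hypothetical helper; assumes kubectl can reach
# the cluster. A drain consumes .status.disruptionsAllowed.
wait_for_headroom() {
    local ns=$1 pdb=$2 allowed
    while true; do
        allowed=$(kubectl get pdb "$pdb" -n "$ns" \
            -o jsonpath='{.status.disruptionsAllowed}')
        [ "${allowed:-0}" -gt 0 ] && break
        echo "PDB ${ns}/${pdb} allows 0 disruptions; waiting..."
        sleep 10
    done
}
```

Calling wait_for_headroom production api-pdb before the drain turns a silent hang into an explicit, logged wait.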

4. DaemonSet Implications

DaemonSets run one pod per node. During node maintenance:

  • --ignore-daemonsets skips them during drain (they will be terminated when the node shuts down)
  • DaemonSet pods automatically re-create when the node comes back
  • If a DaemonSet uses hostPath volumes, data persists across pod restarts on the same node
# Check which DaemonSet pods are running on the target node
# (grepping for "daemon" misses most of them — pod names like kube-proxy
# don't contain the word; filter on the owning controller kind instead)
kubectl get pods -A --field-selector spec.nodeName=worker-03 -o json | \
  jq -r '.items[] | select(.metadata.ownerReferences[]?.kind == "DaemonSet") | .metadata.name'

# Common DaemonSets you will see
# - kube-proxy
# - calico-node / cilium
# - fluent-bit / fluentd (logging)
# - node-exporter (monitoring)

Gotcha: A PDB with minAvailable equal to the total replica count (e.g., minAvailable: 3 on a 3-replica Deployment) will block every drain indefinitely — there is no headroom for eviction. This is one of the most common "drain is stuck" root causes. Always set minAvailable to at least one less than the replica count, or use maxUnavailable: 1 instead.
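The headroom rule in the gotcha is plain arithmetic: a voluntary eviction can only proceed when replicas minus minAvailable is at least 1. A toy check, with hypothetical names:

```shell
# Toy check for the gotcha above: drain headroom = replicas - minAvailable.
# Zero headroom means every voluntary eviction of these pods will block.
pdb_headroom() {
    local replicas=$1 min_available=$2
    echo $(( replicas - min_available ))
}

if [ "$(pdb_headroom 3 3)" -eq 0 ]; then
    echo "WARNING: minAvailable equals replica count; drains will hang"
fi
```

The same arithmetic is worth running mentally on every PDB before a maintenance window.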

War story: A team ran a rolling OS upgrade script across 20 nodes without checking PDB headroom first. The script cordoned and drained 3 nodes simultaneously, but a critical service had maxUnavailable: 1. The second and third drains hung, the script stalled, and three nodes were cordoned but not maintained — reducing cluster capacity by 15% for hours until someone noticed.

5. Node Upgrades (kubelet + OS)

# On the node (after cordon + drain):

# Update kubelet and kubectl (unhold first if the packages are pinned)
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
apt-mark hold kubelet kubectl
# For kubeadm clusters, run kubeadm upgrade node before restarting the kubelet

# Restart kubelet
systemctl daemon-reload
systemctl restart kubelet

# Verify kubelet is running
systemctl status kubelet
journalctl -u kubelet --no-pager -n 50

# Back on the control plane — uncordon
kubectl uncordon worker-03
kubectl get nodes

For OS kernel upgrades:

# On the node
apt-get update && apt-get upgrade -y
# If kernel was updated:
reboot

# After reboot — verify node rejoins cluster
kubectl get nodes -w
# Then uncordon
kubectl uncordon worker-03

6. etcd Member Removal (Control Plane Maintenance)

For control plane nodes running etcd:

# List etcd members
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Remove a member (use the member ID from the list)
ETCDCTL_API=3 etcdctl member remove <MEMBER_ID> \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify cluster health after removal
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Never remove more than one etcd member at a time. A 3-member cluster can tolerate one failure. Removing two members loses quorum.
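The quorum rule generalizes: an n-member etcd cluster needs floor(n/2) + 1 members for quorum, so it tolerates n minus quorum failures. A quick sketch of the arithmetic:

```shell
# Quorum for an n-member etcd cluster is floor(n/2) + 1, so fault
# tolerance is n - quorum. Growing from 3 to 4 members adds no
# tolerance, which is why etcd clusters are run at odd sizes.
etcd_fault_tolerance() {
    local n=$1
    local quorum=$(( n / 2 + 1 ))
    echo $(( n - quorum ))
}

etcd_fault_tolerance 3   # 3-member cluster: tolerates 1 failure
etcd_fault_tolerance 5   # 5-member cluster: tolerates 2 failures
```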

7. Rolling Node Maintenance

For maintaining multiple nodes safely:

#!/usr/bin/env bash
set -euo pipefail

NODES=$(kubectl get nodes -l role=worker -o jsonpath='{.items[*].metadata.name}')

for node in ${NODES}; do
    echo "=== Maintaining ${node} ==="

    # Check PDB headroom before starting
    kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.name}: allowed={.status.disruptionsAllowed}{"\n"}{end}'

    kubectl cordon "${node}"
    kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data --timeout=300s

    # Perform maintenance via SSH; the reboot drops the connection, so the
    # non-zero ssh exit status must not abort the script under set -e
    ssh "${node}" 'apt-get update && apt-get upgrade -y && reboot' || true

    # Wait for node to come back
    echo "Waiting for ${node} to rejoin..."
    until kubectl get node "${node}" | grep -q " Ready"; do
        sleep 10
    done

    kubectl uncordon "${node}"
    echo "=== ${node} complete ==="

    # Wait for pods to reschedule before moving to next node
    sleep 60
done

8. Troubleshooting Stuck Drains

# Find pods blocking the drain
kubectl get pods -A --field-selector spec.nodeName=worker-03

# Check for pods without controllers (standalone pods)
kubectl get pods -A --field-selector spec.nodeName=worker-03 -o json | \
  jq '.items[] | select(.metadata.ownerReferences == null) | .metadata.name'

# Check PDB status — is the drain blocked by a budget?
kubectl get pdb -A

# Force-delete a stuck pod (last resort)
kubectl delete pod stuck-pod -n production --grace-period=0 --force

# Check for finalizers blocking pod deletion
kubectl get pod stuck-pod -n production -o jsonpath='{.metadata.finalizers}'
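If the finalizer's controller is gone and the pod will never clean itself up, the finalizers can be cleared with a merge patch. This skips whatever cleanup logic those finalizers guarded, so treat it as a last resort. clear_finalizers is a hypothetical wrapper; the kubectl patch invocation inside it is standard syntax.

```shell
# Last resort: clear a stuck pod's finalizers with a merge patch.
# This skips the cleanup those finalizers guarded — only do this when
# the controller that set them no longer exists.
# clear_finalizers is a hypothetical wrapper around standard kubectl patch.
clear_finalizers() {
    local pod=$1 ns=$2
    kubectl patch pod "$pod" -n "$ns" --type=merge \
        -p '{"metadata":{"finalizers":null}}'
}

# Example: clear_finalizers stuck-pod production
```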

One-liner: kubectl drain worker-03 --dry-run=client --ignore-daemonsets --delete-emptydir-data — always dry-run first. It lists exactly which pods would be evicted and which would block the drain, without actually evicting anything. This 5-second check prevents hours of stuck-drain debugging.

Key Takeaway

Node maintenance follows a predictable lifecycle: cordon, drain, maintain, uncordon. The complexity comes from PDBs, DaemonSets, standalone pods, and stateful workloads. Always dry-run drains first, check PDB headroom, and never rush through multiple nodes without verifying pod rescheduling between each one.
