Portal | Level: L2: Operations | Topics: Node Lifecycle & Maintenance | Domain: Kubernetes

Kubernetes Node Lifecycle - Primer

Why This Matters

Nodes are where your pods actually run. When a node fails, gets patched, or needs an OS upgrade, every pod on it is affected. The Kubernetes model treats nodes as cattle, but workloads expect continuity.

The gap between theory and "this drain has been stuck for 45 minutes" is where this topic lives. Most node incidents come from three areas: nodes going NotReady, drains stuck on PodDisruptionBudgets, and DaemonSets blocking eviction.

Under the hood: The kubelet heartbeats to the API server by renewing a Lease object in the kube-node-lease namespace (default every 10 seconds) and by posting full NodeStatus updates (default every 5 minutes, or immediately when status changes). The node controller applies node-monitor-grace-period (default 40 seconds): if no heartbeat arrives in that window, the node is marked NotReady. Eviction then follows: current clusters taint the node node.kubernetes.io/not-ready:NoExecute and evict pods once their toleration expires (default 300 seconds, the taint-based successor to the legacy pod-eviction-timeout of 5 minutes). These timers directly determine your failover speed.
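These timers add up into a worst-case failover budget. A back-of-the-envelope sketch, assuming the default values quoted above:

```shell
# Rough worst-case time from node failure to pod eviction,
# using the default timer values.
GRACE_PERIOD=40          # node-monitor-grace-period: no heartbeat -> NotReady
EVICTION_DELAY=300       # default not-ready toleration / pod-eviction-timeout (5m)
TOTAL=$((GRACE_PERIOD + EVICTION_DELAY))
echo "worst-case detection-to-eviction: ${TOTAL}s"
```

In other words, a workload on a failed node can sit unrescheduled for well over five minutes with stock settings.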

Core Concepts

1. Node States and Conditions

The kubelet reports conditions via heartbeat:

Condition/Status     Meaning
Ready                Kubelet healthy, accepts pods
NotReady             Kubelet unhealthy or unreachable (Ready condition false or unknown)
SchedulingDisabled   Cordoned, no new pods (appears in the STATUS column; set via spec.unschedulable, not a kubelet condition)
MemoryPressure       Node low on memory
DiskPressure         Node low on disk
kubectl get nodes
kubectl describe node <name> | grep -A5 Conditions

When a node goes NotReady, pods are not evicted immediately: the node controller waits out the eviction delay (default 5m, enforced via the node.kubernetes.io/not-ready:NoExecute taint in current clusters) before evicting pods. During this window, pods may still be running but unreachable: a split-brain risk.
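With taint-based eviction, the window is tunable per pod via tolerationSeconds; the admission controller normally injects these tolerations with a 300-second default. A sketch of shortening it for faster failover (the 30-second value is illustrative, not a recommendation):

```yaml
# Pod spec fragment: evict this pod 30s after its node goes
# NotReady/unreachable instead of the default 300s.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```

Shorter windows mean faster failover but more churn during transient network blips; tune with care.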

2. Kubelet Registration

On startup, the kubelet registers the node with the API server (name, resources, labels, taints). If registration fails, the node never appears.

Debug clue: If a new node never shows up in kubectl get nodes, check the kubelet logs first: journalctl -u kubelet -f. The three most common registration failures are: (1) the kubelet cannot reach the API server (firewall, wrong API endpoint in the kubelet config), (2) TLS certificate issues (expired bootstrap token, clock skew causing certificate validation failure), and (3) hostname collision (two nodes registering with the same name; the second one fails silently).

systemctl status kubelet
journalctl -u kubelet -f

3. Taints and Tolerations

Analogy: Think of taints as a "No Trespassing" sign on a node, and tolerations as a permission slip that lets specific pods ignore the sign. The effect (NoSchedule, PreferNoSchedule, NoExecute) determines how aggressively the sign is enforced — from "please avoid" to "get out now."

Taints on nodes repel pods. Tolerations on pods let them schedule on tainted nodes.

kubectl taint nodes node1 maintenance=true:NoSchedule    # add the taint
kubectl taint nodes node1 maintenance=true:NoSchedule-   # trailing "-" removes it
Effect             New Pods   Existing Pods
NoSchedule         Blocked    Unaffected
PreferNoSchedule   Avoided    Unaffected
NoExecute          Blocked    Evicted
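The matching "permission slip" is a toleration in the pod spec; a pod carrying it can still schedule onto the tainted node. A minimal fragment, assuming the maintenance=true:NoSchedule taint from the example above:

```yaml
# Pod spec fragment: tolerate the maintenance taint set above
tolerations:
- key: maintenance
  operator: Equal
  value: "true"
  effect: NoSchedule
```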

4. Cordoning and Draining

Cordoning stops new pods. Draining evicts existing pods.

Remember: The drain workflow mnemonic: "CDC": Cordon (stop new pods), Drain (evict existing pods), unCordon (allow pods again). In practice kubectl drain cordons the node automatically as its first step, but cordoning explicitly lets you stop new scheduling before you are ready to evict.

kubectl cordon node1
kubectl drain node1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s
kubectl uncordon node1
Flag                     Purpose
--ignore-daemonsets      Skip DaemonSet pods
--delete-emptydir-data   Allow emptyDir deletion
--force                  Delete unmanaged pods
--timeout                Abort if drain takes too long
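In automation, the cordon/drain sequence is worth wrapping so a failed or timed-out drain does not leave the node half-maintained. A minimal bash sketch (drain_node is a hypothetical helper, not a kubectl subcommand; it assumes kubectl is on PATH):

```shell
# Hypothetical helper: cordon, then drain with a hard timeout;
# uncordon again if the drain fails so the node stays usable.
drain_node() {
  local node="$1"
  kubectl cordon "$node" || return 1
  if ! kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=300s; then
    echo "drain of ${node} failed or timed out; uncordoning" >&2
    kubectl uncordon "$node"
    return 1
  fi
}
```

Whether to uncordon on failure is a policy choice; some teams prefer to leave the node cordoned and page a human instead.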

5. PodDisruptionBudgets (PDBs)

PDBs declare how many pods must remain available during voluntary disruptions (drains).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp

PDBs are the #1 cause of stuck drains. If you have 3 replicas and minAvailable: 3, no pod can ever be evicted. The drain hangs forever.

Gotcha: A PDB with minAvailable: 100% or maxUnavailable: 0 is a foot-gun that blocks all voluntary disruptions, including node upgrades, autoscaler scale-downs, and kubectl drain. Always audit PDBs before starting maintenance: kubectl get pdb -A -o wide and check the "Allowed Disruptions" column. Zero means drain will hang.

kubectl get pdb -A
kubectl describe pdb <name>
# "Allowed Disruptions: 0" means drain will block
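That audit can be scripted as a pre-maintenance gate. A sketch (pdbs_ok is a hypothetical helper; it reads status.disruptionsAllowed, the field behind the "Allowed Disruptions" column):

```shell
# Hypothetical pre-maintenance gate: fail if any PDB in the cluster
# currently allows zero disruptions, since a drain would hang on it.
pdbs_ok() {
  local blocked
  blocked=$(kubectl get pdb -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}' \
    | awk '$2 == 0 {print $1}')
  if [ -n "$blocked" ]; then
    printf 'PDBs that will block drain:\n%s\n' "$blocked" >&2
    return 1
  fi
}
```

Run it at the start of any maintenance script and abort early instead of discovering a stuck drain 45 minutes in.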

6. DaemonSets During Drain

DaemonSets run one pod per node. Drain skips them with --ignore-daemonsets because they would just be recreated on the same node.

7. Node Upgrade Workflow

kubectl cordon node1
kubectl drain node1 \
  --ignore-daemonsets --delete-emptydir-data
# SSH to the node and upgrade the kubelet/kubectl packages, then:
systemctl daemon-reload && systemctl restart kubelet
# From a machine with cluster access, wait for the node to report Ready:
kubectl wait --for=condition=Ready node/node1 --timeout=300s
kubectl uncordon node1

In managed Kubernetes (EKS/GKE/AKS), upgrades often mean replacing the node entirely: drain, terminate, let autoscaler provision a new instance.

8. Node Auto-Repair

War story: A common production surprise: GKE auto-repair replaces a NotReady node by terminating the VM and creating a new one. If the node had local SSDs with ephemeral data (e.g., a caching tier), that data is gone. Auto-repair is a feature, not a backup strategy. Any workload on auto-repaired nodes must tolerate complete node replacement.

Cloud providers detect NotReady nodes and recreate them (GKE automatic, EKS via ASG health checks, AKS automatic). On bare metal, monitor NotReady duration and alert. Node Problem Detector surfaces hardware and kernel issues as node conditions.

What Experienced People Know

  • Always set --timeout on drain commands. A stuck drain with no timeout hangs automation forever.
  • PDBs with minAvailable equal to replica count are a time bomb. Use maxUnavailable: 1 instead.
  • Check PDBs before starting maintenance, not after.
  • Pods with long terminationGracePeriodSeconds can hold drain for up to that duration; when a restrictive PDB serializes evictions, those waits add up.
  • Local storage makes drain refuse unless you pass --delete-emptydir-data or --force.
  • In autoscaling clusters, cordoned nodes still count toward capacity. The autoscaler will not provision replacements until pods are unschedulable.
  • Force-deleting stuck pods should be a last resort. It can cause split brain if the pod still runs.
  • Test your drain procedure in staging with realistic PDBs, pod counts, and grace periods.
