Kubernetes Node Lifecycle — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about Kubernetes node lifecycle management.
A node that stops reporting is not considered dead for 5 minutes¶
When a node stops sending heartbeats, the node controller waits 40 seconds (node-monitor-grace-period) before marking it NotReady. It then waits roughly another 5 minutes before evicting pods: historically this delay came from the pod-eviction-timeout flag, but in modern clusters it is enforced by the default 300-second tolerationSeconds on the NoExecute taints described below. This 5+ minute window means that during a node failure, pods are stuck in limbo: not running on the dead node, not yet rescheduled elsewhere. For stateful workloads, this delay can be catastrophic.
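The relevant knobs live on the kube-controller-manager. A sketch of the defaults expressed as static-pod arguments (the flag names are real; the surrounding manifest is abbreviated):

```yaml
# Fragment of a kube-controller-manager static-pod spec showing the
# default timing knobs (values shown are the upstream defaults):
command:
- kube-controller-manager
- --node-monitor-period=5s          # how often the controller checks node state
- --node-monitor-grace-period=40s   # heartbeat silence tolerated before NotReady
- --pod-eviction-timeout=5m0s       # legacy eviction delay, superseded by taint-based eviction
```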
Node heartbeats moved from NodeStatus updates to Lease objects¶
Originally, kubelet sent heartbeats by updating the full NodeStatus object every 10 seconds, which included capacity, conditions, addresses, and images — a large write to etcd. Kubernetes 1.14 enabled Lease-based heartbeats by default (introduced as alpha in 1.13): a tiny Lease object of a few hundred bytes renewed every 10 seconds, with full NodeStatus updates reduced to every 5 minutes (or on change). This decreased etcd load by roughly 90% in large clusters and was essential for scaling beyond 5,000 nodes.
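This is what one of those heartbeat objects looks like. The node name and timestamp below are illustrative, but the kind, API group, and namespace are the real ones kubelet uses:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-1                     # always matches the node's name
  namespace: kube-node-lease       # dedicated namespace for node heartbeats
spec:
  holderIdentity: node-1
  leaseDurationSeconds: 40         # lines up with node-monitor-grace-period
  renewTime: "2024-01-01T00:00:00.000000Z"   # bumped every ~10s by kubelet
```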
kubectl drain is the polite way to remove a node — and it can still break things¶
kubectl drain cordons the node (prevents new scheduling), then evicts all pods one by one, respecting PodDisruptionBudgets. However, if a PDB prevents eviction (e.g., minAvailable: 1 with only 1 replica), drain hangs indefinitely. If a pod has no controller (a bare pod), drain refuses to evict it unless you pass --force. And --delete-emptydir-data is required for pods using emptyDir volumes, since eviction would lose their data.
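The hang scenario is easy to reproduce. A sketch of a PDB that makes drain wait forever when the matching Deployment has a single replica (names and labels are illustrative):

```yaml
# Eviction can never keep 1 pod available while also removing it,
# so any drain touching this pod will block.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1        # with only 1 replica, no pod can ever be evicted
  selector:
    matchLabels:
      app: my-app        # illustrative label
```

With this in place, `kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data` retries the eviction indefinitely until the deadlock is resolved, for example by scaling the Deployment to 2 replicas.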
Taints and tolerations were inspired by the immune system metaphor¶
The taint/toleration system was designed as an inverted affinity: instead of pods declaring where they want to run, nodes declare what they reject. The biological metaphor is intentional — nodes "infect" themselves with taints, and only pods with matching tolerations can survive. The NoSchedule, PreferNoSchedule, and NoExecute effects mirror increasing severity, from "prefer not to" to "actively evict existing pods."
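A minimal sketch of the pairing, using an invented `dedicated=gpu` taint: the node rejects everything by default, and only pods carrying the matching toleration get through:

```yaml
# Node side: `kubectl taint nodes node-1 dedicated=gpu:NoSchedule`
# produces this entry in the Node spec:
spec:
  taints:
  - key: dedicated
    value: gpu
    effect: NoSchedule
---
# Pod side: the toleration that lets a pod schedule onto that node.
tolerations:
- key: dedicated
  operator: Equal
  value: gpu
  effect: NoSchedule
```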
The NotReady taint is automatically applied and triggers mass eviction¶
When a node becomes NotReady, the node controller automatically adds a node.kubernetes.io/not-ready:NoExecute taint. Pods are evicted once their tolerationSeconds for this taint expires; the DefaultTolerationSeconds admission plugin injects a 300-second toleration into every pod that does not declare its own, which is where the familiar 5-minute eviction delay comes from. This automatic tainting is why pods eventually leave dead nodes — it is not a separate eviction mechanism but the taint system doing double duty.
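Concretely, these are the tolerations the DefaultTolerationSeconds admission plugin injects into pods that don't specify their own:

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # the source of the 5-minute eviction delay
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```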
Node graceful shutdown was added surprisingly late — Kubernetes 1.21¶
Until Graceful Node Shutdown arrived (alpha in Kubernetes 1.20, beta and on by default in 1.21, April 2021), when a node was shut down (for maintenance, scaling down, etc.), kubelet simply died and pods were ungracefully terminated. With the feature enabled, kubelet takes a systemd inhibitor lock, intercepts the shutdown signal, terminates regular pods before system-critical ones, respects terminationGracePeriodSeconds up to the configured shutdown budget, and runs preStop hooks. Because it relies on systemd inhibitor locks, it only works on Linux systems running systemd.
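The feature is configured in the kubelet configuration file. A sketch with illustrative durations (the field names are the real v1beta1 ones):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time kubelet delays node shutdown to terminate pods:
shutdownGracePeriod: 60s
# Portion of that budget reserved for system-critical pods,
# which are terminated last:
shutdownGracePeriodCriticalPods: 20s
```

With these values, regular pods get up to 40 seconds to exit before the final 20 seconds are spent terminating critical pods.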
Node auto-repair in managed Kubernetes replaces unhealthy nodes automatically¶
GKE, EKS, and AKS all offer node auto-repair features that detect unresponsive or failing nodes and automatically replace them. GKE's implementation checks for NotReady status persisting for approximately 10 minutes, then deletes the node VM and creates a replacement. This sounds reliable until you realize that if the underlying problem is infrastructure-wide (a bad AMI, a subnet exhaustion), auto-repair creates an infinite loop of node replacement attempts.
PodDisruptionBudgets protect availability during voluntary disruptions¶
PDBs specify the minimum number (or percentage) of pods that must remain available during voluntary disruptions like node drains, cluster upgrades, and autoscaler scale-downs. They do not protect against involuntary disruptions (node crashes, OOM kills). A common mistake: setting minAvailable: 100% — this prevents any voluntary disruption at all, blocking node upgrades and drains indefinitely.
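A sketch of a PDB that protects availability without blocking maintenance, assuming a Deployment labeled `app: web` with several replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 25%    # always leaves drains room to make progress
  selector:
    matchLabels:
      app: web           # illustrative label
```

Percentage-based maxUnavailable scales with the replica count and, unlike minAvailable: 100%, always permits some voluntary disruption.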
Cluster Autoscaler waits 10 minutes before removing an underutilized node¶
The Cluster Autoscaler's scale-down logic considers a node underutilized if its CPU and memory requests sum to less than 50% of capacity. It then waits 10 minutes (--scale-down-unneeded-time) to confirm the node remains underutilized before removing it. This conservative approach prevents thrashing but means that after a traffic spike, you may pay for unnecessary nodes for 10+ minutes. Some teams reduce this to 2-3 minutes for cost optimization.
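These thresholds are ordinary flags on the cluster-autoscaler container. A sketch of the cost-optimized tuning mentioned above (flag names are real; values are illustrative):

```yaml
# cluster-autoscaler Deployment fragment:
command:
- ./cluster-autoscaler
- --scale-down-utilization-threshold=0.5   # "underutilized" cutoff (the default)
- --scale-down-unneeded-time=3m            # down from the 10m default
- --scale-down-delay-after-add=5m          # cooldown after a scale-up
```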
Preemptible/spot nodes save 60-80% but introduce chaos¶
Using preemptible (GCP) or spot (AWS) instances for Kubernetes nodes saves 60-80% on compute costs. The tradeoff: the cloud provider can reclaim these nodes with little warning — AWS Spot gives a 2-minute interruption notice, while GCP preemptible and Spot VMs get roughly 30 seconds. Running stateless workloads on spot nodes with proper PDBs, pod anti-affinity, and graceful shutdown handling is a well-understood pattern. Running stateful workloads on spot nodes is widely considered a bad idea.
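One common way to keep stateful workloads off spot capacity is to taint the spot node pool and let only tolerant, stateless pods land there. A sketch assuming a custom `spot=true:NoSchedule` taint — label and taint names vary by provider and are illustrative here:

```yaml
# Pod spec fragment for a stateless workload that opts in to spot nodes:
nodeSelector:
  lifecycle: spot          # illustrative label applied to the spot node pool
tolerations:
- key: spot                # matches the custom taint on the pool
  operator: Equal
  value: "true"
  effect: NoSchedule
```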
Node-level problems are detected by Node Problem Detector, not kubelet¶
Kubelet monitors container health, but host-level issues (kernel deadlocks, corrupted filesystems, broken container runtimes, NTP drift) are detected by Node Problem Detector (NPD), a separate DaemonSet. NPD writes conditions and events to the Node object, which other components (like Cluster Autoscaler) can act on. Without NPD, a node with a degraded disk or kernel panic loop may appear "Ready" to Kubernetes while silently corrupting workloads.
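When NPD fires, it surfaces as an extra condition on the Node object alongside the built-in Ready condition. A sketch with illustrative values — KernelDeadlock is a real condition type from NPD's default kernel monitor:

```yaml
# Fragment of a Node's status after NPD detects a kernel problem:
status:
  conditions:
  - type: KernelDeadlock
    status: "True"
    reason: DockerHung
    message: "task docker:20744 blocked for more than 120 seconds"
    lastHeartbeatTime: "2024-01-01T00:00:00Z"
    lastTransitionTime: "2024-01-01T00:00:00Z"
```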