Kubernetes Node Lifecycle: From Provision to Decommission
- lesson
- bare-metal-provisioning
- kubelet-internals
- node-conditions
- cordon/drain
- poddisruptionbudgets
- certificate-rotation
- version-skew
- node-resource-management
- hardware-lifecycle
Topics: bare-metal provisioning, kubelet internals, node conditions, cordon/drain, PodDisruptionBudgets, certificate rotation, version skew, node resource management, hardware lifecycle
Level: L2-L3 (Operations -> Advanced Operations)
Time: 90-120 minutes
Prerequisites: None (but you'll get more out of it if you've touched kubectl before)
The Mission¶
It's Tuesday morning. Nagios fires: server k8s-worker-0347 in rack R14 has a failing DIMM.
The iDRAC SEL confirms correctable ECC errors ticking up. Dell OpenManage predicts failure
within 72 hours. The server is a Kubernetes worker node running production pods for three
different business units.
Your job: replace the DIMM with zero service impact. No dropped requests. No pager alerts downstream. The maintenance window is tonight.
This is not a cloud problem. You cannot terminate this instance and let an autoscaler spin up another. You are walking to a rack with an antistatic wristband. And before you touch hardware, Kubernetes needs to agree that this node is empty.
This lesson traces the entire lifecycle of a bare-metal Kubernetes node — from the moment it PXE boots to the moment you pull it from the rack. Along the way, you'll learn what the kubelet actually does every 10 seconds, why drains get stuck, what certificate rotation failure looks like at 3am, and how to handle the version skew matrix without a spreadsheet.
Part 1: How a Node Joins the Cluster¶
Before we can take a node out, let's understand how it got in.
The Bootstrap Sequence on Bare Metal¶
When you rack a new server and set it to PXE boot, the journey to "Ready" looks like this:
Power on
-> BIOS/UEFI POST (memory test, PCIe enumeration, RAID init)
-> NIC PXE ROM sends DHCP DISCOVER
-> DHCP responds: IP + next-server + filename (iPXE bootloader)
-> iPXE chainloads kernel + initrd via HTTP
-> Kickstart/Autoinstall runs: partition, install OS, post-scripts
-> Reboot into production OS
-> Ansible/cloud-init configures: kubelet, containerd, certificates
-> kubelet starts, registers with API server
-> Node appears as NotReady (CNI not configured yet)
-> CNI plugin (Calico, Cilium, Flannel) initializes
-> Node transitions to Ready
-> Scheduler starts placing pods
That entire sequence — from power-on to accepting pods — takes 8-15 minutes on modern hardware. Most of that time is BIOS POST and OS install. The Kubernetes part takes under 60 seconds.
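The DHCP/PXE handoff in the middle of that sequence can be driven by something as small as a dnsmasq fragment. The following is a sketch only, assuming a dedicated provisioning VLAN at 10.0.50.0/24 and an iPXE bootloader named undionly.kpxe (both illustrative, not from this lesson's environment):

```ini
# Sketch of a dnsmasq proxy-DHCP + TFTP config for the PXE step.
dhcp-range=10.0.50.0,proxy   # answer PXE requests without owning the IP range
dhcp-boot=undionly.kpxe      # hand out the iPXE bootloader filename
enable-tftp
tftp-root=/var/lib/tftpboot  # undionly.kpxe lives here
```

Running dnsmasq in proxy mode lets an existing DHCP server keep handing out addresses while dnsmasq supplies only the boot options.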
Under the Hood: When the kubelet starts, it creates a Node object in the API server containing: hostname, capacity (CPU cores, memory, max pods), labels (from
--node-labels or kubelet config), taints (from --register-with-taints), and system info (OS, kernel version, container runtime version). The API server stores this in etcd. The kubelet then enters its main sync loop.
What the Kubelet Does Every 10 Seconds¶
The kubelet is the agent that makes a machine a Kubernetes node. Here is what it does on every sync loop iteration:
- Heartbeat: Updates its Lease object (tiny, ~300 bytes) in the kube-node-lease namespace. A full NodeStatus update happens every 5 minutes (or on condition change).
- Pod sync: Compares desired pod state (from the API server) against actual containers running on this node. Starts missing containers, kills extras.
- Probe execution: Runs liveness, readiness, and startup probes for every container that defines them.
- Status reporting: Reports pod status back to API server (Running, Failed, etc.).
- Resource monitoring: Checks memory, disk, and PID usage against eviction thresholds.
- Image garbage collection: Cleans up unused container images when disk usage exceeds threshold (default 85%).
- Container garbage collection: Removes dead containers.
Trivia: Kubelet heartbeats originally sent the full NodeStatus object every 10 seconds — all conditions, addresses, capacity, and image lists. For a 5,000-node cluster, that was 5,000 large etcd writes every 10 seconds. Kubernetes 1.14 introduced Lease-based heartbeats: a 300-byte Lease update every 10 seconds, with full NodeStatus reduced to every 5 minutes. This cut etcd load by roughly 90% and was essential for scaling Kubernetes beyond 5,000 nodes.
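You can inspect the heartbeat object directly. An abridged Lease looks roughly like this (node name and timestamp illustrative):

```yaml
# kubectl get lease k8s-worker-0347 -n kube-node-lease -o yaml (abridged)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: k8s-worker-0347          # one Lease per node, named after the node
  namespace: kube-node-lease
spec:
  holderIdentity: k8s-worker-0347
  leaseDurationSeconds: 40       # node controller tolerance window
  renewTime: "2026-03-24T09:15:04.000000Z"   # bumped on every heartbeat
```

If renewTime stops advancing, the node controller's grace period starts ticking.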
Static Pods: The Kubelet's Secret Feature¶
The kubelet can run pods from local manifest files without the API server. Drop a YAML file
in /etc/kubernetes/manifests/ and the kubelet runs it. This is how the control plane
bootstraps itself — kube-apiserver, etcd, kube-scheduler, and
kube-controller-manager are all static pods on control plane nodes.
# See static pods on a control plane node
ls /etc/kubernetes/manifests/
# etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
# Static pods show up with the node name suffix
kubectl get pods -n kube-system | grep worker-0347
# kube-proxy-x7k2j (DaemonSet, not static)
On worker nodes, you typically don't have static pods — but some teams use them for node-local monitoring agents or custom CNI configurations.
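A minimal static pod manifest for such a hypothetical node-local agent (name, image, and resources are all placeholders) would be dropped straight into the manifests directory:

```yaml
# /etc/kubernetes/manifests/node-agent.yaml (hypothetical example)
apiVersion: v1
kind: Pod
metadata:
  name: node-agent
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: agent
    image: registry.example.com/node-agent:1.2.3
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
```

The kubelet watches the directory; saving the file starts the pod, deleting it stops the pod. No API server round-trip is involved.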
Part 2: Node Conditions and What "Ready" Actually Means¶
The Five Conditions¶
| Condition | Healthy Value | What It Means |
|---|---|---|
| Ready | True | Kubelet is healthy, accepting pods |
| MemoryPressure | False | Enough memory available |
| DiskPressure | False | Enough disk space available |
| PIDPressure | False | Enough process IDs available |
| NetworkUnavailable | False | CNI plugin has configured networking |
Notice the inversion: for pressure conditions, False is healthy. Ready=True is
good; everything else should be False. This trips people up in monitoring rules.
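A pair of Prometheus alert rules that respect the inversion might look like this (a sketch, assuming kube-state-metrics is scraped; rule names are placeholders):

```yaml
groups:
- name: node-conditions
  rules:
  # Ready must be true; alert when it is not.
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
  # Pressure conditions must be false; alert when any is true.
  - alert: NodeUnderPressure
    expr: kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"} == 1
    for: 5m
```

Note the two rules test opposite values of the same metric; copy-pasting one into the other without flipping the comparison is exactly the monitoring mistake the inversion invites.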
When a Node Goes NotReady¶
The node controller in the control plane watches heartbeats. Here is the timeline:
T+0s: Kubelet stops sending heartbeats (crash, network partition, power loss)
T+40s: node-monitor-grace-period expires. Node marked NotReady.
Automatic taint applied: node.kubernetes.io/not-ready:NoExecute
T+5m: Pods without a toleration for not-ready:NoExecute are evicted.
(Default tolerationSeconds is 300 — that's where the 5 minutes comes from)
T+5m+: Evicted pods rescheduled on healthy nodes.
For your 1,500-node fleet, this means a dead node costs you 5 minutes of degraded capacity
for the pods that lived on it. You can tune tolerationSeconds per-deployment for
workloads that need faster failover.
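The per-deployment tuning is a toleration in the pod template. A sketch (the 30-second value is illustrative):

```yaml
# In the pod template of a workload that needs faster failover:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30   # evict after 30s instead of the default 300s
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```

Kubernetes injects both tolerations with 300s automatically; declaring them explicitly overrides the default.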
Gotcha: During that 5-minute window, pods are in limbo. They might still be running on the dead node (if it is a network partition, not a power failure), but the control plane thinks they are unreachable. For StatefulSets, this is especially dangerous — Kubernetes will not create a replacement pod until the old one is confirmed terminated, which requires the kubelet to report back. Network partition + StatefulSet = stuck pods until you manually intervene.
Flashcard Check #1¶
| Question | Answer (cover this column) |
|---|---|
| How often does the kubelet send a Lease heartbeat? | Every 10 seconds by default (full NodeStatus reporting is governed separately by nodeStatusUpdateFrequency) |
| After heartbeats stop, how long before a node is marked NotReady? | 40 seconds (node-monitor-grace-period) |
| After NotReady, how long before pods are evicted by default? | 5 minutes (300 second tolerationSeconds on the not-ready taint) |
| What is a static pod? | A pod managed directly by the kubelet from a local manifest file, without going through the API server |
| Which node condition value is "healthy" for MemoryPressure? | False |
Part 3: Allocatable vs Capacity — Where Your Resources Actually Go¶
On bare metal, you care about this. A lot. Your nodes have 256GB of RAM and 64 cores, but Kubernetes does not give all of that to pods.
Where did 2 CPUs and 10GB of memory go?
Total Capacity (what the hardware has)
- kube-reserved (reserved for kubelet, container runtime)
- system-reserved (reserved for OS processes: sshd, journald, etc.)
- eviction-threshold (buffer before kubelet starts evicting pods)
= Allocatable (what the scheduler can give to pods)
These reservations are set in the kubelet config:
# /var/lib/kubelet/config.yaml (relevant section)
kubeReserved:
cpu: "1"
memory: "2Gi"
systemReserved:
cpu: "1"
memory: "2Gi"
evictionHard:
memory.available: "500Mi"
nodefs.available: "10%"
imagefs.available: "15%"
Mental Model: Think of it like apartment square footage. The listing says 1,200 sq ft (Capacity), but after you subtract walls, closets, and the HVAC system, you have 1,050 sq ft of livable space (Allocatable). The scheduler only looks at Allocatable. If you do not configure kube-reserved and system-reserved, the kubelet and OS processes compete with your pods for the same resources — and the OOM killer does not care about your PodDisruptionBudgets.
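The arithmetic is worth doing once by hand. A back-of-envelope sketch for the memory side of the 256Gi node above, mirroring the kubeReserved/systemReserved/evictionHard values in the config:

```shell
# Allocatable memory = Capacity - kubeReserved - systemReserved - evictionHard
capacity_mi=$((256 * 1024))        # 262144 Mi total
kube_reserved_mi=$((2 * 1024))     # kubeReserved memory: 2Gi
system_reserved_mi=$((2 * 1024))   # systemReserved memory: 2Gi
eviction_mi=500                    # evictionHard memory.available: 500Mi
allocatable_mi=$((capacity_mi - kube_reserved_mi - system_reserved_mi - eviction_mi))
echo "Allocatable memory: ${allocatable_mi}Mi"   # prints: Allocatable memory: 257548Mi
```

That is roughly 251.5Gi visible to the scheduler, which matches what `kubectl describe node` reports under Allocatable.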
Real Bare-Metal Sizing¶
For a 64-core / 256GB worker node running Kubernetes at scale:
| Reservation | CPU | Memory | Why |
|---|---|---|---|
| kube-reserved | 1-2 cores | 2-4 Gi | kubelet, containerd |
| system-reserved | 1-2 cores | 2-4 Gi | sshd, journald, node_exporter, monitoring agents |
| eviction-threshold | -- | 500Mi-1Gi | Buffer before hard eviction |
| Allocatable | 60-62 cores | 248-252 Gi | What pods actually get |
The exact numbers depend on your workload density. If you run 100+ pods per node, bump kube-reserved higher — the kubelet uses more CPU proportional to pod count.
Part 4: The Mission Begins — Preparing for Hardware Maintenance¶
Back to our failing DIMM on k8s-worker-0347. Let's trace the full maintenance workflow.
Step 1: Reconnaissance¶
Before touching anything, understand the blast radius:
# What is running on this node?
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 -o wide
# How many pods? Which namespaces?
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 --no-headers | wc -l
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 --no-headers | \
awk '{print $1}' | sort | uniq -c | sort -rn
# Are there any bare pods (no controller)?
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 -o json | \
jq -r '.items[] | select(.metadata.ownerReferences == null) |
"\(.metadata.namespace)/\(.metadata.name)"'
# What PDBs exist that might block us?
kubectl get pdb -A -o wide
Sample output from the PDB check:
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
production payment-api-pdb 2 N/A 1 45d
production order-svc-pdb N/A 1 1 45d
staging test-app-pdb 1 N/A 0 12d
That staging/test-app-pdb with ALLOWED DISRUPTIONS: 0 is trouble. If any pods
matching that PDB live on our node, the drain will hang.
Remember: Always check PDBs BEFORE starting a drain, not after it gets stuck. The command is kubectl get pdb -A -o wide — the ALLOWED DISRUPTIONS column is the one that matters. Zero means the drain will block.
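You can spot blockers mechanically. A sketch that parses the table output above with awk (sample rows are hardcoded here so the pipeline is self-contained; in practice you would pipe `kubectl get pdb -A` into it):

```shell
# Column 5 of `kubectl get pdb -A` is ALLOWED DISRUPTIONS; flag any zero.
cat <<'EOF' | awk 'NR>1 && $5 == 0 {print $1 "/" $2 " will block drain"}'
NAMESPACE   NAME             MIN-AVAILABLE   MAX-UNAVAILABLE   ALLOWED-DISRUPTIONS   AGE
production  payment-api-pdb  2               N/A               1                     45d
staging     test-app-pdb     1               N/A               0                     12d
EOF
# prints: staging/test-app-pdb will block drain
```

Running this as a pre-flight check in the maintenance runbook turns the two-hour stuck-drain hunt (see the war story below) into a one-line answer.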
Step 2: Check for Trouble Pods¶
# Pods with very long termination grace periods
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 -o json | \
jq -r '.items[] | select(.spec.terminationGracePeriodSeconds > 60) |
"\(.metadata.namespace)/\(.metadata.name): \(.spec.terminationGracePeriodSeconds)s"'
# Pods using local storage (emptyDir)
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 -o json | \
jq -r '.items[] | select(.spec.volumes[]? | .emptyDir) |
"\(.metadata.namespace)/\(.metadata.name)"'
Any pod with a terminationGracePeriodSeconds of 300 means the drain will wait up to 5
minutes for that single pod. If you have 10 such pods, the drain could take 50+ minutes.
Part 5: Cordon vs Drain — And the Difference Matters¶
This distinction has caused more outages than any single Kubernetes concept.
Cordon: "Caution Tape"¶
What happens:
- Adds taint node.kubernetes.io/unschedulable:NoSchedule
- Node status shows Ready,SchedulingDisabled
- Existing pods keep running (this is the critical part)
- New pods will not be scheduled here
What does NOT happen:
- Pods are not evicted
- Pods are not notified
- Nothing changes for running workloads
Gotcha: If you cordon a node and then reboot it for maintenance, every pod on that node dies ungracefully. Cordon does not move workloads off. It just stops new ones from arriving. This is the #1 misunderstanding that leads to outages during maintenance.
Drain: "Evacuation Order"¶
What happens, in order:
1. Cordons the node (if not already cordoned)
2. Finds all evictable pods on the node
3. For each pod, sends an eviction request to the API server
4. The API server checks PodDisruptionBudgets before approving each eviction
5. If the PDB allows it, the pod receives SIGTERM
6. The pod has terminationGracePeriodSeconds to shut down gracefully
7. After grace period, SIGKILL
8. Deployment/StatefulSet controller creates replacement pod on another node
9. Repeat for every pod
DaemonSet pods are skipped (that is what --ignore-daemonsets acknowledges — they are
supposed to run on every node, so "moving" them makes no sense).
The Drain Flag Reference¶
| Flag | What It Does | When You Need It |
|---|---|---|
| --ignore-daemonsets | Skips DaemonSet pods | Always (drain refuses without it) |
| --delete-emptydir-data | Allows evicting pods with emptyDir volumes | Almost always |
| --timeout=Ns | Aborts drain if it takes too long | Always (never run drain without a timeout) |
| --grace-period=Ns | Overrides each pod's terminationGracePeriodSeconds | When pods have excessively long grace periods |
| --force | Evicts bare pods (no controller) | Only when you know the pod won't be rescheduled |
| --disable-eviction | Bypasses PDBs entirely (deletes pods directly) | Emergency only — you are choosing speed over safety |
Under the Hood: kubectl drain uses the Eviction API (POST /api/v1/namespaces/{ns}/pods/{name}/eviction), not DELETE. The Eviction API is what checks PDBs. If you kubectl delete pod directly, PDBs are ignored — that is why --disable-eviction is dangerous. The Eviction API was specifically designed to give PDBs enforcement power.
Flashcard Check #2¶
| Question | Answer (cover this column) |
|---|---|
| What taint does kubectl cordon add? | node.kubernetes.io/unschedulable:NoSchedule |
| Does cordoning evict existing pods? | No. Only prevents new pod scheduling. |
| What API does kubectl drain use to remove pods? | The Eviction API (/eviction endpoint) |
| Why does drain need --ignore-daemonsets? | DaemonSet pods are expected on every node; evicting them is pointless since they would be recreated immediately |
| What happens if you drain without --timeout? | Drain can hang indefinitely if a PDB cannot be satisfied |
Part 6: When Drain Gets Stuck — The 2-Hour PDB War Story¶
War Story: A team at a large financial institution started a routine kernel patching cycle on a Thursday afternoon. The runbook said "drain node, patch, uncordon." The first two nodes drained in 4 minutes each. The third node hung. For two hours. The drain output just sat there: "evicting pod production/payment-gateway-7f8d9c6b4-x2k1p." The on-call engineer tried Ctrl+C and re-ran the drain. Same result. They tried --force. It refused because the pod had a controller.

The root cause: payment-gateway had a PDB with minAvailable: 3 and exactly 3 replicas. All three happened to be scheduled across only two nodes (pod anti-affinity was preferredDuringScheduling, not required). With two nodes already drained, all 3 replicas were on the remaining nodes. Draining a third node would drop a replica below minAvailable: 3. The drain politely refused. For two hours, nobody thought to check kubectl get pdb -A -o wide.

The fix took 30 seconds: kubectl scale deployment payment-gateway -n production --replicas=4. A 4th replica came up on a healthy node. PDB headroom went from 0 to 1. Drain completed immediately. The postmortem added "check PDB headroom" as the first step in the maintenance runbook.
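The headroom arithmetic in that story is simple enough to sanity-check in one line. A simplified sketch (it assumes all replicas are Ready; the real controller computes disruptionsAllowed from healthy pods):

```shell
# disruptionsAllowed ~= healthy replicas - minAvailable
min_available=3
replicas=3
echo "headroom at 3 replicas: $((replicas - min_available))"   # prints 0: drain blocked
replicas=4
echo "headroom at 4 replicas: $((replicas - min_available))"   # prints 1: drain proceeds
```

Zero headroom plus one more drain is exactly the deadlock in the war story; scaling to 4 is the 30-second fix.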
Diagnosing a Stuck Drain¶
The drain is hanging. Here is the systematic ladder:
# 1. What pod is it stuck on?
# The drain output tells you. If you cancelled it, find remaining pods:
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 \
--no-headers | grep -v kube-system
# 2. Is it a PDB issue?
kubectl get pdb -A -o wide | grep "0"
# Look at ALLOWED DISRUPTIONS column. 0 = drain is blocked.
# 3. Is it a pod stuck in Terminating?
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 | grep Terminating
# If so, check finalizers:
kubectl get pod <name> -n <ns> -o jsonpath='{.metadata.finalizers}'
# 4. Is it a bare pod?
kubectl get pod <name> -n <ns> -o jsonpath='{.metadata.ownerReferences}'
# Empty = bare pod. Drain needs --force.
# 5. Is it a long terminationGracePeriodSeconds?
kubectl get pod <name> -n <ns> -o jsonpath='{.spec.terminationGracePeriodSeconds}'
# 600? That pod gets 10 minutes to shut down.
Fix Options (Ordered by Safety)¶
1. Wait. If the PDB is temporarily at 0 because a pod is restarting, headroom will return when it becomes Ready.
2. Scale up the deployment. Create headroom: kubectl scale deployment <name> --replicas=<current+1>. Once the new replica is Ready, the drain proceeds. Scale back down after.
3. Patch the PDB. kubectl patch pdb <name> -n <ns> -p '{"spec":{"minAvailable": <lower-number>}}'. Restore after the drain. Document why.
4. Delete the PDB temporarily. kubectl delete pdb <name> -n <ns>. Drain. Re-apply. This removes all disruption protection.
5. Use --disable-eviction. Bypasses PDBs entirely. Pods are deleted, not evicted. Emergency only.
Part 7: Certificate Rotation — The Silent Killer¶
Your nodes are draining fine. Maintenance is going smoothly. Then three months from now, a node silently goes NotReady. No hardware fault. No memory pressure. No network issue.
The kubelet's client certificate expired.
How Kubelet Certificate Rotation Works¶
When a kubelet first joins the cluster, it authenticates using a bootstrap token and receives a client certificate from the API server. This certificate has an expiration (default: 1 year in kubeadm clusters).
The kubelet is supposed to rotate this certificate automatically:
When rotateCertificates is true, the kubelet generates a new private key, creates a
Certificate Signing Request (CSR), submits it to the API server, and uses the new cert
once approved. This happens automatically — usually.
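In the kubelet config, rotation is controlled by two fields. A sketch with their commonly recommended values (note that serverTLSBootstrap additionally requires something to approve the serving CSRs, manually or via an approver):

```yaml
# /var/lib/kubelet/config.yaml (rotation-related fields)
rotateCertificates: true    # rotate the client certificate via the CSR API
serverTLSBootstrap: true    # also request the serving cert through the API
```

With rotateCertificates: true, the kubelet starts the CSR dance at roughly 70-90% of certificate lifetime, so an expiry should never arrive as a surprise unless approval stalls.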
When Auto-Rotation Fails¶
# Check for pending CSRs (healthy rotation looks like approved CSRs)
kubectl get csr
# NAME AGE SIGNERNAME REQUESTOR CONDITION
# csr-abc12 2m kubernetes.io/kubelet-serving system:node:0347 Approved,Issued
# csr-def34 5s kubernetes.io/kubelet-serving system:node:0347 Pending
If you see CSRs stuck in Pending, rotation has stalled. Common causes:
| Cause | Symptom | Fix |
|---|---|---|
| CSR auto-approver not running | CSRs pile up in Pending | Check kube-controller-manager logs; ensure CSR approval is enabled |
| Clock skew on the node | Certificate appears "not yet valid" or "expired" | Sync NTP: chronyc tracking, fix time, restart kubelet |
| RBAC misconfiguration | kubelet cannot submit CSR | Check ClusterRoleBindings for system:node |
| Kubelet restarted with wrong cert | Auth failure on heartbeat | Check kubelet config points to correct cert paths |
Gotcha: Clock skew is the sneakiest cause. If your bare-metal node's hardware clock drifts even 5 minutes from the API server, TLS certificate validation fails in both directions. The kubelet cannot authenticate to the API server, and it cannot validate the API server's certificate either. The node goes NotReady and kubelet logs show cryptic TLS handshake errors. Always run chrony/NTP on bare metal nodes and monitor for drift.
Checking Certificate Expiration¶
# On the node itself
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# notBefore=Mar 23 00:00:00 2026 GMT
# notAfter=Mar 23 00:00:00 2027 GMT
# From the control plane (kubeadm clusters)
kubeadm certs check-expiration
# Monitor with Prometheus (if using kube-prometheus-stack)
# Alert on: kubelet_certificate_manager_client_expiration_renew_errors > 0
Remember: Set an alert for certificates expiring within 30 days. Monitor the kubelet metric
kubelet_certificate_manager_client_ttl_seconds. If it drops and the CSR is not approved, you have a problem.
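A concrete rule pair for that, sketched as Prometheus alerting config (rule names and thresholds are illustrative; the metric names come from the kubelet itself):

```yaml
groups:
- name: kubelet-certs
  rules:
  # Client cert TTL below 30 days
  - alert: KubeletClientCertExpiringSoon
    expr: kubelet_certificate_manager_client_ttl_seconds < 30 * 24 * 3600
    for: 1h
  # Any renewal errors in the last 15 minutes
  - alert: KubeletCertRenewalErrors
    expr: increase(kubelet_certificate_manager_client_expiration_renew_errors[15m]) > 0
```

The TTL alert catches stalled rotation a month out; the error alert catches it the moment a CSR submission fails.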
Part 8: Node Upgrades and the Version Skew Matrix¶
The Version Skew Policy¶
Kubernetes enforces strict version compatibility between components:
| Component | Allowed Skew from API Server | Example (API server 1.30) |
|---|---|---|
| kubelet | Up to 3 minor versions older, never newer | 1.27, 1.28, 1.29, 1.30 |
| kubectl | +/- 1 minor version | 1.29, 1.30, 1.31 |
| kube-proxy | Same minor as kubelet | Must match kubelet |
| Controller-manager, scheduler | 0 (must match API server) | 1.30 only |
The upgrade order is always:
1. etcd (if upgrading etcd)
2. API server
3. Controller-manager and scheduler
4. kubelet and kube-proxy on each worker (one at a time)
You cannot skip minor versions. To go from 1.28 to 1.30, you must upgrade to 1.29 first. Each hop requires a full control-plane-then-workers cycle.
Trivia: The kubelet skew policy was expanded from n-2 to n-3 in Kubernetes 1.28, which was a major quality-of-life improvement for large fleets. With 1,500 nodes, a rolling upgrade takes time. The wider skew window means you are not racing to finish the worker upgrades before the kubelet version falls out of support. Previously, with only a 2-version window, teams with large bare-metal fleets sometimes had to pause application deployments during upgrade cycles to avoid the skew boundary.
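The kubelet rule reduces to one inequality: 0 <= (API server minor - kubelet minor) <= 3. A sketch with illustrative version numbers:

```shell
# Version-skew sanity check: kubelet may lag the API server by up to 3 minors,
# and may never be ahead of it.
api_minor=30
kubelet_minor=27
skew=$((api_minor - kubelet_minor))
if [ "$skew" -ge 0 ] && [ "$skew" -le 3 ]; then
  echo "skew $skew: within policy"    # prints: skew 3: within policy
else
  echo "skew $skew: VIOLATION"
fi
```

A negative skew (kubelet newer than the API server) is the case that is never allowed, which is why control planes are always upgraded first.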
Bare-Metal Upgrade Workflow¶
On bare metal, you do not replace nodes. You upgrade them in place:
# For each worker node, one at a time:
# 1. Cordon and drain
kubectl cordon k8s-worker-0347
kubectl drain k8s-worker-0347 \
--ignore-daemonsets --delete-emptydir-data --timeout=300s
# 2. SSH to the node and upgrade packages
ssh k8s-worker-0347
# On the node (RHEL/Rocky):
dnf install -y kubeadm-1.30.4 --disableexcludes=kubernetes
kubeadm upgrade node
dnf install -y kubelet-1.30.4 kubectl-1.30.4 --disableexcludes=kubernetes
systemctl daemon-reload
systemctl restart kubelet
# 3. Verify from control plane
kubectl get node k8s-worker-0347
# NAME STATUS VERSION
# k8s-worker-0347 Ready,SchedulingDisabled v1.30.4
# 4. Uncordon
kubectl uncordon k8s-worker-0347
# 5. Wait for stabilization before moving to next node
kubectl wait --for=condition=Ready node/k8s-worker-0347 --timeout=120s
For 1,500 nodes, you automate this with Ansible and do it in batches:
# Canary: upgrade 1 node, monitor for 1 hour
# Wave 1: upgrade 5% of nodes, monitor for 2 hours
# Wave 2: upgrade 20%, monitor for 1 hour
# Wave 3: remaining nodes in batches of 50
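That wave schedule maps directly onto Ansible's serial keyword. A sketch (the host group name and the task body are placeholders; the real play would run the drain/upgrade/uncordon commands shown above):

```yaml
# upgrade-workers.yml (sketch)
- hosts: k8s_workers
  serial:
    - 1        # canary
    - "5%"     # wave 1
    - "20%"    # wave 2
    - 50       # remaining nodes in batches of 50
  tasks:
    - name: Drain, upgrade kubelet, uncordon (placeholder)
      debug:
        msg: "drain -> dnf upgrade -> kubeadm upgrade node -> uncordon"
```

Ansible pauses between serial batches only if a task fails, so the "monitor for N hours" steps still need an explicit pause or an external gate.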
Gotcha: Restarting the kubelet does NOT restart your containers. Running pods continue. The kubelet re-syncs state with the API server on restart. But if the kubelet was down longer than pod-eviction-timeout (default 5 minutes), the control plane may have already evicted those pods and rescheduled them elsewhere. When the kubelet comes back, you briefly get duplicate pods until the kubelet reconciles.
Flashcard Check #3¶
| Question | Answer (cover this column) |
|---|---|
| Can a kubelet be a newer version than the API server? | No. Never. The kubelet must be equal to or older than the API server. |
| What is the maximum kubelet-to-API-server version skew (as of 1.28+)? | 3 minor versions |
| What is the upgrade order? | etcd -> API server -> controller-manager/scheduler -> kubelet (workers) |
| Can you skip minor versions during upgrade? | No. Must upgrade one minor version at a time. |
| Does restarting the kubelet restart running containers? | No. Containers continue running; kubelet re-syncs state on restart. |
Part 9: The Full Drain Sequence — What Happens to Each Pod Type¶
Let's trace what happens when you run kubectl drain on our node, for each type of pod:
Regular Deployment Pod (e.g., payment-api-7f8d9c-x2k1p)¶
1. Drain sends eviction request to API server
2. API server checks PDB for payment-api
-> ALLOWED DISRUPTIONS > 0? Yes -> approve eviction
3. Pod receives SIGTERM
4. PreStop hook runs (if defined) — e.g., deregister from service mesh
5. Application catches SIGTERM, starts graceful shutdown
- Stops accepting new requests
- Finishes in-flight requests
- Closes database connections
6. Readiness probe fails -> endpoint removed from Service
- No new traffic routed to this pod
7. terminationGracePeriodSeconds countdown (default 30s)
8. If still running after grace period: SIGKILL
9. Container stops. Pod status -> Succeeded/Failed
10. Deployment controller notices replica count is below desired
11. Scheduler places new pod on a healthy node
12. New pod starts, passes readiness probe, receives traffic
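Steps 3-7 of that sequence are driven by fields in the pod spec. A sketch of what a well-behaved service sets (image, port, and hook command are illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # step 7: total shutdown budget
  containers:
  - name: payment-api
    image: registry.example.com/payment-api:2.4.1
    lifecycle:
      preStop:                        # step 4: runs before SIGTERM
        exec:
          # e.g. deregister from the mesh, then let the LB notice
          command: ["/bin/sh", "-c", "curl -sf -X POST localhost:9000/drain; sleep 10"]
```

The preStop sleep buys time for endpoint removal (step 6) to propagate before the application stops serving, which is what makes "no dropped requests" achievable.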
DaemonSet Pod (e.g., node-exporter-k7j2m)¶
1. Drain sees it is a DaemonSet pod
2. --ignore-daemonsets flag present? Yes -> skip
3. Pod keeps running untouched
4. (After maintenance, when node uncordons, pod is still there)
StatefulSet Pod (e.g., elasticsearch-data-2)¶
1-8. Same as Deployment pod
9. StatefulSet controller creates replacement with SAME name (elasticsearch-data-2)
on a different node
10. If using PersistentVolumeClaims, the PVC must be satisfied on the new node
- For bare metal with local-path-provisioner: the PV is node-local, so data is LOST
- For network storage (Ceph, NFS): PV reattaches on new node, data preserved
11. Pod starts, reattaches volumes, rejoins cluster
Bare Pod (e.g., debug-pod-jsmith, created with kubectl run)¶
1. Drain sends eviction request
2. Pod has no controller (ownerReferences is empty)
3. Without --force: drain REFUSES to evict. Drain hangs.
4. With --force: pod is evicted and PERMANENTLY GONE.
No controller exists to recreate it.
Mental Model: Think of drain as a polite building evacuation. Deployment pods are tenants with a lease — they will be relocated by the property manager (controller). DaemonSet pods are building staff — they stay. StatefulSet pods are tenants with named parking spots — they get the same spot number at the new building. Bare pods are squatters — evacuation removes them and nobody brings them back.
Part 10: Node Replacement on Bare Metal¶
Back to our mission. The drain completed. The node is empty (except DaemonSet pods).
The Physical Maintenance¶
# On the Kubernetes side, the node shows:
kubectl get node k8s-worker-0347
# NAME STATUS ROLES AGE VERSION
# k8s-worker-0347 Ready,SchedulingDisabled worker 423d v1.29.6
# SSH to the node and shut it down
ssh k8s-worker-0347 "sudo shutdown -h now"
# Or via IPMI/iDRAC if SSH is not responsive:
ipmitool -I lanplus -H 10.0.99.47 -U root -P <pass> chassis power off
Now you physically replace the DIMM. After hardware work is done:
# Power on via IPMI
ipmitool -I lanplus -H 10.0.99.47 -U root -P <pass> chassis power on
# Watch the node come back
kubectl get node k8s-worker-0347 -w
# STATUS goes: NotReady -> Ready,SchedulingDisabled (still cordoned)
Post-Maintenance Validation¶
Before uncordoning, verify the node is healthy:
# Check node conditions
kubectl describe node k8s-worker-0347 | grep -A 5 Conditions
# Check kubelet is running
ssh k8s-worker-0347 "systemctl status kubelet"
# Check container runtime
ssh k8s-worker-0347 "systemctl status containerd"
ssh k8s-worker-0347 "crictl info"
# Verify hardware (after DIMM replacement, confirm memory)
ssh k8s-worker-0347 "free -g"
ssh k8s-worker-0347 "dmidecode -t memory | grep -i 'size\|locator' | head -20"
# Check DIMM health via iDRAC or ipmitool
ipmitool -I lanplus -H 10.0.99.47 -U root -P <pass> sdr list | grep -i mem
Uncordon and Verify¶
kubectl uncordon k8s-worker-0347
# Verify it is scheduling again
kubectl get node k8s-worker-0347
# STATUS: Ready (no SchedulingDisabled)
# Watch pods land on the node over the next few minutes
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 --watch
Pods will not migrate back automatically. The scheduler places new pods on the freshly uncordoned node as new scheduling decisions happen (new deployments, scaling events, pod restarts). If you want to rebalance immediately, use the descheduler.
Part 11: Full Node Replacement (When the Hardware is Dead)¶
Sometimes the DIMM replacement is not enough. The motherboard is dead. The node is gone.
Remove the Dead Node¶
# 1. Force-delete pods that are stuck in Terminating on the dead node
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 -o name | \
xargs -I{} kubectl delete {} --grace-period=0 --force
# 2. Delete the node object
kubectl delete node k8s-worker-0347
Gotcha: If you delete the node object BEFORE cleaning up pods, those pods enter a zombie state — stuck in Terminating forever because the kubelet that would finalize them no longer exists. StatefulSets are hit hardest: the controller will not create a replacement pod until the old one is fully terminated. Always force-delete pods first, then delete the node.
Provision a Replacement¶
For a full replacement on bare metal, the sequence is:
1. Rack new hardware
2. Connect to provisioning VLAN + production VLAN
3. Set PXE boot via IPMI:
ipmitool -I lanplus -H <bmc-ip> -U root -P <pass> chassis bootdev pxe
4. Power on:
ipmitool -I lanplus -H <bmc-ip> -U root -P <pass> chassis power on
5. Server PXE boots -> Kickstart installs OS
6. Post-install Ansible configures kubelet, containerd, certs
7. kubelet starts, registers with API server
8. Apply correct labels:
kubectl label node k8s-worker-0347 \
topology.kubernetes.io/zone=dc1-row3 \
node.kubernetes.io/instance-type=dell-r750 \
team=platform
9. Apply correct taints (if any):
kubectl taint nodes k8s-worker-0347 <taint-key>=<value>:<effect>
10. Verify node is Ready:
kubectl get node k8s-worker-0347
11. Monitor for 30 minutes before declaring success
Gotcha: If you forget labels and taints on the replacement node, pods with nodeSelectors or affinities will not schedule there, and pods that should NOT run there (because of missing taints) will flood it. Automate label/taint application as part of the bootstrap — do not rely on someone remembering.
Part 12: Node Problem Detector and Monitoring¶
Node Problem Detector¶
The kubelet monitors container health but is blind to host-level problems. Node Problem Detector (NPD) fills the gap:
What NPD Detects:
- Kernel deadlocks (from /proc/kmsg)
- Kernel panics
- Corrupt filesystem
- Container runtime hangs (containerd unresponsive)
- NTP out of sync
- Hardware errors (MCE — Machine Check Exceptions)
NPD runs as a DaemonSet and writes conditions/events to the Node object. Other components (monitoring, custom controllers) can act on these conditions.
Without NPD, a node with a degraded disk or NTP drift may appear "Ready" while silently corrupting workloads.
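An NPD-reported condition shows up on the Node object alongside the built-in five. An abridged, illustrative example:

```yaml
# kubectl get node k8s-worker-0347 -o yaml (status.conditions, abridged)
- type: KernelDeadlock          # condition type contributed by NPD
  status: "False"               # False is healthy, same inversion as the built-ins
  reason: KernelHasNoDeadlock
  message: kernel has no deadlock
  lastHeartbeatTime: "2026-03-24T09:20:11Z"
```

Because these are ordinary node conditions, they can be scraped by kube-state-metrics and alerted on with the same rules pattern as Ready/MemoryPressure.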
Bare-Metal Monitoring Stack¶
For 1,500 bare-metal nodes, your monitoring stack needs to cover:
| Layer | Tool | Key Metrics |
|---|---|---|
| Hardware | IPMI sensors, iDRAC, node_exporter | Temperatures, fan speeds, ECC errors, power draw |
| OS | node_exporter | CPU, memory, disk, network, filesystem |
| Kubelet | kubelet metrics (:10250/metrics) | Pod start latency, runtime operations, cert TTL |
| Container runtime | containerd metrics | Image pull latency, container create/start time |
| Node conditions | kube-state-metrics | kube_node_status_condition for alerting on NotReady |
| Network | node_exporter + custom | Bond status, NIC error counters, LLDP neighbor changes |
Real Bare-Metal Concerns¶
Things you deal with on bare metal that cloud abstracts away:
BIOS settings drift. One server has C-states enabled (power saving, adds latency jitter). Another has Turbo Boost disabled. Performance is inconsistent across nodes. Fix: enforce BIOS profiles via racadm/iDRAC Redfish API as part of provisioning.
NIC bonding. Your servers have 2x25G NICs bonded for redundancy. If one link drops,
the bond should handle it transparently. But mode 4 (802.3ad/LACP) requires switch
cooperation. If the switch side is not configured, the bond degrades silently. Monitor bond
status: /sys/class/net/bond0/bonding/slaves, check for degraded bonds in node_exporter.
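A degraded-bond check can be as simple as counting per-slave "MII Status" lines in the bonding status file. The sketch below hardcodes a sample dump (shaped like /proc/net/bonding/bond0, where the first MII Status line belongs to the bond itself and the rest to slaves):

```shell
# Count slaves reporting link-up in a bonding status dump.
cat <<'EOF' | awk '/^MII Status:/ {n++; if (n>1 && $3=="up") up++} END {printf "%d of %d slaves up\n", up, n-1}'
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
Slave Interface: eno1
MII Status: up
Slave Interface: eno2
MII Status: down
EOF
# prints: 1 of 2 slaves up
```

Anything less than "2 of 2" on a dual-NIC bond is exactly the silent degradation described above: traffic still flows, redundancy is gone.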
NUMA topology. On a 2-socket server, memory access latency depends on which socket's
memory you hit. Kubelet has topology-aware allocation (--topology-manager-policy=single-numa-node)
for latency-sensitive workloads. Without it, a pod's threads might be on socket 0 while its
memory is on socket 1 — adding 50-100ns per memory access.
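The flag above also has a config-file form. A KubeletConfiguration fragment (a sketch; values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reject pods whose CPU + memory cannot be satisfied from one NUMA node.
topologyManagerPolicy: single-numa-node
# The topology manager only aligns resources whose managers give it hints;
# pinning CPUs to a socket additionally requires the static CPU manager.
cpuManagerPolicy: static
```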
Firmware updates. BIOS, iDRAC, NIC firmware, RAID controller firmware, drive firmware. Each needs a maintenance window. Some require reboots (drain first). Dell and HPE provide tools for rolling firmware updates, but they still need a Kubernetes-aware wrapper that cordons and drains before rebooting.
Storage drivers. If your nodes use local NVMe for container storage, you need the right kernel driver and filesystem tuning. XFS for general purpose, ext4 if you need specific features. Watch for NVMe wear-level warnings in SMART data.
Flashcard Check #4¶
| Question | Answer (cover this column) |
|---|---|
| What is Node Problem Detector? | A DaemonSet that detects host-level issues (kernel deadlocks, filesystem corruption, NTP drift) and reports them as node conditions |
| What is the difference between Capacity and Allocatable? | Capacity is total hardware resources; Allocatable = Capacity minus kube-reserved, system-reserved, and eviction-threshold |
| What should you do before deleting a dead node object? | Force-delete all pods on that node first, to prevent zombie Terminating pods |
| Why do labels matter when replacing a bare-metal node? | Pods with nodeSelectors or affinity rules will not schedule on a node missing the expected labels |
| What is NUMA-aware scheduling? | Kubelet's topology manager can restrict a pod's CPU and memory to a single NUMA node, reducing cross-socket memory access latency |
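The Capacity/Allocatable relationship from the flashcard reduces to simple subtraction. A sketch with illustrative reservation values (the numbers are assumptions, not defaults):

```python
# Allocatable = Capacity - kubeReserved - systemReserved - evictionHard
def allocatable(capacity_mib, kube_reserved_mib,
                system_reserved_mib, eviction_hard_mib):
    """Memory the scheduler may hand to pods, in MiB."""
    return (capacity_mib - kube_reserved_mib
            - system_reserved_mib - eviction_hard_mib)

cap = 256 * 1024  # a 256 GiB node
print(allocatable(cap, kube_reserved_mib=2048,
                  system_reserved_mib=2048,
                  eviction_hard_mib=1024))  # 257024 MiB usable by pods
```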
Part 13: Node Decommission¶
When a server reaches end-of-life (lease expired, hardware EOL, capacity consolidation):
# 1. Cordon
kubectl cordon k8s-worker-0347
# 2. Drain
kubectl drain k8s-worker-0347 \
--ignore-daemonsets --delete-emptydir-data --timeout=600s
# 3. Verify empty
kubectl get pods -A --field-selector spec.nodeName=k8s-worker-0347 --no-headers | \
grep -v kube-system
# 4. Delete node from Kubernetes
kubectl delete node k8s-worker-0347
# 5. Shut down via IPMI
ipmitool -I lanplus -H 10.0.99.47 -U root -P <pass> chassis power off
# 6. Hardware lifecycle
# - Update CMDB: mark server as decommissioned
# - Remove DNS records
# - Remove DHCP reservations
# - Wipe disks (NIST 800-88 if security-sensitive)
# - Pull server from rack
# - Return to vendor or asset disposal
The Kubernetes side is the easy part. The hardware lifecycle tracking — knowing which servers are where, what state they are in, who owns them — is what makes bare-metal ops at scale genuinely hard.
Exercises¶
Exercise 1: Pre-Drain Audit (5 minutes)¶
Run these commands against a staging cluster. Identify which PDBs would block a drain:
kubectl get pdb -A -o wide
# Which ones show ALLOWED DISRUPTIONS = 0?
# For each, check: what deployment does it protect, and how many replicas exist?
What to look for
If `ALLOWED DISRUPTIONS` is 0, check the deployment it protects: if `minAvailable` equals the replica count (or `maxUnavailable: 0`), no pod can ever be evicted. Scale up the deployment or adjust the PDB before draining.
Exercise 2: Drain Simulation (10 minutes)¶
In a test cluster, create a deployment with a PDB that blocks drain:
kubectl create deployment drain-test --image=nginx --replicas=2
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: drain-test-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: drain-test
EOF
Now try to drain the node one of those pods is on. Watch it hang. Fix it by scaling to 3 replicas.
Commands
# Find which node a pod is on
kubectl get pods -o wide -l app=drain-test
# Drain that node (this will hang)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=30s
# In another terminal, scale up
kubectl scale deployment drain-test --replicas=3
# Drain should complete now. Clean up:
kubectl delete pdb drain-test-pdb
kubectl delete deployment drain-test
kubectl uncordon <node>
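The arithmetic behind why the drain hangs and why scaling to 3 fixes it can be sketched as (the PDB controller computes disruptions allowed from healthy pods and `minAvailable`):

```python
def allowed_disruptions(healthy, min_available):
    """Evictions the PDB permits right now (floored at zero)."""
    return max(0, healthy - min_available)

print(allowed_disruptions(healthy=2, min_available=2))  # 0 -> drain hangs
print(allowed_disruptions(healthy=3, min_available=2))  # 1 -> eviction proceeds
```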
Exercise 3: Certificate Audit (5 minutes)¶
Check the certificate expiration on a node and verify auto-rotation is working:
# On the node:
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# From the control plane:
kubectl get csr --sort-by=.metadata.creationTimestamp | tail -10
# Are CSRs being approved? Any stuck in Pending?
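To turn the openssl output into something alertable, parse the `notAfter` line into days remaining. A sketch, assuming openssl's default date format (the dates below are illustrative):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after_line, now):
    """Days until a cert expires, given openssl's notAfter line."""
    # openssl prints e.g. "notAfter=Mar 12 10:00:00 2026 GMT"
    stamp = not_after_line.split("=", 1)[1]
    expiry = datetime.strptime(stamp, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - now).days

now = datetime(2026, 2, 10, tzinfo=timezone.utc)
print(days_until_expiry("notAfter=Mar 12 10:00:00 2026 GMT", now))  # 30
```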
Cheat Sheet¶
Node Maintenance Workflow¶
1. CHECK PDBs kubectl get pdb -A -o wide
2. RECON kubectl get pods -A --field-selector spec.nodeName=<node>
3. CORDON kubectl cordon <node>
4. DRAIN kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s
5. WORK (patch, replace hardware, upgrade)
6. VERIFY kubectl describe node <node> | grep Conditions
7. UNCORDON kubectl uncordon <node>
Stuck Drain Decision Tree¶
Drain hanging?
-> kubectl get pdb -A -o wide
ALLOWED DISRUPTIONS = 0?
-> Scale up deployment OR patch PDB OR (emergency) delete PDB
Pod in Terminating?
-> Check finalizers, remove if stuck
Bare pod (no controller)?
-> Use --force (pod will not be recreated)
DaemonSet pod?
-> Use --ignore-daemonsets
Key kubelet Config Parameters¶
| Parameter | Default | What It Controls |
|---|---|---|
| `nodeStatusUpdateFrequency` | 10s | How often kubelet sends heartbeats |
| `--max-pods` | 110 | Max pods per node |
| `kubeReserved` | none | CPU/memory reserved for kubelet |
| `systemReserved` | none | CPU/memory reserved for OS |
| `evictionHard` | varies | Thresholds for pod eviction |
| `rotateCertificates` | true | Enable automatic cert rotation |
| `--topology-manager-policy` | none | NUMA-aware scheduling |
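Most of these parameters live in the kubelet's config file rather than on the command line. A KubeletConfiguration sketch with illustrative values (the reservations and thresholds are assumptions, not recommended defaults):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 10s
maxPods: 110
kubeReserved:
  cpu: 500m
  memory: 1Gi
systemReserved:
  cpu: 500m
  memory: 1Gi
evictionHard:
  memory.available: 500Mi
  nodefs.available: 10%
rotateCertificates: true
```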
Version Skew Quick Reference (as of 1.28+)¶
| Component | Relative to API Server |
|---|---|
| kubelet | Same, or up to 3 minor versions older |
| kubectl | +/- 1 minor version |
| kube-proxy | Not newer than the API server, up to 3 minor versions older; keep it matched to the kubelet on each node |
| Upgrade order | etcd -> API server -> controller-manager -> scheduler -> workers |
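The kubelet row of the table can be encoded as a pre-upgrade check. A sketch of the rule (same major, zero to three minor versions behind the API server, never ahead) as of the 1.28+ policy:

```python
def kubelet_skew_ok(apiserver, kubelet):
    """True if a kubelet at `kubelet` may join a cluster whose API
    server runs `apiserver` (versions as "major.minor" strings)."""
    a_major, a_minor = map(int, apiserver.split("."))
    k_major, k_minor = map(int, kubelet.split("."))
    if a_major != k_major:
        return False
    return 0 <= a_minor - k_minor <= 3

print(kubelet_skew_ok("1.29", "1.26"))  # True: 3 minors older is allowed
print(kubelet_skew_ok("1.29", "1.30"))  # False: kubelet newer than API server
```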
Essential Monitoring Queries (PromQL)¶
# Nodes not Ready (pair with `for: 5m` in the alert rule so it must persist)
kube_node_status_condition{condition="Ready",status="true"} == 0
# Kubelet certificate TTL dropping
kubelet_certificate_manager_client_ttl_seconds < 86400 * 30
# Node disk pressure
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
# Pods evicted in last hour
increase(kubelet_evictions{eviction_signal!=""}[1h]) > 0
Takeaways¶
- Check PDBs before draining, not after it hangs. `kubectl get pdb -A -o wide` is the first command in any maintenance workflow.
- Cordon is not drain. Cordon stops new pods; drain moves existing ones. Rebooting a cordoned-but-undrained node kills everything on it.
- Certificate rotation is automatic until it isn't. Monitor kubelet cert TTL and CSR approval. Clock skew on bare metal is the most common silent cause of rotation failure.
- Allocatable is not Capacity. Configure kube-reserved and system-reserved or the OOM killer will make scheduling decisions for you.
- The version skew policy has an order. Control plane first, workers second, never skip minor versions, kubelet can never be newer than the API server.
- Bare metal adds a hardware lifecycle that cloud hides. BIOS settings, NIC bonds, NUMA topology, firmware updates, storage drivers, physical decommission — these are real work that real engineers do, and they all require cordon-drain-work-uncordon.
Related Lessons¶
- What Happens When You kubectl apply — follows the API server -> scheduler -> kubelet -> container path in detail
- PXE Boot: From Network to Running Server — deep dive into the bare-metal provisioning chain
- What Happens When Kubernetes Evicts Your Pod — the eviction side of node pressure and resource management
- Server Hardware: When the Blinky Lights Matter — IPMI, BMC, DIMM topology, and datacenter ops
- etcd: The Database That Runs Kubernetes — where all the node objects, leases, and pod specs live