Kubernetes Node Lifecycle -- Street Ops

1. Node Registration and Kubelet

How a Node Joins the Cluster

  1. Kubelet starts on the machine with the API server endpoint and credentials (bootstrap token or certificate).
  2. Kubelet registers the node with the API server, reporting: hostname, labels, capacity (CPU, memory, pods), allocatable resources.
  3. API server creates a Node object. Node starts in Ready condition if kubelet can communicate.
  4. Scheduler begins placing pods on the node based on resource requests and constraints.
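Once a node has registered, the resulting Node object can be inspected to confirm what the kubelet reported (`<node>` is a placeholder for the node name):

```shell
# Labels, capacity, and allocatable as reported at registration
kubectl get node <node> --show-labels
kubectl get node <node> -o jsonpath='{.status.capacity}'
kubectl get node <node> -o jsonpath='{.status.allocatable}'
```

Allocatable is capacity minus kubelet/system reservations; it is what the scheduler actually budgets against.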

Kubelet Responsibilities

  • Reports node conditions to the API server (heartbeat).
  • Pulls container images and starts/stops containers via the container runtime (containerd, CRI-O).
  • Enforces resource limits (CPU, memory) via cgroups.
  • Runs liveness, readiness, and startup probes.
  • Reports pod status back to the API server.
  • Manages volume mounts.
  • Garbage collects dead containers and unused images.

Kubelet Failure Modes

  • Kubelet crashes/stops: Node goes NotReady after node-monitor-grace-period (default 40s). With taint-based eviction (the default in current Kubernetes), pods are then evicted once their node.kubernetes.io/not-ready toleration expires (default tolerationSeconds: 300, i.e. 5m); the legacy pod-eviction-timeout flag had the same 5m default.
  • Kubelet running but unhealthy: OOM on the node can kill kubelet or make it unresponsive. Disk pressure, PID pressure, or memory pressure trigger node conditions that affect scheduling.
  • Kubelet cannot reach API server: Network partition. Node goes NotReady from the control plane's perspective. Pods keep running locally but no updates are received.
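The heartbeat itself is visible as a Lease object in the kube-node-lease namespace; a stale renewTime is the first sign the kubelet has stopped reporting (`<node>` is a placeholder):

```shell
# Each kubelet renews its Lease roughly every 10s while healthy
kubectl get lease <node> -n kube-node-lease -o jsonpath='{.spec.renewTime}'

# Compare against when the Ready condition last changed
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
```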

Key Kubelet Flags

--node-labels           # Labels applied at registration
--register-with-taints  # Taints applied at registration
--max-pods              # Maximum pods per node (default 110)
--eviction-hard         # Thresholds for evicting pods (memory, disk)
--kube-reserved         # Resources reserved for kubelet/system
--system-reserved       # Resources reserved for OS processes
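Most of these settings now live in the KubeletConfiguration file rather than flags. A minimal sketch (the values here are illustrative, not recommendations):

```yaml
# /var/lib/kubelet/config.yaml (illustrative values)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
kubeReserved:
  cpu: "100m"
  memory: "512Mi"
systemReserved:
  cpu: "100m"
  memory: "512Mi"
```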

2. Node Conditions

The Five Standard Conditions

kubectl get node <name> -o jsonpath='{.status.conditions}' | jq
Condition            Healthy Value   Meaning
Ready                True            Kubelet is healthy and ready to accept pods
MemoryPressure       False           Node has enough memory
DiskPressure         False           Node has enough disk space
PIDPressure          False           Node has enough process IDs
NetworkUnavailable   False           Node network is properly configured

Key: Ready=True is good. All pressure conditions should be False. If any pressure is True, the scheduler stops placing new pods on the node and may evict existing pods.
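To spot pressure at a glance across the cluster, a custom-columns view works (column names here are arbitrary):

```shell
# One row per node: Ready plus the three pressure conditions
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,'\
'READY:.status.conditions[?(@.type=="Ready")].status,'\
'MEM:.status.conditions[?(@.type=="MemoryPressure")].status,'\
'DISK:.status.conditions[?(@.type=="DiskPressure")].status,'\
'PID:.status.conditions[?(@.type=="PIDPressure")].status'
```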

Interpreting NotReady

kubectl describe node <name>
Look at:
  1. Conditions section: Which condition changed? When?
  2. Events section: Recent kubelet events.
  3. Allocatable vs Capacity: Resource pressure indicators.

Common NotReady causes:
  • Kubelet stopped/crashed
  • Container runtime (containerd/CRI-O) stopped
  • Node ran out of memory and OOM killer hit kubelet
  • Network partition between node and control plane
  • Disk full (kubelet cannot function)
  • Clock skew (TLS certificate validation fails)
  • Node literally powered off

NotReady Debug Workflow

1. Can you SSH to the node?
   No -> Check if the node is powered on (cloud console, BMC). Check network.
   Yes -> 2

2. Is kubelet running?
   systemctl status kubelet
   Not running -> Check logs: journalctl -u kubelet -n 100
   Running -> 3

3. Is the container runtime running?
   systemctl status containerd  # or crio
   Not running -> Start it: systemctl start containerd
   Running -> 4

4. Can kubelet reach the API server?
   curl -k https://<api-server>:6443/healthz
   No -> Check network, firewall, DNS. Check if API server is up.
   Yes -> 5

5. Check kubelet logs for errors:
   journalctl -u kubelet -f
   Look for: certificate errors, resource pressure, runtime errors.

6. Check node resources:
   free -m (memory), df -h (disk), ps aux | wc -l (PIDs)
   Any at limit? -> That is the pressure condition causing NotReady.

3. Taints and Tolerations

The Mental Model

Taints are on nodes. They repel pods. Tolerations are on pods. They allow pods to be scheduled on tainted nodes.

Taint Syntax

# Add a taint
kubectl taint nodes <node> key=value:effect

# Effects:
# NoSchedule     - New pods without toleration are not scheduled. Existing pods stay.
# PreferNoSchedule - Scheduler tries to avoid, but will use if necessary.
# NoExecute      - New pods rejected AND existing pods without toleration are evicted.

# Examples:
kubectl taint nodes node1 maintenance=true:NoSchedule
kubectl taint nodes node1 gpu=true:NoSchedule
kubectl taint nodes node1 dedicated=special:NoExecute

# Remove a taint (note the trailing minus)
kubectl taint nodes node1 maintenance=true:NoSchedule-

Built-in Taints

Kubernetes automatically applies taints for node conditions:
  • node.kubernetes.io/not-ready:NoExecute -- Node is NotReady
  • node.kubernetes.io/unreachable:NoExecute -- Node is unreachable
  • node.kubernetes.io/disk-pressure:NoSchedule -- Disk pressure
  • node.kubernetes.io/memory-pressure:NoSchedule -- Memory pressure
  • node.kubernetes.io/pid-pressure:NoSchedule -- PID pressure
  • node.kubernetes.io/unschedulable:NoSchedule -- Node is cordoned
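To see which taints are currently applied across the cluster:

```shell
# Node name and its taint keys, one node per line (blank means untainted)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```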

Toleration in Pod Spec

tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # Stay for 5 minutes, then evict

Operational Use Cases

  • Dedicated nodes: Taint nodes for specific workloads (GPU, high-memory). Only pods with matching tolerations run there.
  • Maintenance window: Taint a node to prevent new scheduling before drain.
  • Problematic node: Taint with NoExecute to evict all pods that do not explicitly tolerate the condition.
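For the dedicated-node case, note that the taint alone only keeps other pods out; pairing the toleration with a nodeSelector also keeps the intended pods in. A sketch (the gpu label and taint names are examples):

```yaml
# Pod spec fragment: runs only on nodes labeled and tainted gpu=true
spec:
  nodeSelector:
    gpu: "true"           # attracts the pod to GPU nodes
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"  # permits scheduling despite the taint
```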

4. Cordoning and Draining

Cordon

kubectl cordon <node>
  • Marks the node as unschedulable (adds the node.kubernetes.io/unschedulable:NoSchedule taint).
  • Existing pods keep running.
  • No new pods will be scheduled on this node.
  • Use before drain to prevent new pods from landing during the drain process.

Uncordon

kubectl uncordon <node>
  • Removes the unschedulable taint.
  • Node is available for scheduling again.

Drain

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
  • Cordons the node (if not already).
  • Evicts all pods (except DaemonSet pods).
  • Respects PodDisruptionBudgets.
  • Waits for pods to terminate.

Drain Flags You Will Actually Use

# Standard drain for maintenance (inline comments cannot follow a
# line-continuation backslash, so the flags are explained below)
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=30 \
  --timeout=300s \
  --force

# --ignore-daemonsets      DaemonSet pods cannot be evicted
# --delete-emptydir-data   Allow eviction of pods with emptyDir volumes
# --grace-period=30        Override pod termination grace period
# --timeout=300s           Give up after 5 minutes
# --force                  Evict pods not managed by a controller (bare pods)

Safe Drain Workflow

1. Cordon the node.
   kubectl cordon <node>
   Verify: kubectl get node <node> -- shows SchedulingDisabled.

2. Check what pods are running.
   kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
   Identify: pods with PDBs, stateful workloads, pods without controllers.

3. Pre-check PDB headroom.
   kubectl get pdb --all-namespaces
   For each PDB: is disruptionsAllowed > 0? If 0, drain will block.

4. Drain.
   kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s

5. Monitor.
   Watch the drain output. If it stalls, check which pod is blocking (see stuck drain section).

6. Verify all pods evacuated.
   kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>
   Only DaemonSet pods should remain.

7. Perform maintenance (upgrade, reboot, etc.).

8. Uncordon.
   kubectl uncordon <node>

9. Verify node is scheduling again.
   kubectl get node <node> -- shows Ready, no SchedulingDisabled.

5. PodDisruptionBudgets (PDBs)

What PDBs Do

PDBs define the minimum number of pods that must remain available during voluntary disruptions (drain, node upgrade, rolling update). They protect applications from losing too many replicas simultaneously.

PDB Spec

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # At least 2 pods must be running
  # OR
  maxUnavailable: 1         # At most 1 pod can be down
  selector:
    matchLabels:
      app: my-app

PDB Gotchas

Gotcha 1: PDB blocks drain when there is no headroom. If minAvailable: 2 and only 2 pods are running, drain cannot evict any of them. disruptionsAllowed is 0. Drain hangs forever (or until timeout).

Fix: Use maxUnavailable: 1 instead of minAvailable. Or ensure the deployment has more replicas than minAvailable.

Gotcha 2: PDB with 100% minAvailable. minAvailable: 100% or minAvailable equal to replica count means NO pod can ever be evicted. Drain is permanently blocked.

Gotcha 3: PDB references wrong selector. The PDB selector does not match any pods. It has no effect -- drain proceeds without disruption protection. Pods get evicted even though you thought they were protected.

Gotcha 4: PDB with a single-replica deployment. maxUnavailable: 1 with 1 replica allows eviction (the pod can be down). But minAvailable: 1 with 1 replica blocks eviction. Choose carefully.

Gotcha 5: PDB applies across nodes. PDB is cluster-wide, not per-node. If you drain two nodes simultaneously, both drains compete for PDB headroom. The second drain may block because the first already consumed the disruption budget.

Checking PDB Status

kubectl get pdb --all-namespaces
Key columns:
  • MIN AVAILABLE / MAX UNAVAILABLE: The budget.
  • ALLOWED DISRUPTIONS: How many more pods can be evicted right now. If 0, drain will block.
  • CURRENT HEALTHY: How many pods the PDB considers healthy now.
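ALLOWED DISRUPTIONS is simple arithmetic over the budget and current health. A bash sketch with hypothetical numbers:

```shell
# Hypothetical PDB: minAvailable=2, with 3 healthy pods right now
min_available=2
current_healthy=3

# disruptionsAllowed = healthy pods minus the floor the PDB guarantees
disruptions_allowed=$(( current_healthy - min_available ))
echo "$disruptions_allowed"   # 1 -> one pod may be evicted; 0 would block drain
```

This is why minAvailable equal to the replica count yields a permanent 0 and a permanently blocked drain.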

6. Dealing with Stuck Drains

Why Drains Get Stuck

  1. PDB with zero disruptions allowed. Most common cause.
  2. Pod with no controller (bare pod). kubectl drain will not evict it without --force.
  3. Pod with a very long terminationGracePeriodSeconds. Drain waits for the full grace period.
  4. Pod stuck in Terminating. Finalizers preventing pod deletion. Container runtime cannot stop the container.
  5. DaemonSet pods without --ignore-daemonsets. Drain refuses to evict DaemonSet pods.
  6. Pods with local storage (emptyDir) without --delete-emptydir-data. Drain refuses by default.

Stuck Drain Debug Workflow

1. Drain is stuck. Which pod is it waiting on?
   Check the drain output -- it tells you which pod it is trying to evict.

2. Is it a PDB issue?
   kubectl get pdb -A
   If ALLOWED DISRUPTIONS = 0 for the blocking PDB -> PDB headroom is exhausted.
   Options:
   a. Wait for unhealthy pods to become healthy (restoring headroom).
   b. Scale up the deployment to create headroom.
   c. As last resort, delete or patch the PDB (kubectl delete pdb <name> -n <ns>).

3. Is the pod stuck in Terminating?
   kubectl get pod <name> -n <ns>
   If Terminating for a long time:
   a. Check finalizers: kubectl get pod <name> -n <ns> -o jsonpath='{.metadata.finalizers}'
   b. Remove finalizers if stuck: kubectl patch pod <name> -n <ns> -p '{"metadata":{"finalizers":null}}' --type=merge
   c. Force delete: kubectl delete pod <name> -n <ns> --force --grace-period=0

4. Is it a bare pod (no controller)?
   kubectl get pod <name> -n <ns> -o jsonpath='{.metadata.ownerReferences}'
   If empty -> bare pod. Use --force flag with drain, or delete the pod manually.

5. Is it a DaemonSet pod?
   Re-run drain with --ignore-daemonsets.

6. Is it a pod with emptyDir?
   Re-run drain with --delete-emptydir-data.

The Nuclear Option

When you absolutely must drain now and the standard process is stuck:

# Delete the PDB temporarily
kubectl delete pdb <name> -n <namespace>

# Drain with force and short timeout
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=10 --timeout=60s

# Re-create the PDB after drain completes
kubectl apply -f pdb.yaml
Warning: This removes application protection. Only use in emergencies with stakeholder awareness.

7. Node Scaling

Manual Scaling

  • Cloud: Add/remove VMs via cloud provider API (Terraform, cloud CLI).
  • Bare metal: Provision new hardware, install OS, join to cluster (kubeadm join).
  • Remove: Drain node, then delete: kubectl delete node <name>.

Cluster Autoscaler

The Cluster Autoscaler watches for:
  • Scale up: Pods pending due to insufficient resources (Unschedulable). It adds nodes.
  • Scale down: Nodes underutilized (below a threshold, default 50% resource utilization) for a period (default 10m). It drains and removes them.

Autoscaler Gotchas

  • PDB blocks scale-down. If draining the underutilized node would violate a PDB, the autoscaler will not remove it.
  • Local storage prevents scale-down. Pods with local PVs or emptyDir with data cannot be evicted by default.
  • DaemonSet-only nodes. A node running only DaemonSet pods looks underutilized but autoscaler knows not to remove it (DaemonSet pods are expected).
  • Pod annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "true" lets the autoscaler evict pods that would otherwise block scale-down.
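The safe-to-evict annotation goes on the pod itself, usually via the controller's pod template. A sketch:

```yaml
# Deployment fragment: allow the autoscaler to evict these pods
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```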

Scale-Down Debug

Why is this node not being scaled down?

1. kubectl describe configmap cluster-autoscaler-status -n kube-system
   Look for: ScaleDown conditions, blocked nodes, reasons.

2. Common reasons:
   - Pod with local storage not marked safe-to-evict.
   - PDB prevents eviction.
   - Node has pods from a non-scaling deployment (kube-system pods).
   - Node utilization is above threshold.
   - Recently scaled up (cooldown period).

8. Node Upgrades

Upgrade Strategy

For each node (one at a time, or in batches):
1. Cordon the node.
2. Drain the node.
3. Upgrade kubelet and kubectl:
   apt-get update && apt-get install -y kubelet=<version> kubectl=<version>
   # OR on RHEL:
   yum install -y kubelet-<version> kubectl-<version>
4. Restart kubelet: systemctl daemon-reload && systemctl restart kubelet.
5. Verify node version: kubectl get node <node>.
6. Uncordon the node.
7. Verify pods are rescheduled and healthy.
8. Move to the next node.
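The per-node loop above can be sketched as a script. The node names, target version, and SSH access pattern are assumptions to adapt, not a turnkey implementation:

```shell
#!/usr/bin/env bash
# Rolling kubelet upgrade, one node at a time (sketch; adapt before use)
set -euo pipefail

VERSION="1.30.2-1.1"                        # hypothetical target package version

for node in worker-1 worker-2 worker-3; do  # hypothetical node names
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s

  # Assumes SSH access and a Debian-based node
  ssh "$node" "sudo apt-get update \
    && sudo apt-get install -y kubelet=${VERSION} \
    && sudo systemctl daemon-reload \
    && sudo systemctl restart kubelet"

  kubectl uncordon "$node"
  # Wait until the node reports Ready again before moving on
  kubectl wait --for=condition=Ready "node/$node" --timeout=300s
done
```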

Upgrade Heuristics

  • Never skip minor versions. Upgrade 1.28 -> 1.29 -> 1.30, not 1.28 -> 1.30.
  • Control plane first, then workers. The control plane must be at least as new as the workers; per the version skew policy, kubelet may lag the API server by up to two minor versions (three from v1.28).
  • Test in staging. Always upgrade a non-production cluster first.
  • Upgrade one node, verify, then batch. Canary approach.
  • Check deprecation notices. APIs removed in the new version can break workloads.

Rollback

If an upgraded node has problems:
  1. Cordon and drain the problematic node.
  2. Downgrade kubelet to the previous version.
  3. Restart kubelet.
  4. Uncordon.
  5. Monitor for issues.

If the upgrade broke container runtime compatibility, you may need to also downgrade containerd/CRI-O.

Kubelet Restart Implications

  • Restarting kubelet does NOT restart containers. Running pods continue.
  • Kubelet re-syncs state with the API server on restart.
  • If kubelet was down long enough (> pod-eviction-timeout), pods may have been evicted and rescheduled elsewhere. Restarting kubelet now may cause duplicates temporarily.
  • Watch kubectl get pods -o wide to check for duplicate pods after kubelet restart on a node that was NotReady for a long time.

9. Decision Trees

"Should I Drain This Node?"

Is it an emergency (node on fire, security breach)?
  Yes -> Drain immediately. Skip PDB if necessary.
  No -> 1

1. Is there a maintenance window?
   Yes -> Drain during window.
   No -> Schedule one.

2. Check PDB headroom.
   All affected PDBs have allowedDisruptions > 0?
   Yes -> Safe to drain.
   No -> Scale up deployments to create headroom, then drain.

3. Check for stateful workloads.
   StatefulSets with local PVs?
   Yes -> Ensure data is replicated or backed up. Drain carefully.
   No -> Standard drain.

4. Check for bare pods.
   Any pods without controllers?
   Yes -> Warn stakeholders. Use --force or delete manually.
   No -> Standard drain.

"Node Is NotReady -- What Now?"

1. Can you reach the node?
   No -> Check cloud console / BMC. Power issue? Network issue?
   Yes -> 2

2. Is kubelet running?
   No -> Check logs: journalctl -u kubelet. Fix and restart.
   Yes -> 3

3. Is the container runtime running?
   No -> Start it. Kubelet depends on it.
   Yes -> 4

4. Can kubelet reach the API server?
   No -> Network issue. Check routing, DNS, firewall.
   Yes -> 5

5. Is the node under resource pressure?
   Memory? -> Identify and kill the memory hog. Or add resources.
   Disk? -> Clean up disk (unused images: crictl rmi --prune, old logs).
   PID? -> Too many processes. Identify the leak.

6. If none of the above: check kubelet logs for errors.
   Certificate expired? -> Renew certificates.
   CNI plugin failed? -> Check network plugin status.

10. Common Operational Patterns

Pattern: Rolling Node Replacement

Replace all nodes in a cluster (e.g., new OS image, new instance type):

for each node:
  1. Add a new node to the cluster.
  2. Cordon the old node.
  3. Drain the old node.
  4. Delete the old node: kubectl delete node <old>.
  5. Decommission the old machine.
This is the "blue/green for nodes" approach. Cluster capacity is maintained throughout.

Pattern: Canary Node Upgrade

1. Upgrade one node.
2. Uncordon it.
3. Watch for 1 hour: pod restarts, error rates, node conditions.
4. If healthy, upgrade the next batch (20% of nodes).
5. Watch for 1 hour.
6. If healthy, upgrade remaining nodes.
7. If problems at any stage, rollback affected nodes.

Pattern: Automated Drain on Node Problem

Node Problem Detector + custom controller that taints/drains nodes with specific problems (e.g., kernel deadlock, NTP drift, filesystem corruption). Automates the response to known bad states.

Pattern: PDB-Safe Maintenance

1. Check PDB status for all affected workloads.
2. Scale up deployments that have tight PDB budgets (minAvailable = replicas).
3. Drain the node.
4. After workloads stabilize on other nodes, scale back down.
This creates temporary overcapacity to ensure PDBs are never violated.
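In kubectl terms, with hypothetical deployment and namespace names:

```shell
# 1. Create headroom: bump replicas above the PDB floor (hypothetical names)
kubectl scale deployment my-app -n prod --replicas=4

# 2. Drain with the budget now satisfiable
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s

# 3. After pods settle on other nodes, return to normal size
kubectl scale deployment my-app -n prod --replicas=3
```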


Quick Reference