Runbook: Node NotReady

Field                  Value
Domain                 Kubernetes
Alert                  kube_node_status_condition{condition="Ready",status="false"} == 1
Severity               P1
Est. Resolution Time   20-45 minutes
Escalation Timeout     30 minutes — page if not resolved
Last Tested            2026-03-19
Prerequisites          kubectl access, cluster-admin or namespace-admin, kubeconfig configured

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get nodes -o wide
If output shows multiple nodes NotReady simultaneously: this is likely a control plane or network issue — escalate immediately and see etcd-latency.md.
If output shows a single node NotReady: continue with the steps below.

Step 1: Identify Which Nodes Are NotReady

Why: Knowing the count, zone, and node pool helps determine if this is a single hardware failure or a broader infrastructure problem.

kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,READY:.status.conditions[-1].status,AGE:.metadata.creationTimestamp,VERSION:.status.nodeInfo.kubeletVersion' | grep -v " True"
Expected output:
NAME                    STATUS   READY   AGE                    VERSION
ip-10-0-1-42.internal   Ready    False   2025-01-01T00:00:00Z   v1.28.0
If this fails: Run kubectl get nodes as a simpler fallback to identify the node name.

Step 2: Check Node Conditions and Events

Why: Kubernetes tracks specific conditions (DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable) that tell you exactly what the node is reporting as the problem.

kubectl describe node <NODE_NAME>
Expected output — look for the Conditions section:
Conditions:
  Type                 Status  Reason
  ----                 ------  ------
  MemoryPressure       False   KubeletHasSufficientMemory
  DiskPressure         True    KubeletHasDiskPressure        <-- problem here
  PIDPressure          False   KubeletHasSufficientPID
  Ready                False   KubeletNotReady
If DiskPressure is True: skip to Step 4 — disk is the immediate cause.
If NetworkUnavailable is True: the CNI plugin may be broken — check the CNI pod logs (Calico/Flannel/Cilium daemonset in kube-system).
If Ready is False or Unknown with no pressure conditions: the kubelet has stopped reporting to the API server — go to Step 3. (The node controller sets Ready to Unknown when kubelet heartbeats stop arriving.)
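For the NetworkUnavailable branch, a quick look at the CNI pod scheduled on the affected node usually surfaces the cause. A minimal sketch: the daemonset pods live in kube-system by default, and the exact pod names (calico-node-*, kube-flannel-ds-*, cilium-*) depend on which CNI your cluster runs.

```shell
# List the kube-system pods on the affected node; the CNI daemonset pod
# (calico-node-*, kube-flannel-ds-*, or cilium-*) will be among them
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<NODE_NAME>

# Tail the CNI pod's logs (substitute the pod name found above)
kubectl -n kube-system logs <CNI_POD_NAME> --tail=50
```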

Step 3: SSH Into the Node and Check Kubelet Status

Why: A NotReady node means the kubelet stopped reporting. You must get on the node directly to diagnose.

# Get the node's internal/external IP from Step 1 output
ssh <SSH_USER>@<NODE_IP>

# Once on the node:
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago" --no-pager | tail -50
Expected output (healthy kubelet):
● kubelet.service - Kubernetes Kubelet
   Active: active (running) since ...
If kubelet is stopped or failed:
sudo systemctl restart kubelet
sudo systemctl status kubelet
If kubelet fails to restart: check /var/log/messages or journalctl -xe for startup errors — often a bad kubeconfig or an expired certificate.
If kubelet is running but the node is still NotReady: the kubelet cannot reach the API server — check network/firewall rules.
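If the kubelet is up but the node stays NotReady, it helps to probe the API server from the node itself. A sketch assuming kubeadm-default paths (/etc/kubernetes/kubelet.conf and /var/lib/kubelet/pki/kubelet-client-current.pem); other distros place these elsewhere.

```shell
# Find the API server endpoint the kubelet is configured to talk to
sudo grep server: /etc/kubernetes/kubelet.conf

# Probe it from the node. Any HTTP response (even 401/403) proves network
# reachability; a timeout points at firewall or routing problems.
curl -sk --max-time 5 https://<API_SERVER>:6443/healthz

# Check the kubelet client certificate for expiry
sudo openssl x509 -noout -enddate \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem
```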

Step 4: Check Disk and Memory Pressure

Why: Nodes evict pods and mark themselves NotReady when disk or memory crosses eviction thresholds. You need to free resources before the node recovers.

# Run on the node (SSH'd in from Step 3)
df -h
free -h

# Check which directories are consuming disk (docker and containerd paths)
du -sh /var/lib/docker/* /var/lib/containerd/* 2>/dev/null | sort -rh | head -20
du -sh /var/log/* 2>/dev/null | sort -rh | head -20

# Remove dangling Docker images if disk is the issue
sudo docker image prune -f
# Or for containerd:
sudo crictl rmi --prune
Expected output (disk cleared):
Filesystem      Size  Used Avail Use%
/dev/xvda1       50G   12G   38G  24%
If this fails: The disk may be full due to log files that cannot be pruned without application changes — escalate to the application team to reduce log verbosity.
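To relate the df output to kubelet behavior: the default hard eviction threshold for the node filesystem is nodefs.available < 10%, i.e. roughly 90% used (a default; your kubelet config may override it). A small runnable check along those lines:

```shell
# Compare root-filesystem usage against the kubelet's default hard
# eviction threshold (nodefs.available < 10%, i.e. about 90% used).
# The 90 below mirrors that default; check your kubelet config.
USED=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USED" -ge 90 ]; then
  echo "ALERT: / is ${USED}% used, past the default eviction threshold"
else
  echo "OK: / is ${USED}% used"
fi
```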

Step 5: Check Container Runtime

Why: If the container runtime (containerd or Docker) has crashed or hung, the kubelet cannot manage containers and the node will be NotReady.

# Check containerd (most clusters)
sudo systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago" --no-pager | tail -30

# If using Docker (older clusters)
sudo systemctl status docker
sudo journalctl -u docker --since "30 minutes ago" --no-pager | tail -30

# Restart if needed
sudo systemctl restart containerd
Expected output (healthy containerd):
● containerd.service - containerd container runtime
   Active: active (running) since ...
If containerd cannot restart: There may be a corrupted state in /var/lib/containerd — this requires escalation as recovery involves data loss risk.

Step 6: Cordon and Drain the Node If Repair Will Take Time

Why: Cordoning prevents new pods from being scheduled on a broken node. Draining moves existing pods to healthy nodes, protecting SLA. Do this before attempting longer repairs.

# Cordon first — this is safe and reversible
kubectl cordon <NODE_NAME>

# Drain — this evicts pods gracefully (respect PodDisruptionBudgets)
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --grace-period=60
Expected output:
node/<NODE_NAME> cordoned
evicting pod <NAMESPACE>/<POD_NAME>
pod/<POD_NAME> evicted
node/<NODE_NAME> drained
If drain hangs: A pod may be violating a PodDisruptionBudget (PDB). Check with:
kubectl get pdb -A
If the drain still hangs, you can bypass the PDB with kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --disable-eviction, which deletes pods directly instead of going through the eviction API and therefore does not honor PodDisruptionBudgets. (Note that --force alone does not bypass PDBs; it only allows deleting standalone pods not managed by a controller.) Only do this after confirming with the service owner that the workload can tolerate disruption.

Step 7: Repair or Replace the Node

Why: If kubelet/runtime restarts do not resolve the issue, the node OS or hardware may be faulty and requires replacement.

# Option A: Uncordon after repair (node is fixed)
kubectl uncordon <NODE_NAME>
kubectl get nodes

# Option B: Terminate the node and let autoscaler replace it (cloud environments)
# AWS example — use the node's instance ID from 'kubectl describe node <NODE_NAME>'
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>

# After replacement, confirm the new node joins
kubectl get nodes -w
Expected output (uncordon):
node/<NODE_NAME> uncordoned
NAME                    STATUS   ROLES    AGE   VERSION
ip-10-0-1-42.internal   Ready    <none>   5m    v1.28.0
If this fails: The node pool autoscaler may not be configured — page the platform team to manually provision a replacement node.
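Before paging, it is worth confirming whether the autoscaler is present at all. A sketch; the deployment name and namespace below are the common defaults for a cluster-autoscaler install and may differ in yours:

```shell
# Is cluster-autoscaler installed and running?
kubectl -n kube-system get deploy cluster-autoscaler

# Recent autoscaler decisions (scale-ups for the replacement node show up here)
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=20
```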

Verification

# Confirm the issue is resolved
kubectl get nodes
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME>
Success looks like: the node shows Ready status; if the node was replaced, all pods previously on it are now Running on other nodes.
If still broken: Escalate — see below.

Escalation

Condition                          Who to Page     What to Say
Not resolved in 30 min             SRE on-call     "Kubernetes Node NotReady in <CLUSTER_NAME>, node <NODE_NAME>, kubelet/runtime unrecoverable, runbook exhausted"
Data loss suspected                Platform Lead   "Data loss risk: node had local storage (emptyDir/hostPath) for stateful workloads"
Scope expanding beyond namespace   Platform team   "Multi-node impact: <N> nodes NotReady, possible control plane or network failure"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Document what caused the node failure in the postmortem
  • Verify autoscaler replaced the node if it was terminated
  • Check whether any PodDisruptionBudgets need adjustment to allow faster drains

Common Mistakes

  1. Waiting to cordon: kubectl drain cordons the node automatically when it starts, but every minute between detecting the failure and starting the drain is a window in which the scheduler can place new pods on the broken node — pods that will be evicted moments later, causing unnecessary disruption. Cordon as soon as you confirm the node is unhealthy, even if the drain comes later.
  2. Not verifying that workloads rescheduled: After draining, engineers often assume the pods are running elsewhere. Pods with unsatisfiable node selectors, taints, or resource requirements may be stuck Pending on other nodes. Always run kubectl get pods -A -o wide | grep Pending after a drain to confirm rescheduling succeeded.
  3. Restarting kubelet without checking why it stopped: A kubelet that crashed due to a bug will restart fine. A kubelet that failed due to certificate expiry or misconfiguration will fail again immediately. Check journalctl -u kubelet for the failure reason before restarting.
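The failure-reason check in mistake 3 can be narrowed to error-level messages only, using journald's priority filter:

```shell
# Surface only error-and-worse kubelet messages from the last hour,
# so the fatal startup error is not buried in routine log lines
sudo journalctl -u kubelet -p err --since "1 hour ago" --no-pager | tail -20
```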

Prevention

  • Set up node health monitoring (node-problem-detector)
  • Configure cluster autoscaler to replace unhealthy nodes
  • Monitor kubelet certificate expiry
  • Set appropriate eviction thresholds in kubelet config
  • Use PodDisruptionBudgets to protect critical workloads during drains
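For the last prevention item, a minimal PodDisruptionBudget sketch; the name, label selector, and minAvailable value are placeholders to adapt to your workload:

```shell
# Keep at least 1 replica of the selected workload up during drains
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
```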

Cross-References

  • etcd-latency.md — multiple nodes NotReady simultaneously / suspected control plane failure (see Quick Assessment)