Runbook: Node NotReady

Field                  Value
Domain                 Kubernetes
Alert                  kube_node_status_condition{condition="Ready",status="false"} == 1
Severity               P1
Est. Resolution Time   20-45 minutes
Escalation Timeout     30 minutes — page if not resolved
Last Tested            2026-03-19
Prerequisites          kubectl access, cluster-admin or namespace-admin, kubeconfig configured

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get nodes -o wide
If output shows multiple nodes NotReady simultaneously: this is likely a control plane or network issue — escalate immediately and see etcd-latency.md.
If output shows a single node NotReady: continue with the steps below.

Step 1: Identify Which Nodes Are NotReady

Why: Knowing the count, zone, and node pool helps determine if this is a single hardware failure or a broader infrastructure problem.

kubectl get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,READY:.status.conditions[-1].status,AGE:.metadata.creationTimestamp,VERSION:.status.nodeInfo.kubeletVersion' | grep -v " True"
Expected output:
NAME                    STATUS   READY   AGE                    VERSION
ip-10-0-1-42.internal   Ready    False   2025-01-01T00:00:00Z   v1.28.0
If this fails: Run kubectl get nodes as a simpler fallback to identify the node name.

Step 2: Check Node Conditions and Events

Why: Kubernetes tracks specific conditions (DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable) that tell you exactly what the node is reporting as the problem.

kubectl describe node <NODE_NAME>
Expected output — look for the Conditions section:
Conditions:
  Type                 Status  Reason
  ----                 ------  ------
  MemoryPressure       False   KubeletHasSufficientMemory
  DiskPressure         True    KubeletHasDiskPressure        <-- problem here
  PIDPressure          False   KubeletHasSufficientPID
  Ready                False   KubeletNotReady
If DiskPressure is True: skip to Step 4 — disk is the immediate cause.
If NetworkUnavailable is True: the CNI plugin may be broken — check the CNI pod logs (Calico/Flannel/Cilium daemonset in kube-system).
If Ready is False or Unknown with no pressure conditions: the kubelet has stopped reporting to the API server — go to Step 3. (The node controller sets Ready to Unknown when kubelet heartbeats stop arriving.)
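For the NetworkUnavailable branch, a quick look at the CNI pod scheduled on the affected node usually surfaces the cause. A minimal sketch: the daemonset pods live in kube-system by default, and the exact pod names (calico-node-*, kube-flannel-ds-*, cilium-*) depend on which CNI your cluster runs.

```shell
# List the kube-system pods on the affected node; the CNI daemonset pod
# (calico-node-*, kube-flannel-ds-*, or cilium-*) will be among them
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<NODE_NAME>

# Tail the CNI pod's logs (substitute the pod name found above)
kubectl -n kube-system logs <CNI_POD_NAME> --tail=50
```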

Step 3: SSH Into the Node and Check Kubelet Status

Why: A NotReady node means the kubelet stopped reporting. You must get on the node directly to diagnose.

# Get the node's internal/external IP from Step 1 output
ssh <SSH_USER>@<NODE_IP>

# Once on the node:
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago" --no-pager | tail -50
Expected output (healthy kubelet):
● kubelet.service - Kubernetes Kubelet
   Active: active (running) since ...
If kubelet is stopped or failed:
sudo systemctl restart kubelet
sudo systemctl status kubelet
If kubelet fails to restart: check /var/log/messages or journalctl -xe for startup errors — often a bad kubeconfig or an expired certificate.
If kubelet is running but the node is still NotReady: the kubelet cannot reach the API server — check network/firewall rules.
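If the kubelet is up but the node stays NotReady, it helps to probe the API server from the node itself. A sketch assuming kubeadm-default paths (/etc/kubernetes/kubelet.conf and /var/lib/kubelet/pki/kubelet-client-current.pem); other distros place these elsewhere.

```shell
# Find the API server endpoint the kubelet is configured to talk to
sudo grep server: /etc/kubernetes/kubelet.conf

# Probe it from the node. Any HTTP response (even 401/403) proves network
# reachability; a timeout points at firewall or routing problems.
curl -sk --max-time 5 https://<API_SERVER>:6443/healthz

# Check the kubelet client certificate for expiry
sudo openssl x509 -noout -enddate \
  -in /var/lib/kubelet/pki/kubelet-client-current.pem
```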

Step 4: Check Disk and Memory Pressure

Why: Nodes evict pods and mark themselves NotReady when disk or memory crosses eviction thresholds. You need to free resources before the node recovers.

# Run on the node (SSH'd in from Step 3)
df -h
free -h

# Check which directories are consuming disk (docker and containerd paths)
du -sh /var/lib/docker/* /var/lib/containerd/* 2>/dev/null | sort -rh | head -20
du -sh /var/log/* 2>/dev/null | sort -rh | head -20

# Remove dangling Docker images if disk is the issue
sudo docker image prune -f
# Or for containerd:
sudo crictl rmi --prune
Expected output (disk cleared):
Filesystem      Size  Used Avail Use%
/dev/xvda1       50G   12G   38G  24%
If this fails: The disk may be full due to log files that cannot be pruned without application changes — escalate to the application team to reduce log verbosity.
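To relate the df output to kubelet behavior: the default hard eviction threshold for the node filesystem is nodefs.available < 10%, i.e. roughly 90% used (a default; your kubelet config may override it). A small runnable check along those lines:

```shell
# Compare root-filesystem usage against the kubelet's default hard
# eviction threshold (nodefs.available < 10%, i.e. about 90% used).
# The 90 below mirrors that default; check your kubelet config.
USED=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USED" -ge 90 ]; then
  echo "ALERT: / is ${USED}% used, past the default eviction threshold"
else
  echo "OK: / is ${USED}% used"
fi
```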

Step 5: Check Container Runtime

Why: If the container runtime (containerd or Docker) has crashed or hung, the kubelet cannot manage containers and the node will be NotReady.

# Check containerd (most clusters)
sudo systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago" --no-pager | tail -30

# If using Docker (older clusters)
sudo systemctl status docker
sudo journalctl -u docker --since "30 minutes ago" --no-pager | tail -30

# Restart if needed
sudo systemctl restart containerd
Expected output (healthy containerd):
● containerd.service - containerd container runtime
   Active: active (running) since ...
If containerd cannot restart: There may be a corrupted state in /var/lib/containerd — this requires escalation as recovery involves data loss risk.

Step 6: Cordon and Drain the Node If Repair Will Take Time

Why: Cordoning prevents new pods from being scheduled on a broken node. Draining moves existing pods to healthy nodes, protecting SLA. Do this before attempting longer repairs.

# Cordon first — this is safe and reversible
kubectl cordon <NODE_NAME>

# Drain — this evicts pods gracefully (respect PodDisruptionBudgets)
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --grace-period=60
Expected output:
node/<NODE_NAME> cordoned
evicting pod <NAMESPACE>/<POD_NAME>
pod/<POD_NAME> evicted
node/<NODE_NAME> drained
If drain hangs: A pod may be violating a PodDisruptionBudget (PDB). Check with:
kubectl get pdb -A
If the drain still hangs, you can bypass the PDB with kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --disable-eviction, which deletes pods directly instead of going through the eviction API and therefore does not honor PodDisruptionBudgets. (Note that --force alone does not bypass PDBs; it only allows deleting standalone pods not managed by a controller.) Only do this after confirming with the service owner that the workload can tolerate disruption.

Step 7: Repair or Replace the Node

Why: If kubelet/runtime restarts do not resolve the issue, the node OS or hardware may be faulty and requires replacement.

# Option A: Uncordon after repair (node is fixed)
kubectl uncordon <NODE_NAME>
kubectl get nodes

# Option B: Terminate the node and let autoscaler replace it (cloud environments)
# AWS example — use the node's instance ID from 'kubectl describe node <NODE_NAME>'
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>

# After replacement, confirm the new node joins
kubectl get nodes -w
Expected output (uncordon):
node/<NODE_NAME> uncordoned
NAME                    STATUS   ROLES    AGE   VERSION
ip-10-0-1-42.internal   Ready    <none>   5m    v1.28.0
If this fails: The node pool autoscaler may not be configured — page the platform team to manually provision a replacement node.
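Before paging, it is worth confirming whether the autoscaler is present at all. A sketch; the deployment name and namespace below are the common defaults for a cluster-autoscaler install and may differ in yours:

```shell
# Is cluster-autoscaler installed and running?
kubectl -n kube-system get deploy cluster-autoscaler

# Recent autoscaler decisions (scale-ups for the replacement node show up here)
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=20
```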

Verification

# Confirm the issue is resolved
kubectl get nodes
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME>
Success looks like: the node shows Ready status; if the node was replaced, all pods previously on it are now Running on other nodes.
If still broken: Escalate — see below.

Escalation

Condition                          Who to Page     What to Say
Not resolved in 30 min             SRE on-call     "Kubernetes Node NotReady in <CLUSTER_NAME>, node <NODE_NAME>, kubelet/runtime unrecoverable, runbook exhausted"
Data loss suspected                Platform Lead   "Data loss risk: node had local storage (emptyDir/hostPath) for stateful workloads"
Scope expanding beyond namespace   Platform team   "Multi-node impact: <N> nodes NotReady, possible control plane or network failure"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Document what caused the node failure in the postmortem
  • Verify autoscaler replaced the node if it was terminated
  • Check whether any PodDisruptionBudgets need adjustment to allow faster drains

Common Mistakes

  1. Waiting to cordon: kubectl drain cordons the node automatically when it starts, but every minute between detecting the failure and starting the drain is a window in which the scheduler can place new pods on the broken node — pods that will be evicted moments later, causing unnecessary disruption. Cordon as soon as you confirm the node is unhealthy, even if the drain comes later.
  2. Not verifying that workloads rescheduled: After draining, engineers often assume the pods are running elsewhere. Pods with unsatisfiable node selectors, taints, or resource requirements may be stuck Pending on other nodes. Always run kubectl get pods -A -o wide | grep Pending after a drain to confirm rescheduling succeeded.
  3. Restarting kubelet without checking why it stopped: A kubelet that crashed due to a bug will restart fine. A kubelet that failed due to certificate expiry or misconfiguration will fail again immediately. Check journalctl -u kubelet for the failure reason before restarting.
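The failure-reason check in mistake 3 can be narrowed to error-level messages only, using journald's priority filter:

```shell
# Surface only error-and-worse kubelet messages from the last hour,
# so the fatal startup error is not buried in routine log lines
sudo journalctl -u kubelet -p err --since "1 hour ago" --no-pager | tail -20
```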

Prevention

  • Set up node health monitoring (node-problem-detector)
  • Configure cluster autoscaler to replace unhealthy nodes
  • Monitor kubelet certificate expiry
  • Set appropriate eviction thresholds in kubelet config
  • Use PodDisruptionBudgets to protect critical workloads during drains
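For the last prevention item, a minimal PodDisruptionBudget sketch; the name, label selector, and minAvailable value are placeholders to adapt to your workload:

```shell
# Keep at least 1 replica of the selected workload up during drains
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
```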

Cross-References

  • etcd-latency.md — multiple nodes NotReady simultaneously / suspected control plane failure (see Quick Assessment)