- k8s
- l1
- runbook
- k8s-node-lifecycle
- node-maintenance

Portal | Level: L1: Foundations | Topics: Node Lifecycle & Maintenance | Domain: Kubernetes
# Runbook: Node NotReady
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | kube_node_status_condition{condition="Ready",status="false"} == 1 |
| Severity | P1 |
| Est. Resolution Time | 20-45 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
## Quick Assessment (30 seconds)

Run kubectl get nodes and check how many nodes are NotReady.

If output shows: multiple nodes NotReady simultaneously → this is likely a control plane or network issue; escalate immediately and see etcd-latency.md.
If output shows: a single node NotReady → continue with the steps below.

## Step 1: Identify Which Nodes Are NotReady
Why: Knowing the count, zone, and node pool helps determine if this is a single hardware failure or a broader infrastructure problem.
# Ready nodes report "True" and are filtered out; NotReady nodes show "False" or "Unknown"
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,AGE:.metadata.creationTimestamp,VERSION:.status.nodeInfo.kubeletVersion' | grep -v "True"
Use kubectl get nodes as a simpler fallback to identify the node name.
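The single-versus-multiple decision from the Quick Assessment can be scripted. A minimal sketch over canned output (the node names are made up; in practice pipe in real `kubectl get nodes`):

```shell
#!/bin/sh
# Sample `kubectl get nodes` output; replace with the real command in practice.
sample='NAME     STATUS     ROLES    AGE   VERSION
node-a   Ready      <none>   40d   v1.28.0
node-b   NotReady   <none>   40d   v1.28.0
node-c   Ready      <none>   12d   v1.28.0'

# Count rows whose STATUS column is NotReady (header has STATUS, so it never matches).
notready=$(printf '%s\n' "$sample" | awk '$2 ~ /NotReady/ {c++} END {print c+0}')

if [ "$notready" -gt 1 ]; then
  echo "ESCALATE: $notready nodes NotReady; see etcd-latency.md"
elif [ "$notready" -eq 1 ]; then
  echo "Single node NotReady; continue with Step 1"
else
  echo "All nodes Ready"
fi
```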
## Step 2: Check Node Conditions and Events

Why: Kubernetes tracks specific conditions (DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable) that tell you exactly what the node is reporting as the problem.

kubectl describe node <NODE_NAME>

Expected output — look for the Conditions section:

Conditions:
Type Status Reason
---- ------ ------
MemoryPressure False KubeletHasSufficientMemory
DiskPressure True KubeletHasDiskPressure <-- problem here
PIDPressure False KubeletHasSufficientPID
Ready False KubeletNotReady
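Picking the failing condition out of a long `kubectl describe node` dump is easy to script. A sketch over the sample Conditions rows above (in practice, pipe the real describe output through a similar filter):

```shell
#!/bin/sh
# Sample Conditions rows (Type / Status / Reason), as in the expected output above.
conditions='MemoryPressure False KubeletHasSufficientMemory
DiskPressure True KubeletHasDiskPressure
PIDPressure False KubeletHasSufficientPID
Ready False KubeletNotReady'

# Any pressure-type condition reporting True is the likely cause;
# Ready itself is excluded since False there is the symptom, not the cause.
printf '%s\n' "$conditions" | awk '$1 != "Ready" && $2 == "True" {print $1, "->", $3}'
```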
## Step 3: SSH Into the Node and Check Kubelet Status
Why: A NotReady node means the kubelet stopped reporting. You must get on the node directly to diagnose.
# Get the node's internal/external IP from Step 1 output
ssh <SSH_USER>@<NODE_IP>
# Once on the node:
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago" --no-pager | tail -50
Also check /var/log/messages or journalctl -xe for startup errors — the cause is often a bad kubeconfig or an expired certificate.
If kubelet is running but the node is still NotReady: the kubelet cannot reach the API server — check network and firewall rules.
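Before restarting kubelet, it helps to scan the journal for the failure signatures this step calls out. A sketch over fabricated journal lines (the messages are illustrative; in practice pipe in real `journalctl -u kubelet` output):

```shell
#!/bin/sh
# Fabricated kubelet journal lines; in practice:
#   sudo journalctl -u kubelet --since "30 minutes ago" --no-pager
journal='Mar 19 10:01:02 node-b kubelet[812]: E0319 certificate_manager.go: certificate has expired or is not yet valid
Mar 19 10:01:03 node-b kubelet[812]: failed to run Kubelet: unable to load bootstrap kubeconfig'

# Flag the two failure modes that a plain restart will NOT fix.
if printf '%s\n' "$journal" | grep -qiE 'certificate has expired|kubeconfig'; then
  echo "Likely cause: cert expiry or bad kubeconfig; fix before restarting kubelet"
fi
```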
## Step 4: Check Disk and Memory Pressure
Why: Nodes evict pods and mark themselves NotReady when disk or memory crosses eviction thresholds. You need to free resources before the node recovers.
# Run on the node (SSH'd in from Step 3)
df -h
free -h
# Check which directories are consuming disk
du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -20
du -sh /var/log/* 2>/dev/null | sort -rh | head -20
# Remove dangling Docker images if disk is the issue
sudo docker image prune -f
# Or for containerd:
sudo crictl rmi --prune
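The eviction thresholds mentioned above can be checked numerically. A sketch comparing disk usage against the kubelet default evictionHard of nodefs.available<10% (i.e. more than 90% used triggers eviction); the df row is sample data, and your cluster's thresholds may differ, so check the kubelet config to be sure:

```shell
#!/bin/sh
# Sample `df -h` row; in practice: df -h /var/lib/kubelet
sample='Filesystem     Size  Used Avail Use% Mounted on
/dev/nvme0n1p1 100G   93G    7G  93% /'

# Extract the Use% column from the data row and strip the percent sign.
used=$(printf '%s\n' "$sample" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

# Default kubelet evictionHard is nodefs.available<10%, so >90% used is the danger zone.
if [ "$used" -gt 90 ]; then
  echo "Disk ${used}% used: above the default nodefs eviction threshold, free space now"
fi
```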
## Step 5: Check Container Runtime
Why: If the container runtime (containerd or Docker) has crashed or hung, the kubelet cannot manage containers and the node will be NotReady.
# Check containerd (most clusters)
sudo systemctl status containerd
sudo journalctl -u containerd --since "30 minutes ago" --no-pager | tail -30
# If using Docker (older clusters)
sudo systemctl status docker
sudo journalctl -u docker --since "30 minutes ago" --no-pager | tail -30
# Restart if needed
sudo systemctl restart containerd
If the containerd state directory /var/lib/containerd is corrupted — this requires escalation, as recovery involves data loss risk.
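A quick way to make the "is the runtime up?" check unambiguous is to parse the Active: line of `systemctl status`. A sketch over canned status output (the sample shows a dead containerd; in practice pipe in the real command):

```shell
#!/bin/sh
# Canned `systemctl status containerd` output; replace with the real command on the node.
status='containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled)
     Active: inactive (dead) since Wed 2026-03-19 09:58:11 UTC'

# The second field of the Active: line is "active" only when the unit is running.
printf '%s\n' "$status" | awk '/Active:/ {
  print ($2 == "active" ? "runtime OK" : "runtime DOWN; restart containerd")
}'
```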
## Step 6: Cordon and Drain the Node If Repair Will Take Time
Why: Cordoning prevents new pods from being scheduled on a broken node. Draining moves existing pods to healthy nodes, protecting SLA. Do this before attempting longer repairs.
# Cordon first — this is safe and reversible
kubectl cordon <NODE_NAME>
# Drain — this evicts pods gracefully (respect PodDisruptionBudgets)
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --grace-period=60
Expected output:

node/<NODE_NAME> cordoned
evicting pod <NAMESPACE>/<POD_NAME>
pod/<POD_NAME> evicted
node/<NODE_NAME> drained
If pods are stuck and the drain hangs, you can force it with kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --force — only do this after confirming with the service owner that the workload can tolerate disruption.
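A drain that "hangs" is usually a PodDisruptionBudget with zero allowed disruptions, so it is worth checking PDBs before draining. A sketch over sample `kubectl get pdb -A` output (the namespaces and PDB names are made up):

```shell
#!/bin/sh
# Sample `kubectl get pdb -A` output; ALLOWED DISRUPTIONS of 0 means eviction will block.
pdbs='NAMESPACE  NAME      MIN-AVAILABLE  ALLOWED-DISRUPTIONS
web        api-pdb   2              0
db         pg-pdb    1              1'

# Skip the header, then flag any PDB that currently allows zero disruptions.
printf '%s\n' "$pdbs" | awk 'NR > 1 && $4 == 0 {
  print $1 "/" $2, "allows zero disruptions; drain will hang on its pods"
}'
```

If this flags anything, coordinate with the owning team (scale up the workload or relax the PDB) before draining.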
## Step 7: Repair or Replace the Node
Why: If kubelet/runtime restarts do not resolve the issue, the node OS or hardware may be faulty and requires replacement.
# Option A: Uncordon after repair (node is fixed)
kubectl uncordon <NODE_NAME>
kubectl get nodes
# Option B: Terminate the node and let autoscaler replace it (cloud environments)
# AWS example — use the node's instance ID from 'kubectl describe node <NODE_NAME>'
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>
# After replacement, confirm the new node joins
kubectl get nodes -w
Expected output:

node/<NODE_NAME> uncordoned
NAME                    STATUS   ROLES    AGE   VERSION
ip-10-0-1-42.internal   Ready    <none>   5m    v1.28.0
## Verification
# Confirm the issue is resolved
kubectl get nodes
kubectl get pods -A --field-selector spec.nodeName=<NODE_NAME>
Expected: the node shows Ready status. If the node was replaced, all pods previously on it are running on other nodes.
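Verifying that workloads actually rescheduled means checking for stragglers stuck in Pending. A sketch over sample pod listing output (the names are made up; in practice pipe in real `kubectl get pods -A -o wide`):

```shell
#!/bin/sh
# Sample `kubectl get pods -A -o wide` output after a drain or node replacement.
pods='NAMESPACE  NAME         READY  STATUS    NODE
web        api-7f9-x2   1/1    Running   node-a
web        api-7f9-y8   0/1    Pending   <none>'

# Count pods whose STATUS column is Pending (header never matches).
pending=$(printf '%s\n' "$pods" | awk '$4 == "Pending" {c++} END {print c+0}')

if [ "$pending" -gt 0 ]; then
  echo "$pending pod(s) stuck Pending; check node selectors, taints, resource requests"
fi
```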
If still broken: Escalate — see below.
## Escalation

| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | SRE on-call | "Kubernetes Node NotReady in <CLUSTER_NAME>, runbook steps exhausted, need assistance" |
| Data loss suspected | Platform Lead | "Data loss risk: node <NODE_NAME> may have corrupted runtime state, holding off on destructive recovery" |
| Scope expanding beyond namespace | Platform team | "Multi-node impact: <N> nodes NotReady, suspect control plane or network issue" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Document what caused the node failure in the postmortem
- Verify autoscaler replaced the node if it was terminated
- Check whether any PodDisruptionBudgets need adjustment to allow faster drains
## Common Mistakes

- Draining without cordoning first: kubectl drain does cordon the node before evicting, but if you spend time diagnosing before running the drain, an uncordoned node keeps accepting new pods — pods that will be evicted moments later, causing unnecessary disruption. Cordon as soon as you decide the node is unhealthy.
- Not verifying that workloads rescheduled: After draining, engineers often assume the pods are running elsewhere. Pods with unsatisfiable node selectors, taints, or resource requirements may be stuck Pending on other nodes. Always run kubectl get pods -A -o wide | grep Pending after a drain to confirm rescheduling succeeded.
- Restarting kubelet without checking why it stopped: A kubelet that crashed due to a bug will restart fine. A kubelet that failed due to certificate expiry or misconfiguration will fail again immediately. Check journalctl -u kubelet for the failure reason before restarting.
## Prevention
- Set up node health monitoring (node-problem-detector)
- Configure cluster autoscaler to replace unhealthy nodes
- Monitor kubelet certificate expiry
- Set appropriate eviction thresholds in kubelet config
- Use PodDisruptionBudgets to protect critical workloads during drains
## Cross-References
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: etcd-latency.md — if multiple nodes are NotReady simultaneously
- Related Runbook: pod-crashloop.md — if pods crash after node recovery
- Related Runbook: deploy-stuck.md — if deployments stall after node drain
## Wiki Navigation

### Related Content
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Node Lifecycle & Maintenance
- Kubernetes Node Lifecycle (Topic Pack, L2) — Node Lifecycle & Maintenance
- Kubernetes Node Lifecycle Flashcards (CLI) (flashcard_deck, L1) — Node Lifecycle & Maintenance
- Kubernetes Ops (Production) (Topic Pack, L2) — Node Lifecycle & Maintenance
- Node Maintenance (Topic Pack, L1) — Node Lifecycle & Maintenance
- Skillcheck: Kubernetes Under the Covers (Assessment, L2) — Node Lifecycle & Maintenance