Decision Tree: Node Is NotReady¶
Category: Incident Triage
Starting Question: "A Kubernetes node is in NotReady state — what's wrong?"
Estimated traversal: 3-5 minutes
Domains: kubernetes, linux-performance, networking
The Tree¶
A Kubernetes node is in NotReady state — what's wrong?
(kubectl get nodes — STATUS shows NotReady)
│
├── How long has it been NotReady?
│ │
│ ├── <2 minutes (just became NotReady)
│ │ └── May be transient — wait one more minute, then investigate
│ │ `kubectl get nodes -w` — watch for recovery
│ │
│ └── >2 minutes (sustained NotReady) → proceed below
│
├── Check node conditions: `kubectl describe node <node-name>`
│ (look at the Conditions section)
│ │
│ ├── MemoryPressure = True
│ │ │
│ │ ├── SSH to node: `ssh <node-ip>`
│ │ │ `free -h`
│ │ │ `ps aux --sort=-%mem | head -10`
│ │ │ │
│ │ │ ├── A runaway process consuming memory
│ │ │ │ └── ✅ ACTION: Kill Runaway Process / Cordon and Drain Node
│ │ │ │
│ │ │ └── Lots of pods, cumulative memory over node capacity
│ │ │ └── ✅ ACTION: Cordon and Drain Node — Reschedule Pods
│ │ │
│ │ └── Check for OOM kills: `dmesg | grep -i "killed process\|oom"`
│ │ └── OOM kill events → ✅ ACTION: Evict High-Memory Pods / Drain Node
│ │
│ ├── DiskPressure = True
│ │ │
│ │ ├── SSH to node: `df -h`
│ │ │ │
│ │ │ ├── /var/lib/kubelet or /var/lib/containerd full
│ │ │ │ │
│ │ │ │ ├── Prune unused images: `crictl rmi --prune` or `docker image prune -a`
│ │ │ │ │
│ │ │ │ └── Dangling images / overlays: `du -sh /var/lib/containerd/io.containerd.snapshotter*`
│ │ │ │ └── ✅ ACTION: Prune Container Images and Stopped Containers
│ │ │ │
│ │ │ ├── /var/log full
│ │ │ │ `du -sh /var/log/*`
│ │ │ │ └── ✅ ACTION: Rotate / Truncate Logs
│ │ │ │
│ │ │ └── Root filesystem full (/)
│ │ │ `du -sh /* 2>/dev/null | sort -rh | head -10`
│ │ │ └── ✅ ACTION: Identify and Remove Large Files / Expand Volume
│ │ │
│ │ └── Only kubelet reports pressure, df looks OK?
│ │ `df -i` — check inodes
│ │ └── Inodes exhausted → ✅ ACTION: Clean Up Inode-Consuming Files (small files / sockets)
│ │
│ ├── PIDPressure = True
│ │ │
│ │ ├── `cat /proc/sys/kernel/pid_max` — what's the PID limit?
│ │ │ `ps aux | wc -l` — how many processes?
│ │ │ │
│ │ │ └── Near or over limit → fork bomb or runaway process
│ │ │ `ps aux --sort=-%cpu | head -20`
│ │ │ └── ✅ ACTION: Kill Runaway Process / Cordon Node
│ │ │
│ ├── NetworkUnavailable = True
│ │ │
│ │ ├── CNI plugin issue — check CNI pod on that node
│ │ │ `kubectl get pods -n kube-system -o wide | grep <node-name>`
│ │ │ `kubectl logs -n kube-system <cni-pod>`
│ │ │ │
│ │ │ ├── CNI pod crashlooping → ✅ ACTION: Restart CNI Pod / Reinstall CNI
│ │ │ │
│ │ │ └── CNI config mismatch → ✅ ACTION: Fix CNI ConfigMap
│ │ │
│ │ └── Can node reach the cluster network gateway?
│ │ `ssh <node> -- ping -c5 <api-server-ip>`
│ │ └── Cannot ping → cloud VPC / routing issue
│ │ → ⚠️ ESCALATION: Cloud Provider Network
│ │
│ └── All Conditions OK but still NotReady?
│ → Kubelet is alive but cannot communicate with API server
│ `ssh <node> -- systemctl status kubelet`
│ │
│ ├── kubelet is not running → ✅ ACTION: Restart Kubelet
│ │
│ └── kubelet running but API unreachable
│ `ssh <node> -- curl -k https://<api-server-ip>:6443/healthz`
│ │
│ ├── Cannot reach API server → network partition
│ │ └── ⚠️ ESCALATION: Cloud Provider / Network Team
│ │
│ └── API server reachable but kubelet cert expired
│ `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates`
│ └── ✅ ACTION: Rotate Kubelet Client Certificate
│
├── Check kernel for hardware / OS issues
│ `dmesg | tail -50` (look for hardware errors, kernel panics, filesystem errors)
│ │
│ ├── Kernel panic / BUG / OOPS
│ │ └── ⚠️ ESCALATION: Drain Node + Engage Infrastructure Team
│ │
│ ├── EXT4 / XFS filesystem errors
│ │ └── ✅ ACTION: Cordon, Drain, Run fsck (requires unmount)
│ │
│ └── Hardware errors (MCE / EDAC / disk errors)
│ └── ⚠️ ESCALATION: Replace Node (cloud: terminate instance)
│
└── Are pods being rescheduled?
`kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name>`
│
├── Pods still on node → they will be evicted after 5 min (the default `tolerationSeconds: 300` on the not-ready taint)
│ Expedite: ✅ ACTION: Cordon and Drain Node
│
└── Pods already evicted → check if they rescheduled successfully elsewhere
`kubectl get pods --all-namespaces | grep -v Running | grep -v Completed`
└── Stuck in Pending → check other nodes' capacity
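The Conditions branch at the top of the tree can be mechanized for a first pass. A minimal sketch, assuming the standard `kubectl describe node` table layout (Type in column 1, Status in column 2); `flag_conditions` is an illustrative helper name:

```shell
# flag_conditions: read `kubectl describe node <name>` output on stdin and
# print every pressure condition whose Status is True, plus a non-True Ready.
flag_conditions() {
  awk '
    /^Conditions:/ { in_cond = 1; next }
    in_cond && /^[A-Z][a-z]+:/ { in_cond = 0 }   # next top-level section ends the table
    in_cond && $1 ~ /^(MemoryPressure|DiskPressure|PIDPressure|NetworkUnavailable)$/ && $2 == "True" {
      print $1 " is True -> follow that branch of the tree"
    }
    in_cond && $1 == "Ready" && $2 != "True" {
      print "Ready is " $2 " -> kubelet is not reporting healthy"
    }'
}
# Usage: kubectl describe node <node-name> | flag_conditions
```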
Node Details¶
Check 1: Node conditions¶
Command: kubectl describe node <node-name> | grep -A20 "Conditions:"
What you're looking for: The five standard conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, Ready. Any showing True when it should be False (or vice versa for Ready) indicates the root cause.
Common pitfall: The conditions reflect what kubelet last reported. If kubelet itself is down, conditions may be stale. Check the LastHeartbeatTime — if it's more than 40s ago, kubelet is not communicating.
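The staleness check reduces to date arithmetic on the LastHeartbeatTime value. A sketch assuming GNU `date`; the 40s cutoff is the controller-manager's `node-monitor-grace-period` default, and `heartbeat_age` is an illustrative helper:

```shell
# heartbeat_age: seconds since a LastHeartbeatTime value. kubectl prints
# timestamps like "Mon, 02 Jan 2006 15:04:05 +0000"; GNU date parses both
# that format and RFC3339.
heartbeat_age() {
  echo $(( $(date +%s) - $(date -d "$1" +%s) ))
}

# Stale if older than the 40s node-monitor-grace-period default:
age=$(heartbeat_age "Mon, 02 Jan 2006 15:04:05 +0000")
if [ "$age" -gt 40 ]; then
  echo "heartbeat is ${age}s old -> kubelet has stopped reporting"
fi
```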
Check 2: Kubelet status¶
Command: systemctl status kubelet and journalctl -u kubelet --since "10 minutes ago" -n 100
What you're looking for: "Active: running" vs failed/inactive. In journal: "Failed to sync node status", "certificate expired", "connection refused to API server".
Common pitfall: kubelet may be running but unable to reach the API server. Check both the process status AND whether it can connect: curl -k https://<api-server>:6443/healthz from the node.
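To scan a captured journal for the signatures above, a rough filter can help; the pattern list is illustrative, not exhaustive, and `kubelet_log_triage` is a made-up name:

```shell
# kubelet_log_triage: count kubelet journal lines (stdin) matching the
# common NotReady failure signatures, then suggest the next step.
kubelet_log_triage() {
  if grep -ciE "certificate has expired|connection refused|failed to sync node status|x509"
  then echo "-> matches found; read the surrounding lines with journalctl -u kubelet"
  else echo "-> none of the common signatures; widen the search"
  fi
}
# Usage: journalctl -u kubelet --since "10 minutes ago" | kubelet_log_triage
```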
Check 3: Disk pressure — inode exhaustion¶
Command: df -i — this shows inode usage, not block usage. A filesystem can be 30% full by blocks but 100% full by inodes.
What you're looking for: Any filesystem at 100% IUse%. Common cause: tens of thousands of small files (container logs per rotation, sockets, pid files).
Common pitfall: df -h shows plenty of space but the node is in DiskPressure. Always check df -i as a second step.
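The second step can be scripted against `df -i` output. A sketch assuming the GNU coreutils column layout (IUse% in column 5, mountpoint in column 6); `inode_check` is an illustrative helper:

```shell
# inode_check: read `df -i` output on stdin and list filesystems whose
# IUse% meets a threshold (default 95).
inode_check() {
  awk -v t="${1:-95}" 'NR > 1 {
    pct = $5; sub(/%/, "", pct)            # IUse% column, e.g. "100%"
    if (pct + 0 >= t) print $6 " inodes at " $5 " (" $3 " of " $2 " used)"
  }'
}
# Usage: df -i | inode_check
```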
Check 4: Container/image disk usage¶
Command: For containerd: crictl images | awk '{print $4, $1}' | sort -rh | head (size plus image name, so the largest images are identifiable). For docker: docker system df. For raw overlay: du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/.
What you're looking for: Large number of stopped containers, unused image layers, or dangling volumes accumulating.
Common pitfall: Kubernetes garbage collects images, but only when disk usage exceeds imageGCHighThresholdPercent (default: 85%). Nodes can be in DiskPressure before GC kicks in.
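The GC interaction reduces to a comparison against `imageGCHighThresholdPercent`. A sketch using the kubelet defaults of 85/80; `gc_headroom` is a hypothetical helper:

```shell
# gc_headroom: given a df "Use%" value and the image GC thresholds, report
# whether kubelet image GC should already be deleting images.
gc_headroom() {
  use=${1%"%"}                  # accept "92" or "92%"
  high="${2:-85}"               # imageGCHighThresholdPercent default
  low="${3:-80}"                # imageGCLowThresholdPercent default
  if [ "$use" -ge "$high" ]; then
    echo "usage ${use}% >= ${high}% -> image GC should be deleting down to ${low}%"
  else
    echo "usage ${use}% < ${high}% -> image GC idle; pressure is from something GC cannot remove"
  fi
}
```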
Check 5: Kernel messages¶
Command: dmesg -T | tail -100 | grep -iE "error|warn|oom|panic|bug|killed|fail"
What you're looking for: Hardware memory errors (ECC/MCE), filesystem corruption (EXT4-fs error), kernel panics, NVMe/disk errors, or OOM kill events.
Common pitfall: dmesg entries are timestamped relative to boot or wall-clock (with -T). Correlate with the time the node went NotReady to distinguish old errors from current issues.
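The time correlation can be scripted. A sketch that keeps only `dmesg -T` lines whose wall-clock timestamp falls at or after a cutoff (for example, when the node went NotReady); assumes GNU `date`, and `dmesg_since` is an illustrative name:

```shell
# dmesg_since: filter `dmesg -T` lines (stdin) to those timestamped at or
# after the cutoff passed as $1. Lines without a parsable timestamp are skipped.
dmesg_since() {
  cutoff=$(date -d "$1" +%s)
  while IFS= read -r line; do
    ts=${line#\[}; ts=${ts%%]*}                    # "[Tue Jun  4 10:30:00 2024] ..."
    when=$(date -d "$ts" +%s 2>/dev/null) || continue
    [ "$when" -ge "$cutoff" ] && printf '%s\n' "$line"
  done
}
# Usage: dmesg -T | dmesg_since "2024-06-04 10:00:00"
```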
Check 6: CNI health¶
Command: kubectl get pods -n kube-system -o wide | grep -E "calico|cilium|flannel|weave" | grep <node-name> — find the CNI agent pod on the affected node, then kubectl logs -n kube-system <cni-pod>.
What you're looking for: CNI plugin crash logs, "route not found", "interface not found", IPAM errors.
Common pitfall: CNI pods run as DaemonSets — a NotReady node can cause its own CNI pod to be evicted or stuck, creating a chicken-and-egg. You may need to SSH to the node and inspect the CNI manually.
Terminal Actions¶
Action: Restart Kubelet¶
Do:
1. SSH to the node: ssh ubuntu@<node-ip> or use your cloud provider's session manager
2. sudo systemctl restart kubelet
3. Monitor: sudo journalctl -u kubelet -f — watch for "Successfully registered node"
4. Back in kubectl: kubectl get nodes -w — watch for Ready status
Verify: kubectl get node <name> shows STATUS = Ready within 30-60 seconds.
Runbook: node_not_ready.md
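Step 4's watch can be made a bounded poll so the action fails loudly instead of hanging. A generic sketch; in real use the probe command would be `kubectl get node <name> --no-headers | grep -qw Ready`, and `wait_ready` is an illustrative name:

```shell
# wait_ready: run a probe command once per second until it succeeds or
# the attempt budget is spent. Returns 0 on success, 1 on timeout.
wait_ready() {
  attempts="$1"; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then echo "ready after $((i + 1)) checks"; return 0; fi
    i=$((i + 1))
    sleep 1
  done
  echo "still NotReady after ${attempts} checks" >&2
  return 1
}
```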
Action: Cordon and Drain Node¶
Do:
1. Prevent new pods from scheduling: kubectl cordon <node-name>
2. Evict existing pods gracefully: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
3. Investigate the root cause (disk, memory, kernel)
4. After fixing: kubectl uncordon <node-name> to re-enable scheduling
Verify: kubectl get pods -o wide | grep <node-name> shows no non-DaemonSet pods remaining.
Action: Prune Container Images and Stopped Containers¶
Do:
1. sudo crictl rmi --prune — remove unused images (containerd)
2. sudo crictl rm $(sudo crictl ps -a -q --state exited) — remove stopped containers
3. Check space freed: df -h /var/lib/containerd
4. If still full: sudo du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head
Verify: df -h shows filesystem below 70%. kubectl get node <name> transitions to Ready.
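The verify step can be made mechanical. A sketch that checks one mountpoint's Use% from `df -h` output against the 70% target; `verify_usage` is an illustrative helper:

```shell
# verify_usage: read `df -h` output on stdin; succeed only if the given
# mountpoint is below the target Use% (default 70).
verify_usage() {
  awk -v m="$1" -v t="${2:-70}" '
    $NF == m { pct = $(NF-1); sub(/%/, "", pct)
               if (pct + 0 < t) { print m " at " pct "% (< " t "%) -> OK"; ok = 1 }
               else             { print m " still at " pct "% (>= " t "%)" } }
    END { exit ok ? 0 : 1 }'
}
# Usage: df -h | verify_usage /var/lib/containerd 70
```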
Action: Rotate / Truncate Logs¶
Do:
1. Check largest logs: sudo du -sh /var/log/* | sort -rh | head -10
2. Force log rotation: sudo logrotate -f /etc/logrotate.conf
3. For the systemd journal: sudo journalctl --vacuum-size=500M
4. Truncate a specific large log: sudo truncate -s 0 /var/log/syslog (only after you have reviewed what you are discarding)
Verify: df -h shows filesystem below 70%.
Action: Kill Runaway Process / Cordon Node¶
Do:
1. Identify: ps aux --sort=-%mem | head -5 or ps aux --sort=-%cpu | head -5
2. Kill gracefully: sudo kill -15 <pid>
3. If unresponsive: sudo kill -9 <pid>
4. Cordon the node while investigating root cause: kubectl cordon <node-name>
Verify: free -h shows memory recovery. Node transitions to Ready if it was MemoryPressure only.
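Steps 2-3 (graceful kill, then escalate) can be combined into one bounded loop. A sketch; `term_then_kill` is an illustrative name, and the grace period is a parameter:

```shell
# term_then_kill: send SIGTERM, wait up to $2 seconds for the process to
# exit, then escalate to SIGKILL if it is still alive.
term_then_kill() {
  pid="$1"; grace="${2:-10}"
  kill -15 "$pid" 2>/dev/null
  i=0
  while [ "$i" -lt "$grace" ] && kill -0 "$pid" 2>/dev/null; do
    sleep 1; i=$((i + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    echo "pid $pid ignored SIGTERM, sending SIGKILL"
    kill -9 "$pid"
  else
    echo "pid $pid exited on SIGTERM"
  fi
}
```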
Action: Rotate Kubelet Client Certificate¶
Do:
1. Check cert expiry: sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
2. On kubeadm clusters: sudo kubeadm certs renew all (renews control-plane certificates; the kubelet client cert itself rotates via the CSR flow in step 3)
3. Or approve pending CSR: kubectl get csr — find pending CSR for the node, then kubectl certificate approve <csr-name>
4. Restart kubelet: sudo systemctl restart kubelet
Verify: kubectl get node <name> shows Ready. openssl x509 -noout -dates shows future NotAfter.
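The expiry check in step 1 reduces to date arithmetic on the certificate's NotAfter field. A sketch assuming GNU `date` and `openssl`; in real use the path would be /var/lib/kubelet/pki/kubelet-client-current.pem, and `cert_days_left` is an illustrative name:

```shell
# cert_days_left: print whole days until the PEM certificate at $1 expires
# (negative means already expired).
cert_days_left() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}
```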
Escalation: Cloud Provider Network¶
When: Node is running but cannot reach the API server, ping fails to other nodes, CNI is healthy.
Who: Cloud infrastructure team, cloud provider support
Include in page: Node name, private IP, VPC/subnet, ip route show output from node, cloud provider network event log
Escalation: Drain Node + Engage Infrastructure Team¶
When: Kernel panic, BUG, or hardware MCE errors in dmesg.
Who: Infrastructure / platform SRE team
Include in page: dmesg | tail -200 output, node name, instance type, cloud provider AZ, time of first error
Edge Cases¶
- Cloud spot/preemptible instance terminated: The node was terminated by the cloud provider. Check cloud console for "Spot interruption" events. This is expected behavior — verify cluster autoscaler replaces the node.
- Node NotReady after etcd issues: If etcd is degraded, the API server may not respond, causing nodes to appear NotReady even though they're healthy. Check etcd health before blaming nodes.
- NotReady after upgrade: A Kubernetes version upgrade may require a kubelet restart or config update. Check that kubelet --version on the node matches the API server version (±1 minor version).
- Network plugin (CNI) upgrade caused NotReady: During a CNI upgrade, the DaemonSet rolls out across nodes and each node is briefly NotReady. This is expected and should self-resolve — do not intervene unless a node is stuck >5 min.
- All nodes NotReady simultaneously: This is a control plane issue, not a node issue. Check API server and etcd health first.
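That last discriminator is a simple ratio over `kubectl get nodes` output. A sketch; `notready_ratio` is an illustrative helper, and a cordoned node (Ready,SchedulingDisabled) still counts as Ready:

```shell
# notready_ratio: read `kubectl get nodes --no-headers` output on stdin and
# decide whether this looks like a per-node or control-plane problem.
notready_ratio() {
  awk '{ total++; if ($2 !~ /^Ready/) bad++ }
       END {
         if (bad == total && total > 1) verdict = "suspect control plane / etcd first"
         else verdict = "triage the affected node(s) individually"
         printf "%d/%d NotReady -> %s\n", bad, total, verdict
       }'
}
# Usage: kubectl get nodes --no-headers | notready_ratio
```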
Cross-References¶
- Topic Packs: k8s-node-lifecycle, k8s-ops, linux-performance, linux-ops-storage, networking
- Runbooks: node_not_ready.md, pod_eviction.md, oomkilled.md