Decision Tree: Node Is NotReady¶
Category: Incident Triage
Starting Question: "A Kubernetes node is in NotReady state — what's wrong?"
Estimated traversal: 3-5 minutes
Domains: kubernetes, linux-performance, networking
The Tree¶
A Kubernetes node is in NotReady state — what's wrong?
(kubectl get nodes — STATUS shows NotReady)
│
├── How long has it been NotReady?
│ │
│ ├── <2 minutes (just became NotReady)
│ │ └── May be transient — wait one more minute, then investigate
│ │ `kubectl get nodes -w` — watch for recovery
│ │
│ └── >2 minutes (sustained NotReady) → proceed below
│
├── Check node conditions: `kubectl describe node <node-name>`
│ (look at the Conditions section)
│ │
│ ├── MemoryPressure = True
│ │ │
│ │ ├── SSH to node: `ssh <node-ip>`
│ │ │ `free -h`
│ │ │ `ps aux --sort=-%mem | head -10`
│ │ │ │
│ │ │ ├── A runaway process consuming memory
│ │ │ │ └── ✅ ACTION: Kill Runaway Process / Cordon and Drain Node
│ │ │ │
│ │ │ └── Lots of pods, cumulative memory over node capacity
│ │ │ └── ✅ ACTION: Cordon and Drain Node — Reschedule Pods
│ │ │
│ │ └── Check for OOM kills: `dmesg | grep -i "killed process\|oom"`
│ │ └── OOM kill events → ✅ ACTION: Evict High-Memory Pods / Drain Node
│ │
│ ├── DiskPressure = True
│ │ │
│ │ ├── SSH to node: `df -h`
│ │ │ │
│ │ │ ├── /var/lib/kubelet or /var/lib/containerd full
│ │ │ │ │
│ │ │ │ ├── Prune unused images: `crictl rmi --prune` or `docker image prune -a`
│ │ │ │ │
│ │ │ │ └── Dangling images / overlays: `du -sh /var/lib/containerd/io.containerd.snapshotter*`
│ │ │ │ └── ✅ ACTION: Prune Container Images and Stopped Containers
│ │ │ │
│ │ │ ├── /var/log full
│ │ │ │ `du -sh /var/log/*`
│ │ │ │ └── ✅ ACTION: Rotate / Truncate Logs
│ │ │ │
│ │ │ └── Root filesystem full (/)
│ │ │ `du -sh /* 2>/dev/null | sort -rh | head -10`
│ │ │ └── ✅ ACTION: Identify and Remove Large Files / Expand Volume
│ │ │
│ │ └── Only kubelet reports pressure, df looks OK?
│ │ `df -i` — check inodes
│ │ └── Inodes exhausted → ✅ ACTION: Clean Up Inode-Consuming Files (small files / sockets)
│ │
│ ├── PIDPressure = True
│ │ │
│ │ ├── `cat /proc/sys/kernel/pid_max` — what's the PID limit?
│ │ │ `ps aux | wc -l` — how many processes?
│ │ │ │
│ │ │ └── Near or over limit → fork bomb or runaway process
│ │ │ `ps aux --sort=-%cpu | head -20`
│ │ │ └── ✅ ACTION: Kill Runaway Process / Cordon Node
│ │ │
│ ├── NetworkUnavailable = True
│ │ │
│ │ ├── CNI plugin issue — check CNI pod on that node
│ │ │ `kubectl get pods -n kube-system -o wide | grep <node-name>`
│ │ │ `kubectl logs -n kube-system <cni-pod>`
│ │ │ │
│ │ │ ├── CNI pod crashlooping → ✅ ACTION: Restart CNI Pod / Reinstall CNI
│ │ │ │
│ │ │ └── CNI config mismatch → ✅ ACTION: Fix CNI ConfigMap
│ │ │
│ │ └── Can node reach the cluster network gateway?
│ │ `ssh <node> -- ping -c5 <api-server-ip>`
│ │ └── Cannot ping → cloud VPC / routing issue
│ │ → ⚠️ ESCALATION: Cloud Provider Network
│ │
│ └── All Conditions OK but still NotReady?
│ → Kubelet is alive but cannot communicate with API server
│ `ssh <node> -- systemctl status kubelet`
│ │
│ ├── kubelet is not running → ✅ ACTION: Restart Kubelet
│ │
│ └── kubelet running but API unreachable
│ `ssh <node> -- curl -k https://<api-server-ip>:6443/healthz`
│ │
│ ├── Cannot reach API server → network partition
│ │ └── ⚠️ ESCALATION: Cloud Provider / Network Team
│ │
│ └── API server reachable but kubelet cert expired
│ `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates`
│ └── ✅ ACTION: Rotate Kubelet Client Certificate
│
├── Check kernel for hardware / OS issues
│ `dmesg | tail -50` (look for hardware errors, kernel panics, filesystem errors)
│ │
│ ├── Kernel panic / BUG / OOPS
│ │ └── ⚠️ ESCALATION: Drain Node + Engage Infrastructure Team
│ │
│ ├── EXT4 / XFS filesystem errors
│ │ └── ✅ ACTION: Cordon, Drain, Run fsck (requires unmount)
│ │
│ └── Hardware errors (MCE / EDAC / disk errors)
│ └── ⚠️ ESCALATION: Replace Node (cloud: terminate instance)
│
└── Are pods being rescheduled?
`kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name>`
│
├── Pods still on node → they will be evicted after 5 min (the default `tolerationSeconds: 300` on the not-ready taint)
│ Expedite: ✅ ACTION: Cordon and Drain Node
│
└── Pods already evicted → check if they rescheduled successfully elsewhere
`kubectl get pods --all-namespaces | grep -v Running | grep -v Completed`
└── Stuck in Pending → check other nodes' capacity
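The Conditions branch at the top of the tree can be mechanized for a first pass. A minimal sketch, assuming the standard `kubectl describe node` table layout (Type in column 1, Status in column 2); `flag_conditions` is an illustrative helper name:

```shell
# flag_conditions: read `kubectl describe node <name>` output on stdin and
# print every pressure condition whose Status is True, plus a non-True Ready.
flag_conditions() {
  awk '
    /^Conditions:/ { in_cond = 1; next }
    in_cond && /^[A-Z][a-z]+:/ { in_cond = 0 }   # next top-level section ends the table
    in_cond && $1 ~ /^(MemoryPressure|DiskPressure|PIDPressure|NetworkUnavailable)$/ && $2 == "True" {
      print $1 " is True -> follow that branch of the tree"
    }
    in_cond && $1 == "Ready" && $2 != "True" {
      print "Ready is " $2 " -> kubelet is not reporting healthy"
    }'
}
# Usage: kubectl describe node <node-name> | flag_conditions
```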
Node Details¶
Check 1: Node conditions¶
Command: kubectl describe node <node-name> | grep -A20 "Conditions:"
What you're looking for: The five standard conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, Ready. Any showing True when it should be False (or vice versa for Ready) indicates the root cause.
Common pitfall: The conditions reflect what kubelet last reported. If kubelet itself is down, conditions may be stale. Check the LastHeartbeatTime — if it's more than 40s ago, kubelet is not communicating.
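The staleness check reduces to date arithmetic on the LastHeartbeatTime value. A sketch assuming GNU `date`; the 40s cutoff is the controller-manager's `node-monitor-grace-period` default, and `heartbeat_age` is an illustrative helper:

```shell
# heartbeat_age: seconds since a LastHeartbeatTime value. kubectl prints
# timestamps like "Mon, 02 Jan 2006 15:04:05 +0000"; GNU date parses both
# that format and RFC3339.
heartbeat_age() {
  echo $(( $(date +%s) - $(date -d "$1" +%s) ))
}

# Stale if older than the 40s node-monitor-grace-period default:
age=$(heartbeat_age "Mon, 02 Jan 2006 15:04:05 +0000")
if [ "$age" -gt 40 ]; then
  echo "heartbeat is ${age}s old -> kubelet has stopped reporting"
fi
```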
Check 2: Kubelet status¶
Command: systemctl status kubelet and journalctl -u kubelet --since "10 minutes ago" -n 100
What you're looking for: "Active: running" vs failed/inactive. In journal: "Failed to sync node status", "certificate expired", "connection refused to API server".
Common pitfall: kubelet may be running but unable to reach the API server. Check both the process status AND whether it can connect: curl -k https://<api-server>:6443/healthz from the node.
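To scan a captured journal for the signatures above, a rough filter can help; the pattern list is illustrative, not exhaustive, and `kubelet_log_triage` is a made-up name:

```shell
# kubelet_log_triage: count kubelet journal lines (stdin) matching the
# common NotReady failure signatures, then suggest the next step.
kubelet_log_triage() {
  if grep -ciE "certificate has expired|connection refused|failed to sync node status|x509"
  then echo "-> matches found; read the surrounding lines with journalctl -u kubelet"
  else echo "-> none of the common signatures; widen the search"
  fi
}
# Usage: journalctl -u kubelet --since "10 minutes ago" | kubelet_log_triage
```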
Check 3: Disk pressure — inode exhaustion¶
Command: df -i — this shows inode usage, not block usage. A filesystem can be 30% full by blocks but 100% full by inodes.
What you're looking for: Any filesystem at 100% IUse%. Common cause: tens of thousands of small files (container logs per rotation, sockets, pid files).
Common pitfall: df -h shows plenty of space but the node is in DiskPressure. Always check df -i as a second step.
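The second step can be scripted against `df -i` output. A sketch assuming the GNU coreutils column layout (IUse% in column 5, mountpoint in column 6); `inode_check` is an illustrative helper:

```shell
# inode_check: read `df -i` output on stdin and list filesystems whose
# IUse% meets a threshold (default 95).
inode_check() {
  awk -v t="${1:-95}" 'NR > 1 {
    pct = $5; sub(/%/, "", pct)            # IUse% column, e.g. "100%"
    if (pct + 0 >= t) print $6 " inodes at " $5 " (" $3 " of " $2 " used)"
  }'
}
# Usage: df -i | inode_check
```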
Check 4: Container/image disk usage¶
Command: For containerd: crictl images | awk '{print $4, $1}' | sort -rh | head (size plus image name, so the largest images are identifiable). For docker: docker system df. For raw overlay: du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/.
What you're looking for: Large number of stopped containers, unused image layers, or dangling volumes accumulating.
Common pitfall: Kubernetes garbage collects images, but only when disk usage exceeds imageGCHighThresholdPercent (default: 85%). Nodes can be in DiskPressure before GC kicks in.
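The GC interaction reduces to a comparison against `imageGCHighThresholdPercent`. A sketch using the kubelet defaults of 85/80; `gc_headroom` is a hypothetical helper:

```shell
# gc_headroom: given a df "Use%" value and the image GC thresholds, report
# whether kubelet image GC should already be deleting images.
gc_headroom() {
  use=${1%"%"}                  # accept "92" or "92%"
  high="${2:-85}"               # imageGCHighThresholdPercent default
  low="${3:-80}"                # imageGCLowThresholdPercent default
  if [ "$use" -ge "$high" ]; then
    echo "usage ${use}% >= ${high}% -> image GC should be deleting down to ${low}%"
  else
    echo "usage ${use}% < ${high}% -> image GC idle; pressure is from something GC cannot remove"
  fi
}
```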
Check 5: Kernel messages¶
Command: dmesg -T | tail -100 | grep -iE "error|warn|oom|panic|bug|killed|fail"
What you're looking for: Hardware memory errors (ECC/MCE), filesystem corruption (EXT4-fs error), kernel panics, NVMe/disk errors, or OOM kill events.
Common pitfall: dmesg entries are timestamped relative to boot or wall-clock (with -T). Correlate with the time the node went NotReady to distinguish old errors from current issues.
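The time correlation can be scripted. A sketch that keeps only `dmesg -T` lines whose wall-clock timestamp falls at or after a cutoff (for example, when the node went NotReady); assumes GNU `date`, and `dmesg_since` is an illustrative name:

```shell
# dmesg_since: filter `dmesg -T` lines (stdin) to those timestamped at or
# after the cutoff passed as $1. Lines without a parsable timestamp are skipped.
dmesg_since() {
  cutoff=$(date -d "$1" +%s)
  while IFS= read -r line; do
    ts=${line#\[}; ts=${ts%%]*}                    # "[Tue Jun  4 10:30:00 2024] ..."
    when=$(date -d "$ts" +%s 2>/dev/null) || continue
    [ "$when" -ge "$cutoff" ] && printf '%s\n' "$line"
  done
}
# Usage: dmesg -T | dmesg_since "2024-06-04 10:00:00"
```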
Check 6: CNI health¶
Command: kubectl get pods -n kube-system -o wide | grep -E "calico|cilium|flannel|weave" | grep <node-name> — find the CNI agent pod on the affected node, then kubectl logs -n kube-system <cni-pod>.
What you're looking for: CNI plugin crash logs, "route not found", "interface not found", IPAM errors.
Common pitfall: CNI pods run as DaemonSets — a NotReady node can cause its own CNI pod to be evicted or stuck, creating a chicken-and-egg. You may need to SSH to the node and inspect the CNI manually.
Terminal Actions¶
Action: Restart Kubelet¶
Do:
1. SSH to the node: ssh ubuntu@<node-ip> or use your cloud provider's session manager
2. sudo systemctl restart kubelet
3. Monitor: sudo journalctl -u kubelet -f — watch for "Successfully registered node"
4. Back in kubectl: kubectl get nodes -w — watch for Ready status
Verify: kubectl get node <name> shows STATUS = Ready within 30-60 seconds.
Runbook: node_not_ready.md
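Step 4's watch can be made a bounded poll so the action fails loudly instead of hanging. A generic sketch; in real use the probe command would be `kubectl get node <name> --no-headers | grep -qw Ready`, and `wait_ready` is an illustrative name:

```shell
# wait_ready: run a probe command once per second until it succeeds or
# the attempt budget is spent. Returns 0 on success, 1 on timeout.
wait_ready() {
  attempts="$1"; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then echo "ready after $((i + 1)) checks"; return 0; fi
    i=$((i + 1))
    sleep 1
  done
  echo "still NotReady after ${attempts} checks" >&2
  return 1
}
```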
Action: Cordon and Drain Node¶
Do:
1. Prevent new pods from scheduling: kubectl cordon <node-name>
2. Evict existing pods gracefully: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
3. Investigate the root cause (disk, memory, kernel)
4. After fixing: kubectl uncordon <node-name> to re-enable scheduling
Verify: kubectl get pods -o wide | grep <node-name> shows no non-DaemonSet pods remaining.
Action: Prune Container Images and Stopped Containers¶
Do:
1. sudo crictl rmi --prune — remove unused images (containerd)
2. sudo crictl rm $(sudo crictl ps -a -q --state exited) — remove stopped containers
3. Check space freed: df -h /var/lib/containerd
4. If still full: sudo du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head
Verify: df -h shows filesystem below 70%. kubectl get node <name> transitions to Ready.
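The verify step can be made mechanical. A sketch that checks one mountpoint's Use% from `df -h` output against the 70% target; `verify_usage` is an illustrative helper:

```shell
# verify_usage: read `df -h` output on stdin; succeed only if the given
# mountpoint is below the target Use% (default 70).
verify_usage() {
  awk -v m="$1" -v t="${2:-70}" '
    $NF == m { pct = $(NF-1); sub(/%/, "", pct)
               if (pct + 0 < t) { print m " at " pct "% (< " t "%) -> OK"; ok = 1 }
               else             { print m " still at " pct "% (>= " t "%)" } }
    END { exit ok ? 0 : 1 }'
}
# Usage: df -h | verify_usage /var/lib/containerd 70
```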
Action: Rotate / Truncate Logs¶
Do:
1. Check largest logs: sudo du -sh /var/log/* | sort -rh | head -10
2. Force log rotation: sudo logrotate -f /etc/logrotate.conf
3. For the systemd journal: sudo journalctl --vacuum-size=500M
4. Truncate a specific large log: sudo truncate -s 0 /var/log/syslog (only after you have reviewed what you are discarding)
Verify: df -h shows filesystem below 70%.
Action: Kill Runaway Process / Cordon Node¶
Do:
1. Identify: ps aux --sort=-%mem | head -5 or ps aux --sort=-%cpu | head -5
2. Kill gracefully: sudo kill -15 <pid>
3. If unresponsive: sudo kill -9 <pid>
4. Cordon the node while investigating root cause: kubectl cordon <node-name>
Verify: free -h shows memory recovery. Node transitions to Ready if it was MemoryPressure only.
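Steps 2-3 (graceful kill, then escalate) can be combined into one bounded loop. A sketch; `term_then_kill` is an illustrative name, and the grace period is a parameter:

```shell
# term_then_kill: send SIGTERM, wait up to $2 seconds for the process to
# exit, then escalate to SIGKILL if it is still alive.
term_then_kill() {
  pid="$1"; grace="${2:-10}"
  kill -15 "$pid" 2>/dev/null
  i=0
  while [ "$i" -lt "$grace" ] && kill -0 "$pid" 2>/dev/null; do
    sleep 1; i=$((i + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    echo "pid $pid ignored SIGTERM, sending SIGKILL"
    kill -9 "$pid"
  else
    echo "pid $pid exited on SIGTERM"
  fi
}
```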
Action: Rotate Kubelet Client Certificate¶
Do:
1. Check cert expiry: sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
2. On kubeadm clusters: sudo kubeadm certs renew all (renews control-plane certificates; the kubelet client cert itself rotates via the CSR flow in step 3)
3. Or approve pending CSR: kubectl get csr — find pending CSR for the node, then kubectl certificate approve <csr-name>
4. Restart kubelet: sudo systemctl restart kubelet
Verify: kubectl get node <name> shows Ready. openssl x509 -noout -dates shows future NotAfter.
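The expiry check in step 1 reduces to date arithmetic on the certificate's NotAfter field. A sketch assuming GNU `date` and `openssl`; in real use the path would be /var/lib/kubelet/pki/kubelet-client-current.pem, and `cert_days_left` is an illustrative name:

```shell
# cert_days_left: print whole days until the PEM certificate at $1 expires
# (negative means already expired).
cert_days_left() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}
```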
Escalation: Cloud Provider Network¶
When: Node is running but cannot reach the API server, ping fails to other nodes, CNI is healthy.
Who: Cloud infrastructure team, cloud provider support
Include in page: Node name, private IP, VPC/subnet, ip route show output from node, cloud provider network event log
Escalation: Drain Node + Engage Infrastructure Team¶
When: Kernel panic, BUG, or hardware MCE errors in dmesg.
Who: Infrastructure / platform SRE team
Include in page: dmesg | tail -200 output, node name, instance type, cloud provider AZ, time of first error
Edge Cases¶
- Cloud spot/preemptible instance terminated: The node was terminated by the cloud provider. Check cloud console for "Spot interruption" events. This is expected behavior — verify cluster autoscaler replaces the node.
- Node NotReady after etcd issues: If etcd is degraded, the API server may not respond, causing nodes to appear NotReady even though they're healthy. Check etcd health before blaming nodes.
- NotReady after upgrade: A Kubernetes version upgrade may require a kubelet restart or config update. Check that kubelet --version on the node matches the API server version (±1 minor version).
- Network plugin (CNI) upgrade caused NotReady: During a CNI upgrade, the DaemonSet rolls out across nodes and each node is briefly NotReady. This is expected and should self-resolve — do not intervene unless a node is stuck >5 min.
- All nodes NotReady simultaneously: This is a control plane issue, not a node issue. Check API server and etcd health first.
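That last discriminator is a simple ratio over `kubectl get nodes` output. A sketch; `notready_ratio` is an illustrative helper, and a cordoned node (Ready,SchedulingDisabled) still counts as Ready:

```shell
# notready_ratio: read `kubectl get nodes --no-headers` output on stdin and
# decide whether this looks like a per-node or control-plane problem.
notready_ratio() {
  awk '{ total++; if ($2 !~ /^Ready/) bad++ }
       END {
         if (bad == total && total > 1) verdict = "suspect control plane / etcd first"
         else verdict = "triage the affected node(s) individually"
         printf "%d/%d NotReady -> %s\n", bad, total, verdict
       }'
}
# Usage: kubectl get nodes --no-headers | notready_ratio
```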
Cross-References¶
- Topic Packs: k8s-node-lifecycle, k8s-ops, linux-performance, linux-ops-storage, networking
- Runbooks: node_not_ready.md, pod_eviction.md, oomkilled.md