
Decision Tree: Node Is NotReady

Category: Incident Triage
Starting Question: "A Kubernetes node is in NotReady state — what's wrong?"
Estimated traversal: 3-5 minutes
Domains: kubernetes, linux-performance, networking


The Tree

A Kubernetes node is in NotReady state — what's wrong?
(kubectl get nodes — STATUS shows NotReady)

├── How long has it been NotReady?
│   │
│   ├── <2 minutes (just became NotReady)
│   │   └── May be transient — wait one more minute, then investigate
│   │       `kubectl get nodes -w` — watch for recovery
│   │
│   └── >2 minutes (sustained NotReady) → proceed below

├── Check node conditions: `kubectl describe node <node-name>`
│   (look at the Conditions section)
│   │
│   ├── MemoryPressure = True
│   │   │
│   │   ├── SSH to node: `ssh <node-ip>`
│   │   │   `free -h`
│   │   │   `ps aux --sort=-%mem | head -10`
│   │   │   │
│   │   │   ├── A runaway process consuming memory
│   │   │   │   └── ✅ ACTION: Kill Runaway Process / Cordon and Drain Node
│   │   │   │
│   │   │   └── Lots of pods, cumulative memory over node capacity
│   │   │       └── ✅ ACTION: Cordon and Drain Node — Reschedule Pods
│   │   │
│   │   └── Check for OOM kills: `dmesg | grep -i "killed process\|oom"`
│   │       └── OOM kill events → ✅ ACTION: Evict High-Memory Pods / Drain Node
│   │
│   ├── DiskPressure = True
│   │   │
│   │   ├── SSH to node: `df -h`
│   │   │   │
│   │   │   ├── /var/lib/kubelet or /var/lib/containerd full
│   │   │   │   │
│   │   │   │   ├── Prune unused images: `crictl rmi --prune` or `docker image prune -a`
│   │   │   │   │
│   │   │   │   └── Dangling images / overlays: `du -sh /var/lib/containerd/io.containerd.snapshotter*`
│   │   │   │       └── ✅ ACTION: Prune Container Images and Stopped Containers
│   │   │   │
│   │   │   ├── /var/log full
│   │   │   │   `du -sh /var/log/*`
│   │   │   │   └── ✅ ACTION: Rotate / Truncate Logs
│   │   │   │
│   │   │   └── Root filesystem full (/)
│   │   │       `du -sh /* 2>/dev/null | sort -rh | head -10`
│   │   │       └── ✅ ACTION: Identify and Remove Large Files / Expand Volume
│   │   │
│   │   └── Only kubelet reports pressure, df looks OK?
│   │       `df -i` — check inodes
│   │       └── Inodes exhausted → ✅ ACTION: Clean Up Inode-Consuming Files (small files / sockets)
│   │
│   ├── PIDPressure = True
│   │   │
│   │   ├── `cat /proc/sys/kernel/pid_max` — what's the PID limit?
│   │   │   `ps aux | wc -l` — how many processes?
│   │   │   └── Near or over limit → fork bomb or runaway process
│   │   │       `ps aux --sort=-%cpu | head -20`
│   │   │       └── ✅ ACTION: Kill Runaway Process / Cordon Node
│   │
│   ├── NetworkUnavailable = True
│   │   │
│   │   ├── CNI plugin issue — check CNI pod on that node
│   │   │   `kubectl get pods -n kube-system -o wide | grep <node-name>`
│   │   │   `kubectl logs -n kube-system <cni-pod>`
│   │   │   │
│   │   │   ├── CNI pod crashlooping → ✅ ACTION: Restart CNI Pod / Reinstall CNI
│   │   │   └── CNI config mismatch → ✅ ACTION: Fix CNI ConfigMap
│   │   │
│   │   └── Can node reach the cluster network gateway?
│   │       `ssh <node> -- ping -c5 <api-server-ip>`
│   │       └── Cannot ping → cloud VPC / routing issue
│   │           └── ⚠️ ESCALATION: Cloud Provider Network
│   │
│   └── All Conditions OK but still NotReady?
│       (kubelet may be alive but unable to communicate with the API server)
│       `ssh <node> -- systemctl status kubelet`
│       │
│       ├── kubelet is not running → ✅ ACTION: Restart Kubelet
│       └── kubelet running but API unreachable
│           `ssh <node> -- curl -k https://<api-server-ip>:6443/healthz`
│           │
│           ├── Cannot reach API server → network partition
│           │   └── ⚠️ ESCALATION: Cloud Provider / Network Team
│           └── API server reachable but kubelet cert expired
│               `openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates`
│               └── ✅ ACTION: Rotate Kubelet Client Certificate

├── Check kernel for hardware / OS issues
│   `dmesg | tail -50` (look for hardware errors, kernel panics, filesystem errors)
│   │
│   ├── Kernel panic / BUG / OOPS
│   │   └── ⚠️ ESCALATION: Drain Node + Engage Infrastructure Team
│   │
│   ├── EXT4 / XFS filesystem errors
│   │   └── ✅ ACTION: Cordon, Drain, Run fsck (requires unmount)
│   │
│   └── Hardware errors (MCE / EDAC / disk errors)
│       └── ⚠️ ESCALATION: Replace Node (cloud: terminate instance)

└── Are pods being rescheduled?
    `kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name>`
    │
    ├── Pods still on node → they will be evicted after 5 min (default tolerations)
    │   Expedite: ✅ ACTION: Cordon and Drain Node
    └── Pods already evicted → check if they rescheduled successfully elsewhere
        `kubectl get pods --all-namespaces | grep -v Running | grep -v Completed`
        └── Stuck in Pending → check other nodes' capacity
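The condition-routing at the top of the tree can be sketched as a first-pass script. The function and node names here are illustrative, not part of any tool, and kubectl access is assumed:

```shell
#!/usr/bin/env bash
# Map one "Type=Status" condition pair to the branch of the tree to follow.
# Pure function, so the routing can be exercised without a cluster.
triage_branch() {
  case "$1=$2" in
    MemoryPressure=True)       echo "memory: free -h; ps aux --sort=-%mem | head" ;;
    DiskPressure=True)         echo "disk: df -h; df -i" ;;
    PIDPressure=True)          echo "pids: cat /proc/sys/kernel/pid_max; ps aux | wc -l" ;;
    NetworkUnavailable=True)   echo "network: check the CNI pod on the node" ;;
    Ready=False|Ready=Unknown) echo "kubelet: systemctl status kubelet on the node" ;;
    *)                         echo "ok" ;;
  esac
}

# Pull live conditions and print the branches that need attention.
triage_node() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}' |
  while IFS='=' read -r type status; do
    branch="$(triage_branch "$type" "$status")"
    if [ "$branch" != "ok" ]; then echo "[$1] $type=$status -> $branch"; fi
  done
}

# Usage: triage_node <node-name>
```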

Node Details

Check 1: Node conditions

Command: `kubectl describe node <node-name> | grep -A20 "Conditions:"`
What you're looking for: The five standard conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, Ready. Any showing True when it should be False (or vice versa for Ready) indicates the root cause.
Common pitfall: The conditions reflect what kubelet last reported. If kubelet itself is down, conditions may be stale. Check the LastHeartbeatTime — if it's more than 40s ago, kubelet is not communicating.
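The staleness check from the pitfall above can be sketched as below. kubelet normally posts node status roughly every 10s, so a heartbeat older than ~40s means the conditions you are reading are stale. Helper names are illustrative, and GNU `date -d` (Linux) is assumed:

```shell
heartbeat_age_seconds() {
  # $1: an RFC3339 timestamp as shown in LastHeartbeatTime
  echo $(( $(date +%s) - $(date -d "$1" +%s) ))
}

stale_heartbeat() {
  # succeeds (exit 0) when the heartbeat is older than 40 seconds
  [ "$(heartbeat_age_seconds "$1")" -gt 40 ]
}

# Fetch the Ready condition's heartbeat straight from the API:
#   kubectl get node <node-name> -o \
#     jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'
```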

Check 2: Kubelet status

Command: `systemctl status kubelet` and `journalctl -u kubelet --since "10 minutes ago" -n 100`
What you're looking for: "Active: running" vs failed/inactive. In the journal: "Failed to sync node status", "certificate expired", "connection refused to API server".
Common pitfall: kubelet may be running but unable to reach the API server. Check both the process status AND whether it can connect: `curl -k https://<api-server>:6443/healthz` from the node.
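Grepping the journal for the signatures listed above can be sketched as a small filter. It reads log text on stdin, so the patterns themselves can be checked against canned lines without a node:

```shell
kubelet_failure_hint() {
  # stdin: `journalctl -u kubelet` output; prints each known signature found
  grep -oiE 'certificate (has )?expired|connection refused|failed to sync node status' |
    tr '[:upper:]' '[:lower:]' | sort -u
}

# Usage: journalctl -u kubelet --since "10 minutes ago" | kubelet_failure_hint
```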

Check 3: Disk pressure — inode exhaustion

Command: `df -i` — this shows inode usage, not block usage. A filesystem can be 30% full by blocks but 100% full by inodes.
What you're looking for: Any filesystem at 100% IUse%. Common cause: tens of thousands of small files (container logs per rotation, sockets, pid files).
Common pitfall: `df -h` shows plenty of space but the node is in DiskPressure. Always check `df -i` as a second step.
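Once `df -i` confirms exhaustion, the next question is where the inodes went. A sketch using GNU `du --inodes` (coreutils >= 8.22), which counts files rather than blocks and so surfaces directories full of tiny files that `du -sh` would rank low:

```shell
top_inode_dirs() {
  # $1: directory to scan (one level deep); prints the 10 biggest consumers
  du --inodes -x -d1 "${1:-/var}" 2>/dev/null | sort -rn | head -10
}

# Usage: top_inode_dirs /var/lib/kubelet
```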

Check 4: Container/image disk usage

Command: For containerd: `crictl images | awk '{print $4}' | sort -rh | head`. For docker: `docker system df`. For raw overlay: `du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/`.
What you're looking for: Large number of stopped containers, unused image layers, or dangling volumes accumulating.
Common pitfall: Kubernetes garbage collects images, but only when disk usage exceeds imageGCHighThresholdPercent (default: 85%). Nodes can be in DiskPressure before GC kicks in.
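Totalling image sizes from `crictl images` output can be sketched as below. It assumes the default table layout (IMAGE, TAG, IMAGE ID, SIZE) with a human-readable size like "25.8MB" in the fourth column:

```shell
image_bytes_total() {
  awk 'NR > 1 {
    size = $4; mult = 1
    if (size ~ /GB$/)      mult = 1024 * 1024 * 1024
    else if (size ~ /MB$/) mult = 1024 * 1024
    else if (size ~ /kB$/) mult = 1024
    gsub(/[A-Za-z]/, "", size)   # strip the unit suffix
    total += size * mult
  } END { printf "%.0f\n", total }'
}

# Usage: crictl images | image_bytes_total
```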

Check 5: Kernel messages

Command: `dmesg -T | tail -100 | grep -iE "error|warn|oom|panic|bug|killed|fail"`
What you're looking for: Hardware memory errors (ECC/MCE), filesystem corruption (EXT4-fs error), kernel panics, NVMe/disk errors, or OOM kill events.
Common pitfall: dmesg entries are timestamped relative to boot or wall-clock (with -T). Correlate with the time the node went NotReady to distinguish old errors from current issues.
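Correlating dmesg with the NotReady timestamp, per the pitfall above, can be sketched as a filter over `dmesg -T` output. GNU `date -d` is assumed for parsing the bracketed timestamps:

```shell
dmesg_since() {
  # $1: cutoff, e.g. "2024-05-01 14:00"; stdin: `dmesg -T` output
  local cutoff ts epoch line
  cutoff="$(date -d "$1" +%s)"
  while IFS= read -r line; do
    ts="${line#[}"; ts="${ts%%]*}"
    epoch="$(date -d "$ts" +%s 2>/dev/null)" || continue
    if [ "$epoch" -ge "$cutoff" ]; then printf '%s\n' "$line"; fi
  done
}

# Usage: dmesg -T | dmesg_since "$(date -d '30 minutes ago' '+%F %H:%M')"
```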

Check 6: CNI health

Command: `kubectl get pods -n kube-system -o wide | grep -E "calico|cilium|flannel|weave" | grep <node-name>` — find the CNI agent pod on the affected node, then `kubectl logs -n kube-system <cni-pod>`.
What you're looking for: CNI plugin crash logs, "route not found", "interface not found", IPAM errors.
Common pitfall: CNI pods run as DaemonSets — a NotReady node can cause its own CNI pod to be evicted or stuck, creating a chicken-and-egg problem. You may need to SSH to the node and inspect the CNI manually.


Terminal Actions

Action: Restart Kubelet

Do:
1. SSH to the node: `ssh ubuntu@<node-ip>`, or use your cloud provider's session manager
2. `sudo systemctl restart kubelet`
3. Monitor: `sudo journalctl -u kubelet -f` — watch for "Successfully registered node"
4. Back in kubectl: `kubectl get nodes -w` — watch for Ready status
Verify: `kubectl get node <name>` shows STATUS = Ready within 30-60 seconds.
Runbook: node_not_ready.md

Action: Cordon and Drain Node

Do:
1. Prevent new pods from scheduling: `kubectl cordon <node-name>`
2. Evict existing pods gracefully: `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60`
3. Investigate the root cause (disk, memory, kernel)
4. After fixing: `kubectl uncordon <node-name>` to re-enable scheduling
Verify: `kubectl get pods -o wide | grep <node-name>` shows no non-DaemonSet pods remaining.
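The verify step can be sketched as below: a node is drained when no pod whose owner is something other than a DaemonSet remains on it. The counting helper is pure so the logic can be checked offline; function names are illustrative and kubectl access is assumed:

```shell
non_daemonset_count() {
  # stdin: one owner kind per line (Deployment, DaemonSet, StatefulSet, ...)
  grep -cv '^DaemonSet$' || true
}

pods_left_on_node() {
  kubectl get pods --all-namespaces \
    --field-selector=spec.nodeName="$1" \
    -o jsonpath='{range .items[*]}{.metadata.ownerReferences[0].kind}{"\n"}{end}' |
  non_daemonset_count
}

# Usage: until [ "$(pods_left_on_node <node-name>)" = "0" ]; do sleep 5; done
```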

Action: Prune Container Images and Stopped Containers

Do:
1. `sudo crictl rmi --prune` — remove unused images (containerd)
2. `sudo crictl rm $(sudo crictl ps -a -q --state exited)` — remove stopped containers
3. Check space freed: `df -h /var/lib/containerd`
4. If still full: `sudo du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head`
Verify: `df -h` shows the filesystem below 70%. `kubectl get node <name>` transitions to Ready.

Action: Rotate / Truncate Logs

Do:
1. Check largest logs: `sudo du -sh /var/log/* | sort -rh | head -10`
2. Force log rotation: `sudo logrotate -f /etc/logrotate.conf`
3. For the systemd journal: `sudo journalctl --vacuum-size=500M`
4. Truncate a specific large log: `sudo truncate -s 0 /var/log/syslog` (only if you've read it)
Verify: `df -h` shows the filesystem below 70%.

Action: Kill Runaway Process / Cordon Node

Do:
1. Identify: `ps aux --sort=-%mem | head -5` or `ps aux --sort=-%cpu | head -5`
2. Kill gracefully: `sudo kill -15 <pid>`
3. If unresponsive: `sudo kill -9 <pid>`
4. Cordon the node while investigating the root cause: `kubectl cordon <node-name>`
Verify: `free -h` shows memory recovery. The node transitions to Ready if MemoryPressure was the only failing condition.
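Steps 2-3 above can be folded into one helper: send TERM first, then KILL only if the process is still alive after a grace period. A sketch; the helper name and default grace are illustrative:

```shell
kill_gracefully() {
  local pid="$1" grace="${2:-10}"
  kill -15 "$pid" 2>/dev/null || return 0   # already gone
  sleep "$grace"
  # -0 sends no signal; it just tests whether the process still exists
  if kill -0 "$pid" 2>/dev/null; then kill -9 "$pid"; fi
}

# Usage: sudo bash -c 'kill_gracefully <pid> 10'
```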

Action: Rotate Kubelet Client Certificate

Do:
1. Check cert expiry: `sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates`
2. If expired, renew: `sudo kubeadm certs renew all` (kubeadm clusters)
3. Or approve a pending CSR: `kubectl get csr` — find the pending CSR for the node, then `kubectl certificate approve <csr-name>`
4. Restart kubelet: `sudo systemctl restart kubelet`
Verify: `kubectl get node <name>` shows Ready. `openssl x509 -noout -dates` shows a future NotAfter.
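The expiry check can be turned into a days-remaining number, handy for alerting before the cert lapses. A sketch reading a PEM cert on stdin (openssl reads stdin when no `-in` is given); GNU `date -d` is assumed:

```shell
cert_days_left() {
  local end
  # enddate prints e.g. "notAfter=May  1 12:00:00 2026 GMT"
  end="$(openssl x509 -noout -enddate | cut -d= -f2)"
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# Usage:
#   sudo cat /var/lib/kubelet/pki/kubelet-client-current.pem | cert_days_left
```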

Escalation: Cloud Provider Network

When: Node is running but cannot reach the API server, ping fails to other nodes, CNI is healthy.
Who: Cloud infrastructure team, cloud provider support
Include in page: Node name, private IP, VPC/subnet, `ip route show` output from the node, cloud provider network event log

Escalation: Drain Node + Engage Infrastructure Team

When: Kernel panic, BUG, or hardware MCE errors in dmesg.
Who: Infrastructure / platform SRE team
Include in page: `dmesg | tail -200` output, node name, instance type, cloud provider AZ, time of first error


Edge Cases

  • Cloud spot/preemptible instance terminated: The node was terminated by the cloud provider. Check cloud console for "Spot interruption" events. This is expected behavior — verify cluster autoscaler replaces the node.
  • Node NotReady after etcd issues: If etcd is degraded, the API server may not respond, causing nodes to appear NotReady even though they're healthy. Check etcd health before blaming nodes.
  • NotReady after upgrade: A Kubernetes version upgrade may require a kubelet restart or config update. Check that kubelet --version on the node is compatible with the API server: kubelet must never be newer than the API server, and may only lag by the supported version skew (two minor versions historically, three since Kubernetes 1.28).
  • Network plugin (CNI) upgrade caused NotReady: During CNI upgrade, the DaemonSet rolls out across nodes. Each node is briefly NotReady. This is expected and should self-resolve — do not intervene unless a node is stuck >5 min.
  • All nodes NotReady simultaneously: This is a control plane issue, not a node issue. Check API server and etcd health first.
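A quick discriminator for the last edge case can be sketched as below: count NotReady nodes from `kubectl get nodes --no-headers` output. The helper is pure (stdin in, fraction out), so it is testable without a cluster:

```shell
notready_fraction() {
  # prints "notready/total"; "Ready,SchedulingDisabled" counts as Ready
  awk '{ total++; if ($2 !~ /^Ready/) nr++ } END { printf "%d/%d\n", nr, total }'
}

# Usage: kubectl get nodes --no-headers | notready_fraction
# N/N with N > 1 points at the API server or etcd, not the nodes.
```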

Cross-References