Symptoms: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook
Domains: kubernetes_ops | datacenter_ops | devops_tooling
Level: L3
Estimated time: 45 min
Initial Alert
Kubernetes node alert fires at 22:14 UTC:
CRITICAL: KubeNodeNotReady
node: worker-node-07.prod.internal
condition: Ready=False
duration: 5m
message: "Kubelet stopped posting node status"
Followed by:
WARNING: KubeNodeUnreachable — worker-node-07 unreachable for 3m
WARNING: 12 pods on worker-node-07 in Unknown state
CRITICAL: worker-node-07 — ICMP ping loss 100% from monitoring host
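The `Ready=False` / "stopped posting node status" pair comes from the node's `status.conditions` list (visible via `kubectl get node worker-node-07.prod.internal -o json`). A minimal sketch of how those conditions map to what `kubectl get nodes` prints; the helper `node_ready_status` is hypothetical, not part of any Kubernetes client library:

```python
def node_ready_status(conditions):
    """Classify a node from its status.conditions list.

    Returns "Ready", "NotReady", or "Unknown". The node controller
    sets the Ready condition's status to "Unknown" when the kubelet
    stops posting node status, which is the case in this incident.
    """
    for cond in conditions:
        if cond["type"] == "Ready":
            if cond["status"] == "True":
                return "Ready"
            if cond["status"] == "Unknown":
                return "Unknown"  # kubelet silent; node controller took over
            return "NotReady"
    # No Ready condition reported at all
    return "Unknown"
```

For example, `node_ready_status([{"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"}])` returns `"Unknown"`, which `kubectl` renders as `NotReady` in its STATUS column.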
Observable Symptoms
- `kubectl get nodes` shows worker-node-07 as `NotReady` for 5+ minutes.
- All pods on the node are in `Unknown` state; Kubernetes begins evicting pods after the 5-minute `pod-eviction-timeout`.
- ICMP pings from the monitoring server to worker-node-07 (10.0.4.27) fail 100%.
- IPMI/iLO console shows the node is powered on and the OS is running.
- `dmesg` on the iLO console shows no kernel panics.
- The node had been healthy for 3 weeks since its last reboot, and there were no recent deployments to it.
- Two other nodes (worker-node-03 and worker-node-11) experienced the same issue in the past 7 days but recovered after a reboot.
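Taken together, these symptoms already narrow the fault domain: power is fine (IPMI), the kernel is fine (console, `dmesg`), yet the network path is dead. A rough decision-table sketch of that triage logic; the `triage` function and its boolean inputs are illustrative assumptions gathered out-of-band (IPMI console, monitoring host), not an existing tool:

```python
def triage(powered_on: bool, os_running: bool, ping_ok: bool,
           kubelet_posting: bool) -> str:
    """Map the out-of-band observations above to a coarse fault domain.

    Checks are ordered from lowest layer up: power, then kernel/OS,
    then the network path, then the kubelet/control-plane link.
    """
    if not powered_on:
        return "power"
    if not os_running:
        return "kernel/os"        # e.g. a panic visible on the console
    if not ping_ok:
        return "network path"     # NIC, cable, switch port, or firmware
    if not kubelet_posting:
        return "kubelet/control-plane"
    return "healthy"
```

For worker-node-07, `triage(powered_on=True, os_running=True, ping_ok=False, kubelet_posting=False)` lands on `"network path"`, and the recurrence on worker-node-03 and worker-node-11 hints the problem is in something those nodes share rather than in any one cable.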
The Misleading Signal
A node going NotReady with 100% ping loss looks like a straightforward network outage or node crash. The engineer's instinct is to check Kubernetes node conditions, network cables, switch ports, and potentially reboot the node. The fact that the node is powered on and the OS is running (per IPMI) makes it look like a network cable or switch issue. The pattern of "works for weeks, then drops" suggests an intermittent hardware failure.