Symptoms: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook
Domains: kubernetes_ops | datacenter_ops | devops_tooling
Level: L3
Estimated time: 45 min
Initial Alert
Kubernetes node alert fires at 22:14 UTC:
CRITICAL: KubeNodeNotReady
node: worker-node-07.prod.internal
condition: Ready=False
duration: 5m
message: "Kubelet stopped posting node status"
Followed by:
WARNING: KubeNodeUnreachable — worker-node-07 unreachable for 3m
WARNING: 12 pods on worker-node-07 in Unknown state
CRITICAL: worker-node-07 — ICMP ping loss 100% from monitoring host
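The `Ready=False` / "stopped posting node status" pair comes from the node's `status.conditions` list (visible via `kubectl get node worker-node-07.prod.internal -o json`). A minimal sketch of how those conditions map to what `kubectl get nodes` prints; the helper `node_ready_status` is hypothetical, not part of any Kubernetes client library:

```python
def node_ready_status(conditions):
    """Classify a node from its status.conditions list.

    Returns "Ready", "NotReady", or "Unknown". The node controller
    sets the Ready condition's status to "Unknown" when the kubelet
    stops posting node status, which is the case in this incident.
    """
    for cond in conditions:
        if cond["type"] == "Ready":
            if cond["status"] == "True":
                return "Ready"
            if cond["status"] == "Unknown":
                return "Unknown"  # kubelet silent; node controller took over
            return "NotReady"
    # No Ready condition reported at all
    return "Unknown"
```

For example, `node_ready_status([{"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"}])` returns `"Unknown"`, which `kubectl` renders as `NotReady` in its STATUS column.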
Observable Symptoms
- `kubectl get nodes` shows worker-node-07 as `NotReady` for 5+ minutes.
- All pods on the node are in `Unknown` state; Kubernetes begins evicting pods after the 5-minute `pod-eviction-timeout`.
- ICMP pings from the monitoring server to worker-node-07 (10.0.4.27) fail 100%.
- IPMI/iLO console shows the node is powered on and the OS is running.
- `dmesg` on the iLO console shows no kernel panics.
- The node had been healthy for 3 weeks since its last reboot, and there were no recent deployments to it.
- Two other nodes (worker-node-03 and worker-node-11) experienced the same issue in the past 7 days but recovered after a reboot.
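Taken together, these symptoms already narrow the fault domain: power is fine (IPMI), the kernel is fine (console, `dmesg`), yet the network path is dead. A rough decision-table sketch of that triage logic; the `triage` function and its boolean inputs are illustrative assumptions gathered out-of-band (IPMI console, monitoring host), not an existing tool:

```python
def triage(powered_on: bool, os_running: bool, ping_ok: bool,
           kubelet_posting: bool) -> str:
    """Map the out-of-band observations above to a coarse fault domain.

    Checks are ordered from lowest layer up: power, then kernel/OS,
    then the network path, then the kubelet/control-plane link.
    """
    if not powered_on:
        return "power"
    if not os_running:
        return "kernel/os"        # e.g. a panic visible on the console
    if not ping_ok:
        return "network path"     # NIC, cable, switch port, or firmware
    if not kubelet_posting:
        return "kubelet/control-plane"
    return "healthy"
```

For worker-node-07, `triage(powered_on=True, os_running=True, ping_ok=False, kubelet_posting=False)` lands on `"network path"`, and the recurrence on worker-node-03 and worker-node-11 hints the problem is in something those nodes share rather than in any one cable.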
The Misleading Signal
A node going NotReady with 100% ping loss looks like a straightforward network outage or node crash. The engineer's instinct is to check Kubernetes node conditions, network cables, switch ports, and potentially reboot the node. The fact that the node is powered on and the OS is running (per IPMI) makes it look like a network cable or switch issue. The pattern of "works for weeks, then drops" suggests an intermittent hardware failure.