Investigation: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook¶
Phase 1: Kubernetes Investigation (Dead End)¶
Check node conditions:
$ kubectl describe node worker-node-07 | grep -A10 "Conditions"
Conditions:
Type Status LastHeartbeatTime Reason
---- ------ ----------------- ------
Ready Unknown 2026-03-19T22:09:14Z NodeStatusUnknown
MemoryPressure Unknown 2026-03-19T22:09:14Z NodeStatusUnknown
DiskPressure Unknown 2026-03-19T22:09:14Z NodeStatusUnknown
PIDPressure Unknown 2026-03-19T22:09:14Z NodeStatusUnknown
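When every condition reads Unknown, the node controller has marked them stale after missing kubelet heartbeats. Picking the stale conditions out of the `kubectl describe` output is a one-line awk filter; the sample below embeds the conditions as a variable so the snippet is self-contained, but on a live cluster you would pipe the real output in:

```shell
# Sample of the Conditions block from `kubectl describe node`;
# in practice: kubectl describe node worker-node-07 | awk ...
conditions='Ready Unknown
MemoryPressure Unknown
DiskPressure Unknown
PIDPressure Unknown'

# Print every condition whose status the node controller marked Unknown.
echo "$conditions" | awk '$2 == "Unknown" { print $1, "is stale" }'
```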
The last heartbeat was 5 minutes ago: the kubelet has stopped reporting, which is consistent with either a network partition or a kubelet crash. SSH to the node fails:
$ ssh worker-node-07.prod.internal
ssh: connect to host worker-node-07.prod.internal port 22: No route to host
Check from the monitoring network:
$ ping -c 3 10.0.4.27
PING 10.0.4.27 (10.0.4.27) 56(84) bytes of data.
--- 10.0.4.27 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2003ms
Node is completely unreachable on the network. Check the switch port:
# From the network switch
switch# show interface ethernet 1/27
Ethernet1/27 is up
Hardware: 25GbE, address: aa:bb:cc:dd:ee:27
Internet address is not set
MTU 9216 bytes
Speed 25 Gbps
Last link flap: 0d 00:05:12 ago
The switch port is up, and the link just flapped 5 minutes ago — exactly when the node went NotReady. This looks like a cable or NIC issue.
The Pivot¶
Access the node via IPMI/iLO out-of-band console:
# Via ipmitool
$ ipmitool -I lanplus -H 10.0.5.27 -U admin -P *** sol activate
# On the node's console
worker-node-07:~# ip link show eno1
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state DOWN mode DEFAULT
link/ether aa:bb:cc:dd:ee:27 brd ff:ff:ff:ff:ff:ff
worker-node-07:~# dmesg | tail -20
[1843201.442] bnxt_en 0000:3b:00.0 eno1: NIC Link is Down
[1843201.443] bnxt_en 0000:3b:00.0 eno1: firmware heartbeat stalled, resetting
[1843201.890] bnxt_en 0000:3b:00.0 eno1: firmware reset failed, rc=-110
[1843202.001] bnxt_en 0000:3b:00.0 eno1: FW reset failed with status 0xffff
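Once one node has shown this signature, it is worth checking the kernel log on the other nodes for the same sequence before it takes them down too. A minimal sketch of the check (the log lines are embedded as sample data here; on a live node you would pipe `dmesg` in):

```shell
# Sample dmesg lines; on a live node: log=$(dmesg | grep bnxt_en)
log='[1843201.442] bnxt_en 0000:3b:00.0 eno1: NIC Link is Down
[1843201.443] bnxt_en 0000:3b:00.0 eno1: firmware heartbeat stalled, resetting
[1843201.890] bnxt_en 0000:3b:00.0 eno1: firmware reset failed, rc=-110'

# Flag the fatal pattern: a stalled firmware heartbeat followed by a failed reset.
if echo "$log" | grep -q 'firmware heartbeat stalled' &&
   echo "$log" | grep -q 'firmware reset failed'; then
  echo "bnxt firmware wedged: NIC needs a power cycle and a firmware update"
fi
```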
The Broadcom NIC firmware crashed, and the driver's attempt to reset it timed out (rc=-110 is -ETIMEDOUT). This is a known bug in firmware version 222.1.120.0 on Broadcom BCM57414 NICs driven by bnxt_en.
Phase 2: Datacenter Ops Investigation (Root Cause)¶
Check the firmware version:
worker-node-07:~# ethtool -i eno1
driver: bnxt_en
version: 1.10.2-226.0.130.0
firmware-version: 222.1.120.0
bus-info: 0000:3b:00.0
Check the known issues database:
Broadcom BCM57414 firmware 222.1.120.0
Known issue: Under sustained high-throughput traffic (>18Gbps) for extended periods,
the firmware watchdog timer can stall, causing the NIC to become unresponsive.
Fix: Update to firmware version 222.1.148.0 or later.
Affected: All BCM57414 NICs with firmware < 222.1.148.0
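Deciding whether a given node is affected comes down to a version comparison against the 222.1.148.0 threshold. `sort -V` understands dotted version strings, so a small shell check suffices:

```shell
fixed="222.1.148.0"   # first firmware release with the watchdog fix
fw="222.1.120.0"      # version reported by: ethtool -i eno1

# If fw sorts strictly before the fixed release, the node is affected.
if [ "$fw" != "$fixed" ] &&
   [ "$(printf '%s\n' "$fw" "$fixed" | sort -V | head -n1)" = "$fw" ]; then
  echo "AFFECTED: $fw < $fixed"
fi
```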
Check other nodes for the same firmware:
# From the Ansible inventory
$ ansible k8s_workers -m shell -a "ethtool -i eno1 | grep firmware" --limit '!worker-node-07'
worker-node-01 | SUCCESS => firmware-version: 222.1.148.0
worker-node-02 | SUCCESS => firmware-version: 222.1.148.0
worker-node-03 | SUCCESS => firmware-version: 222.1.120.0
worker-node-04 | SUCCESS => firmware-version: 222.1.148.0
worker-node-05 | SUCCESS => firmware-version: 222.1.148.0
worker-node-06 | SUCCESS => firmware-version: 222.1.148.0
worker-node-08 | SUCCESS => firmware-version: 222.1.148.0
worker-node-09 | SUCCESS => firmware-version: 222.1.120.0
worker-node-10 | SUCCESS => firmware-version: 222.1.148.0
worker-node-11 | SUCCESS => firmware-version: 222.1.120.0
Nodes 03, 09, and 11 are still on firmware 222.1.120.0, as is the failed node 07 (confirmed earlier over the console). Nodes 03 and 11 are the ones that hit the same issue in the past 7 days. All four were added in the last hardware batch and missed the firmware update campaign.
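The fix the title promises is an Ansible playbook that brings the four stragglers up to 222.1.148.0, one node at a time so the cluster stays healthy. The sketch below is an assumption-laden outline, not the playbook from the incident: the firmware package path, the `bnxtnvm` flashing invocation, and the drain/uncordon steps are all illustrative and would need to be adapted to the real inventory and firmware bundle.

```yaml
# fix-bnxt-firmware.yml -- sketch only; package path, flashing tool
# invocation, and reboot strategy are assumptions, not from the incident.
- name: Update Broadcom BCM57414 firmware on stale workers
  hosts: worker-node-03,worker-node-07,worker-node-09,worker-node-11
  serial: 1                      # one node at a time to keep the cluster healthy
  become: true
  vars:
    target_fw: "222.1.148.0"
  tasks:
    - name: Read current firmware version
      ansible.builtin.shell: "ethtool -i eno1 | awk -F': ' '/firmware-version/ {print $2}'"
      register: fw
      changed_when: false

    - name: Cordon and drain the node before flashing
      delegate_to: localhost
      become: false
      ansible.builtin.command: "kubectl drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data"
      when: fw.stdout is version(target_fw, '<')

    - name: Flash the NIC firmware (hypothetical bnxtnvm invocation)
      ansible.builtin.command: "bnxtnvm -dev=eno1 install /opt/firmware/bcm57414-{{ target_fw }}.pkg"
      when: fw.stdout is version(target_fw, '<')

    - name: Reboot so the NIC loads the new firmware
      ansible.builtin.reboot:
      when: fw.stdout is version(target_fw, '<')

    - name: Uncordon the node
      delegate_to: localhost
      become: false
      ansible.builtin.command: "kubectl uncordon {{ inventory_hostname }}"
      when: fw.stdout is version(target_fw, '<')
```

Running with `serial: 1` plus drain/uncordon means at most one worker is out of rotation at any moment, and the `version(...)` guard makes the play idempotent on nodes already at the fixed firmware.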
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was a Kubernetes node going NotReady (kubernetes_ops), the root cause was a NIC firmware bug (datacenter_ops), and the fix requires an Ansible playbook to update firmware fleet-wide (devops_tooling). Chains like this are common because Kubernetes node health depends on network connectivity, and network connectivity depends on hardware firmware. Firmware updates, in turn, are a datacenter operations concern that must be managed with automation tooling to keep the fleet consistent.
Root Cause¶
Broadcom BCM57414 NIC firmware version 222.1.120.0 has a known bug where the firmware watchdog stalls under sustained high throughput, causing the NIC to become unresponsive. Four nodes in the cluster missed the firmware update campaign when they were added as a late hardware batch. The bug manifests after days to weeks of operation, depending on traffic patterns.