
Investigation: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook

Phase 1: Kubernetes Investigation (Dead End)

Check node conditions:

$ kubectl describe node worker-node-07 | grep -A10 "Conditions"
Conditions:
  Type             Status    LastHeartbeatTime                 Reason
  ----             ------    -----------------                 ------
  Ready            Unknown   2026-03-19T22:09:14Z              NodeStatusUnknown
  MemoryPressure   Unknown   2026-03-19T22:09:14Z              NodeStatusUnknown
  DiskPressure     Unknown   2026-03-19T22:09:14Z              NodeStatusUnknown
  PIDPressure      Unknown   2026-03-19T22:09:14Z              NodeStatusUnknown

The last heartbeat was five minutes ago: the kubelet has stopped reporting, which is consistent with either a network partition or a kubelet crash. SSH to the node fails:

$ ssh worker-node-07.prod.internal
ssh: connect to host worker-node-07.prod.internal port 22: No route to host
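The stale heartbeat can also be cross-checked without SSH via the node's Lease object in the kube-node-lease namespace (`kubectl get lease worker-node-07 -n kube-node-lease -o jsonpath='{.spec.renewTime}'`). A minimal sketch of the staleness arithmetic, with a hard-coded "now" so it is reproducible (assumes GNU date):

```shell
# renewTime as returned from the node's Lease; the kubelet normally renews
# it every 10s, so anything older than ~40s means the heartbeat is gone.
renew_time='2026-03-19T22:09:14Z'
now='2026-03-19T22:14:14Z'   # fixed "now" for a reproducible example

# GNU date parses the ISO 8601 timestamps; age is in seconds.
age=$(( $(date -u -d "$now" +%s) - $(date -u -d "$renew_time" +%s) ))
echo "heartbeat age: ${age}s"
```

Here the lease is five minutes stale, matching the LastHeartbeatTime in the node conditions.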

Check from the monitoring network:

$ ping -c 3 10.0.4.27
PING 10.0.4.27 (10.0.4.27) 56(84) bytes of data.
--- 10.0.4.27 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2003ms

Node is completely unreachable on the network. Check the switch port:

# From the network switch
switch# show interface ethernet 1/27
Ethernet1/27 is up
  Hardware: 25GbE, address: aa:bb:cc:dd:ee:27
  Internet address is not set
  MTU 9216 bytes
  Speed 25 Gbps
  Last link flap: 0d 00:05:12 ago

The switch port is up, and the link just flapped 5 minutes ago — exactly when the node went NotReady. This looks like a cable or NIC issue.

The Pivot

Access the node via IPMI/iLO out-of-band console:

# Via ipmitool
$ ipmitool -I lanplus -H 10.0.5.27 -U admin -P *** sol activate

# On the node's console
worker-node-07:~# ip link show eno1
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9000 qdisc mq state DOWN mode DEFAULT
    link/ether aa:bb:cc:dd:ee:27 brd ff:ff:ff:ff:ff:ff

worker-node-07:~# dmesg | tail -20
[1843201.442] bnxt_en 0000:3b:00.0 eno1: NIC Link is Down
[1843201.443] bnxt_en 0000:3b:00.0 eno1: firmware heartbeat stalled, resetting
[1843201.890] bnxt_en 0000:3b:00.0 eno1: firmware reset failed, rc=-110
[1843202.001] bnxt_en 0000:3b:00.0 eno1: FW reset failed with status 0xffff

The Broadcom NIC firmware crashed, and the driver's attempt to reset it failed (rc=-110 is -ETIMEDOUT). This is a known bug in firmware version 222.1.120.0 on Broadcom BCM57414 NICs driven by bnxt_en.
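However the kernel log is collected (dmesg, journalctl -k, or the out-of-band console), the failure signature is easy to match mechanically. A minimal sketch, run here against a reproduced copy of the dmesg lines above:

```shell
# Detect the bnxt_en firmware-stall signature in a kernel log.
# `log` reproduces the incident's dmesg lines; in practice the input
# would be `dmesg` or `journalctl -k` output from each node.
log='[1843201.443] bnxt_en 0000:3b:00.0 eno1: firmware heartbeat stalled, resetting
[1843201.890] bnxt_en 0000:3b:00.0 eno1: firmware reset failed, rc=-110'

result="ok"
if printf '%s\n' "$log" | grep -q 'firmware heartbeat stalled'; then
  result="bnxt_en firmware stall detected"
fi
echo "$result"
```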

Phase 2: Datacenter Ops Investigation (Root Cause)

Check the firmware version:

worker-node-07:~# ethtool -i eno1
driver: bnxt_en
version: 1.10.2-226.0.130.0
firmware-version: 222.1.120.0
bus-info: 0000:3b:00.0

Check the known issues database:

Broadcom BCM57414 firmware 222.1.120.0
Known issue: Under sustained high-throughput traffic (>18Gbps) for extended periods,
the firmware watchdog timer can stall, causing the NIC to become unresponsive.
Fix: Update to firmware version 222.1.148.0 or later.
Affected: All BCM57414 NICs with firmware < 222.1.148.0
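Since "affected" is simply "firmware older than 222.1.148.0", a remediation script only needs a dotted-version comparison. A sketch using version sort (`sort -V`, GNU coreutils):

```shell
# Return success (0) if the given BCM57414 firmware version is affected,
# i.e. strictly older than the fixed release 222.1.148.0.
is_affected() {
  fw="$1"; fixed="222.1.148.0"
  # Older versions sort first under `sort -V`; exclude equality explicitly.
  [ "$fw" != "$fixed" ] &&
    [ "$(printf '%s\n%s\n' "$fw" "$fixed" | sort -V | head -n1)" = "$fw" ]
}

is_affected 222.1.120.0 && echo "222.1.120.0 is affected"
is_affected 222.1.148.0 || echo "222.1.148.0 is not affected"
```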

Check other nodes for the same firmware:

# From the Ansible inventory
$ ansible k8s_workers -m shell -a "ethtool -i eno1 | grep firmware" --limit '!worker-node-07'
worker-node-01 | SUCCESS => firmware-version: 222.1.148.0
worker-node-02 | SUCCESS => firmware-version: 222.1.148.0
worker-node-03 | SUCCESS => firmware-version: 222.1.120.0
worker-node-04 | SUCCESS => firmware-version: 222.1.148.0
worker-node-05 | SUCCESS => firmware-version: 222.1.148.0
worker-node-06 | SUCCESS => firmware-version: 222.1.148.0
worker-node-08 | SUCCESS => firmware-version: 222.1.148.0
worker-node-09 | SUCCESS => firmware-version: 222.1.120.0
worker-node-10 | SUCCESS => firmware-version: 222.1.148.0
worker-node-11 | SUCCESS => firmware-version: 222.1.120.0

Nodes 03, 07, 09, and 11 are still on the old firmware, and nodes 03 and 11 are the ones that hit the same issue within the past 7 days. All four were added in the most recent hardware batch and missed the firmware update campaign.
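The stale hosts can be pulled straight out of that audit output; a sketch over a few reproduced lines of it:

```shell
# Extract the hosts still on the buggy firmware from the fleet audit output.
# `audit` reproduces a subset of the ansible run above.
audit='worker-node-03 | SUCCESS => firmware-version: 222.1.120.0
worker-node-04 | SUCCESS => firmware-version: 222.1.148.0
worker-node-09 | SUCCESS => firmware-version: 222.1.120.0
worker-node-11 | SUCCESS => firmware-version: 222.1.120.0'

# First field is the hostname; match only lines reporting the old firmware.
stale=$(printf '%s\n' "$audit" | awk '/222\.1\.120\.0/ {print $1}')
echo "$stale"
```

This list becomes the target group for the remediation playbook.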

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was a Kubernetes node going NotReady (kubernetes_ops), the root cause was a NIC firmware bug (datacenter_ops), and the fix requires an Ansible playbook to update firmware fleet-wide (devops_tooling). This kind of cross-domain failure is common because Kubernetes node health depends on network connectivity, which in turn depends on hardware firmware, and firmware updates are a datacenter operations concern that has to be managed with automation tooling to keep the fleet consistent.

Root Cause

Broadcom BCM57414 NIC firmware version 222.1.120.0 has a known bug where the firmware watchdog stalls under sustained high throughput, leaving the NIC unresponsive. Four nodes in the cluster missed the firmware update campaign because they arrived in a late hardware batch. The bug manifests after days to weeks of operation, depending on traffic patterns.
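The fleet-wide fix is an Ansible playbook that drains, flashes, reboots, and uncordons one node at a time. A hedged sketch of what that playbook could look like; the flashing command, the niccli path, and the firmware package name are placeholders, since the real invocation depends on the vendor tooling installed on the nodes:

```shell
# Write a sketch of the remediation playbook. Vendor tool path and firmware
# package path below are ASSUMPTIONS, not values from this incident.
cat > update-nic-firmware.yml <<'EOF'
---
- name: Update BCM57414 firmware on nodes still running 222.1.120.0
  hosts: k8s_workers
  serial: 1                 # one node at a time, so the cluster keeps capacity
  become: true
  tasks:
    - name: Read current NIC firmware version
      command: ethtool -i eno1
      register: ethtool_out
      changed_when: false

    - name: Skip nodes already on fixed firmware
      meta: end_host
      when: "'222.1.120.0' not in ethtool_out.stdout"

    - name: Drain the node before flashing
      command: kubectl drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data
      delegate_to: localhost

    - name: Flash the fixed firmware (tool and package paths are placeholders)
      command: /opt/broadcom/niccli install /opt/firmware/bcm57414-222.1.148.0.pkg

    - name: Reboot so the new firmware takes effect
      reboot:

    - name: Uncordon the node
      command: kubectl uncordon {{ inventory_hostname }}
      delegate_to: localhost
EOF
```

`serial: 1` is the important part: it guarantees only one worker is drained at any moment, so the update campaign cannot take down cluster capacity the way the firmware bug did.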