Remediation: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook
Immediate Fix (DevOps Tooling, Domain C)
The fix requires an Ansible playbook to update NIC firmware across all affected nodes.
Step 1: Recover the current node via IPMI
# From the iLO/IPMI console
worker-node-07:~# modprobe -r bnxt_en && modprobe bnxt_en
# If driver reload fails:
worker-node-07:~# reboot
After reboot, the node comes back online:
$ kubectl get node worker-node-07
NAME             STATUS   ROLES    AGE   VERSION
worker-node-07   Ready    <none>   47d   v1.28.4
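If the node takes a while to come back, the check above can be polled instead of rerun by hand. The following is a minimal sketch, not part of the original runbook: the function names, the 10-second interval, and the default of 30 attempts are all assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: poll until a node reports Ready after the driver reload or reboot.

# Reads one line of `kubectl get node <name> --no-headers` on stdin and
# succeeds only if the STATUS column is exactly "Ready" (so "NotReady" and
# "Ready,SchedulingDisabled" both count as not ready).
node_line_is_ready() {
  awk '{ found = ($2 == "Ready") } END { exit (found ? 0 : 1) }'
}

# Polls every 10 seconds, up to max_tries times (hypothetical defaults).
wait_for_ready() {
  local node="$1" max_tries="${2:-30}" i
  for ((i = 0; i < max_tries; i++)); do
    if kubectl get node "$node" --no-headers 2>/dev/null | node_line_is_ready; then
      echo "$node is Ready"
      return 0
    fi
    sleep 10
  done
  echo "$node did not become Ready in time" >&2
  return 1
}

# Usage (requires kubectl access to the cluster):
#   wait_for_ready worker-node-07
```

The strict match on "Ready" is deliberate: a node that is still cordoned should not be treated as fully recovered.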
Step 2: Create the firmware update Ansible playbook
# devops/ansible/playbooks/nic-firmware-update.yml
---
- name: Update Broadcom BCM57414 NIC firmware
  hosts: k8s_workers
  serial: 1  # Rolling update, one node at a time
  become: true
  vars:
    target_firmware: "222.1.148.0"
    firmware_package: "bnxt-firmware-222.1.148.0.pkg"
    firmware_url: "https://artifacts.internal/firmware/broadcom/{{ firmware_package }}"

  tasks:
    - name: Check current firmware version
      shell: ethtool -i eno1 | grep firmware-version | awk '{print $2}'
      register: current_firmware
      changed_when: false
      # Read-only task: run it under --check too, otherwise the conditions
      # below have no stdout to compare against during a dry run.
      check_mode: false

    - name: Skip if already updated
      debug:
        msg: "Node {{ inventory_hostname }} already on {{ current_firmware.stdout }}"
      when: current_firmware.stdout == target_firmware

    # kubectl runs from the control host, not on the worker being updated.
    - name: Cordon node in Kubernetes
      delegate_to: localhost
      become: false
      command: kubectl cordon {{ inventory_hostname }}
      when: current_firmware.stdout != target_firmware

    - name: Drain node
      delegate_to: localhost
      become: false
      command: >
        kubectl drain {{ inventory_hostname }}
        --ignore-daemonsets
        --delete-emptydir-data
        --timeout=300s
      when: current_firmware.stdout != target_firmware

    - name: Download firmware package
      get_url:
        url: "{{ firmware_url }}"
        dest: "/tmp/{{ firmware_package }}"
      when: current_firmware.stdout != target_firmware

    - name: Apply firmware update
      command: bnxtnvm -dev=eno1 -force -y install /tmp/{{ firmware_package }}
      when: current_firmware.stdout != target_firmware
      register: fw_update

    - name: Reboot node to activate firmware
      reboot:
        reboot_timeout: 300
      when: fw_update is changed

    # Fails the play for this node if the new firmware did not take effect.
    - name: Verify firmware version
      shell: ethtool -i eno1 | grep firmware-version | awk '{print $2}'
      register: new_firmware
      changed_when: false
      failed_when: new_firmware.stdout != target_firmware
      when: fw_update is changed

    - name: Uncordon node
      delegate_to: localhost
      become: false
      command: kubectl uncordon {{ inventory_hostname }}
      when: current_firmware.stdout != target_firmware
Step 3: Run the playbook against affected nodes
$ ansible-playbook devops/ansible/playbooks/nic-firmware-update.yml \
--limit "worker-node-03,worker-node-07,worker-node-09,worker-node-11" \
-v
PLAY [Update Broadcom BCM57414 NIC firmware] ***
...
PLAY RECAP *********************************************************************
worker-node-03 : ok=8 changed=5 unreachable=0 failed=0
worker-node-07 : ok=8 changed=5 unreachable=0 failed=0
worker-node-09 : ok=8 changed=5 unreachable=0 failed=0
worker-node-11 : ok=8 changed=5 unreachable=0 failed=0
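The `--limit` list above is written out by hand. As a hedged sketch, it could instead be derived from a list of node and firmware-version pairs, however those were collected. The function name `outdated_nodes` and the file `firmware-inventory.txt` are hypothetical and not part of the original tooling.

```bash
#!/usr/bin/env bash
# Sketch: turn "node firmware-version" pairs into a value for --limit.

# Reads lines of the form "<node> <firmware-version>" on stdin and prints a
# comma-separated list of the nodes whose version differs from the target.
outdated_nodes() {
  local target="$1"
  awk -v target="$target" '
    # Appending "" forces a string comparison, so 1.10 and 1.1 stay distinct.
    ($2 "") != (target "") { printf "%s%s", sep, $1; sep = "," }
    END { if (sep != "") print "" }
  '
}

# Usage (firmware-inventory.txt is a hypothetical file of "<node> <version>" lines):
#   ansible-playbook devops/ansible/playbooks/nic-firmware-update.yml \
#     --limit "$(outdated_nodes 222.1.148.0 < firmware-inventory.txt)" -v
```

An empty result means no node needs the update. Check for that before calling `ansible-playbook` rather than passing an empty `--limit`. Because the playbook already skips nodes on the target version, this only narrows the run and does not replace the playbook's own check.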
Verification
Domain A (Kubernetes): All nodes Ready
$ kubectl get nodes
NAME             STATUS   ROLES    AGE   VERSION
worker-node-01   Ready    <none>   92d   v1.28.4
worker-node-02   Ready    <none>   92d   v1.28.4
worker-node-03   Ready    <none>   47d   v1.28.4
...
worker-node-11   Ready    <none>   47d   v1.28.4
Domain B (Datacenter): Firmware versions consistent
$ ansible k8s_workers -m shell -a "ethtool -i eno1 | grep firmware"
worker-node-01 | SUCCESS => firmware-version: 222.1.148.0
worker-node-02 | SUCCESS => firmware-version: 222.1.148.0
worker-node-03 | SUCCESS => firmware-version: 222.1.148.0
...
worker-node-11 | SUCCESS => firmware-version: 222.1.148.0
Domain C (DevOps Tooling): Playbook in source control
$ ls devops/ansible/playbooks/nic-firmware-update.yml
devops/ansible/playbooks/nic-firmware-update.yml
$ ansible-playbook devops/ansible/playbooks/nic-firmware-update.yml --check
# Dry run shows no changes needed — all nodes updated
Prevention
- Monitoring: Add a firmware version compliance check that runs daily via Ansible or a Prometheus exporter. Alert when any node's firmware does not match the approved version.
- Runbook: Every new hardware batch must go through the firmware baseline playbook before being added to the Kubernetes cluster. Add a pre-join validation step to the node provisioning pipeline.
- Architecture: Use an Ansible role that runs the firmware check as part of the node bootstrap process. Tag nodes with firmware version labels in Kubernetes so drift is visible:
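One way such labels might be applied is sketched below. It assumes the same "node version" pairs that a compliance check would already collect. The label key `example.com/nic-firmware`, the function name, and the file `firmware-inventory.txt` are hypothetical placeholders, not names from the existing tooling.

```bash
#!/usr/bin/env bash
# Sketch: publish each node's NIC firmware version as a Kubernetes label.

# Hypothetical label key; substitute a domain your organisation owns.
LABEL_KEY="example.com/nic-firmware"

# Reads "<node> <firmware-version>" lines on stdin and prints one
# `kubectl label` command per node, so the output can be reviewed
# before anything is executed.
label_commands() {
  awk -v key="$LABEL_KEY" 'NF >= 2 {
    printf "kubectl label node %s %s=%s --overwrite\n", $1, key, $2
  }'
}

# Usage (firmware-inventory.txt is a hypothetical file of "<node> <version>" lines):
#   label_commands < firmware-inventory.txt        # review
#   label_commands < firmware-inventory.txt | sh   # apply
#   kubectl get nodes -L example.com/nic-firmware  # show the label as a column
```

Printing the commands first keeps the step auditable. With the label in place, `kubectl get nodes -L` shows each node's firmware version as an extra column, so a node on a different version stands out.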