Remediation: Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook

Immediate Fix (DevOps Tooling — Domain C)

The fix is a rolling NIC firmware update, applied with an Ansible playbook to every affected node.

Step 1: Recover the current node via IPMI

# From the iLO/IPMI console
worker-node-07:~# modprobe -r bnxt_en && modprobe bnxt_en
# If driver reload fails:
worker-node-07:~# reboot

After the driver reload or reboot, the node should return to Ready:

$ kubectl get node worker-node-07
NAME             STATUS   ROLES    AGE   VERSION
worker-node-07   Ready    <none>   47d   v1.28.4
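
Rather than re-running `kubectl get node` by hand, the wait can be scripted. The following is a minimal sketch, assuming `kubectl` is already configured for the cluster; the function name `wait_for_node_ready` and the 300-second default are illustrative and not part of the original runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until a node reports Ready, or time out.
# Usage: wait_for_node_ready <node-name> [timeout-seconds]
wait_for_node_ready() {
  local node="$1"
  local timeout="${2:-300}"   # assumed default, adjust to the hardware's boot time

  if [ -z "$node" ]; then
    echo "usage: wait_for_node_ready <node-name> [timeout-seconds]" >&2
    return 2
  fi

  # kubectl wait exits non-zero if the condition is not met within the timeout.
  kubectl wait --for=condition=Ready "node/${node}" --timeout="${timeout}s"
}
```

For example, `wait_for_node_ready worker-node-07 600` would return non-zero if the node has not become Ready within ten minutes, which makes it usable as a gate in a recovery script.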

Step 2: Create the firmware update Ansible playbook

# devops/ansible/playbooks/nic-firmware-update.yml
---
- name: Update Broadcom BCM57414 NIC firmware
  hosts: k8s_workers
  serial: 1  # Rolling update, one node at a time
  become: true
  vars:
    target_firmware: "222.1.148.0"
    firmware_package: "bnxt-firmware-222.1.148.0.pkg"
    firmware_url: "https://artifacts.internal/firmware/broadcom/{{ firmware_package }}"

  tasks:
    - name: Check current firmware version
      shell: ethtool -i eno1 | grep firmware-version | awk '{print $2}'
      register: current_firmware
      changed_when: false
      check_mode: false  # Read-only, so run it under --check too

    - name: Skip if already updated
      debug:
        msg: "Node {{ inventory_hostname }} already on {{ current_firmware.stdout }}"
      when: current_firmware.stdout == target_firmware

    - name: Cordon node in Kubernetes
      delegate_to: localhost
      become: false
      command: kubectl cordon {{ inventory_hostname }}
      when: current_firmware.stdout != target_firmware

    - name: Drain node
      delegate_to: localhost
      become: false
      command: >
        kubectl drain {{ inventory_hostname }}
        --ignore-daemonsets
        --delete-emptydir-data
        --timeout=300s
      when: current_firmware.stdout != target_firmware

    - name: Download firmware package
      get_url:
        url: "{{ firmware_url }}"
        dest: "/tmp/{{ firmware_package }}"
      when: current_firmware.stdout != target_firmware

    - name: Apply firmware update
      command: bnxtnvm -dev=eno1 -force -y install /tmp/{{ firmware_package }}
      when: current_firmware.stdout != target_firmware
      register: fw_update

    - name: Reboot node to activate firmware
      reboot:
        reboot_timeout: 300
      when: fw_update is changed

    - name: Verify firmware version
      shell: ethtool -i eno1 | grep firmware-version | awk '{print $2}'
      register: new_firmware
      changed_when: false  # Read-only check, never reports a change
      failed_when: new_firmware.stdout != target_firmware
      when: fw_update is changed

    - name: Uncordon node
      delegate_to: localhost
      become: false
      command: kubectl uncordon {{ inventory_hostname }}
      when: current_firmware.stdout != target_firmware
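
The playbook's version check depends on a small `ethtool | grep | awk` pipeline. To try that logic in isolation, it can be sketched as shell functions that read `ethtool -i` output on stdin. The function names `parse_fw_version` and `fw_matches_target` are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the playbook's pipeline:
#   ethtool -i eno1 | grep firmware-version | awk '{print $2}'
# Reads `ethtool -i` output on stdin and prints the second field of the
# firmware-version line. Prints nothing if that line is absent.
parse_fw_version() {
  awk '$1 == "firmware-version:" { print $2; exit }'
}

# Returns 0 when the parsed version equals the target, 1 otherwise
# (including when no firmware-version line was found).
# Usage: ethtool -i eno1 | fw_matches_target 222.1.148.0
fw_matches_target() {
  local target="$1"
  local current
  current="$(parse_fw_version)"
  [ -n "$current" ] && [ "$current" = "$target" ]
}
```

As in the playbook, only the first field after `firmware-version:` is compared. If the driver prints extra fields on that line, the target string has to match the first one exactly.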

Step 3: Run the playbook against affected nodes

$ ansible-playbook devops/ansible/playbooks/nic-firmware-update.yml \
    --limit "worker-node-03,worker-node-07,worker-node-09,worker-node-11" \
    -v

PLAY [Update Broadcom BCM57414 NIC firmware] ***
...
PLAY RECAP *********************************************************************
worker-node-03    : ok=8    changed=5    unreachable=0    failed=0
worker-node-07    : ok=8    changed=5    unreachable=0    failed=0
worker-node-09    : ok=8    changed=5    unreachable=0    failed=0
worker-node-11    : ok=8    changed=5    unreachable=0    failed=0

Verification

Domain A (Kubernetes) — All nodes Ready

$ kubectl get nodes
NAME             STATUS   ROLES    AGE   VERSION
worker-node-01   Ready    <none>   92d   v1.28.4
worker-node-02   Ready    <none>   92d   v1.28.4
worker-node-03   Ready    <none>   47d   v1.28.4
...
worker-node-11   Ready    <none>   47d   v1.28.4

Domain B (Datacenter) — Firmware versions consistent

$ ansible k8s_workers -m shell -a "ethtool -i eno1 | grep firmware"
worker-node-01 | SUCCESS => firmware-version: 222.1.148.0
worker-node-02 | SUCCESS => firmware-version: 222.1.148.0
worker-node-03 | SUCCESS => firmware-version: 222.1.148.0
...
worker-node-11 | SUCCESS => firmware-version: 222.1.148.0
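
The per-host listing above can be reduced to a pass/fail result. The sketch below assumes the versions have already been collected as `<hostname> <firmware-version>` pairs, one per line. That input format and the function name `report_fw_drift` are assumptions made for illustration, not the output format of the `ansible` command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<hostname> <firmware-version>" pairs on stdin
# and prints every host whose version is missing or differs from the approved
# one. Exits 0 when all hosts match, 1 when at least one host has drifted.
# Usage: report_fw_drift <approved-version> < versions.txt
report_fw_drift() {
  local approved="$1"
  awk -v approved="$approved" '
    NF == 0 { next }                      # skip blank lines
    $2 != approved {                      # missing or different version
      print $1, (NF >= 2 ? $2 : "unknown")
      drift = 1
    }
    END { exit (drift ? 1 : 0) }
  '
}
```

Because the exit status reflects drift, the function can gate a CI job or a cron-driven alert without parsing its output.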

Domain C (DevOps Tooling) — Playbook in source control

$ ls devops/ansible/playbooks/nic-firmware-update.yml
devops/ansible/playbooks/nic-firmware-update.yml

$ ansible-playbook devops/ansible/playbooks/nic-firmware-update.yml --check
# Dry run shows no changes needed — all nodes updated

Prevention

  • Monitoring: Add a firmware version compliance check that runs daily via Ansible or a Prometheus exporter. Alert when any node's firmware does not match the approved version.

  • Runbook: Every new hardware batch must go through the firmware baseline playbook before being added to the Kubernetes cluster. Add a pre-join validation step to the node provisioning pipeline.

  • Architecture: Use an Ansible role that runs the firmware check as part of the node bootstrap process. Tag nodes with firmware version labels in Kubernetes so drift is visible:

kubectl label node worker-node-07 firmware.nic/version=222.1.148.0
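
The monitoring and pre-join ideas above can be sketched in shell. Everything named here is an assumption: the metric name `nic_firmware_compliant` and the functions `fw_compliance_metric` and `prejoin_fw_check` are hypothetical. The metric line follows the Prometheus text exposition format, so it could be written to a file that an exporter's textfile mechanism picks up.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: function and metric names are illustrative only.

# Print a Prometheus text-format gauge: 1 if the current version matches the
# approved one, 0 otherwise.
# Usage: fw_compliance_metric <device> <current-version> <approved-version>
fw_compliance_metric() {
  local device="$1" current="$2" approved="$3"
  local value=0
  if [ "$current" = "$approved" ]; then
    value=1
  fi
  printf 'nic_firmware_compliant{device="%s",version="%s"} %d\n' \
    "$device" "$current" "$value"
}

# Pre-join gate: succeed only when the firmware matches the approved version.
# Usage: prejoin_fw_check <current-version> <approved-version>
prejoin_fw_check() {
  local current="$1" approved="$2"
  if [ "$current" != "$approved" ]; then
    echo "firmware ${current:-unknown} does not match approved ${approved}" >&2
    return 1
  fi
}
```

In a provisioning pipeline, `prejoin_fw_check` would run before the node joins the cluster and abort the join on a non-zero exit. The daily compliance job would call `fw_compliance_metric` and alert when the gauge is 0.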