Incident Replay: HBA Firmware Mismatch

Setup

  • System context: Storage cluster with 12 nodes, each equipped with dual-port Broadcom HBAs connecting to a SAN fabric. A firmware update was applied to half the fleet.
  • Time: Monday 11:30 UTC
  • Your role: Storage engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "SAN monitoring fires — 6 nodes showing intermittent path failures to storage arrays. I/O latency spikes hitting production databases."]

What you see: Multipath shows flapping paths on 6 of 12 storage nodes. multipath -ll shows paths toggling between active and failed every few minutes. The other 6 nodes are stable.
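A quick way to quantify the flapping is to tally path states from the multipath output. This is a sketch over a captured, illustrative excerpt; on a live node you would pipe multipath -ll straight into the awk filter (device names and WWIDs here are made up).

```shell
# Tally path states from `multipath -ll` output. The sample below is
# an illustrative excerpt; on a live node, pipe the real command
# output into the same awk filter.
sample='mpatha (36001405abcdef) dm-2 BROADCOM,SAN-LUN
size=2.0T features="1 queue_if_no_path" hwhandler="1 alua" wp=rw
|-+- policy=service-time 0 prio=50 status=active
| `- 1:0:0:1 sdb 8:16  active ready running
`-+- policy=service-time 0 prio=10 status=enabled
  `- 2:0:0:1 sdc 8:32  failed faulty running'
summary=$(printf '%s\n' "$sample" | awk \
  '/active ready/ {a++} /failed faulty/ {f++}
   END {printf "active=%d failed=%d", a, f}')
echo "$summary"   # active=1 failed=1
```

Run this in a loop every minute and a flapping node shows the failed count oscillating rather than sitting at zero.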

Choose your action:

  • A) Restart multipathd on the affected hosts
  • B) Check which 6 nodes are affected and what they have in common
  • C) Fail over all I/O to a single HBA port to stabilize
  • D) Contact the SAN team to check switch port errors

If you chose A:

[Result: multipathd restart briefly stabilizes paths but they start flapping again within 5 minutes. Not a real fix.]

If you chose B:

[Result: The 6 affected nodes are the ones that received the HBA firmware update last week. The other 6 are still on the old firmware. systool -c fc_host -v shows a firmware version mismatch between the two groups. Proceed to Round 2.]
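The correlation step can be scripted. Assuming you have already collected one "node firmware-version" line per host (for example via ssh plus the systool command above; the node names and gathering loop below are hypothetical), grouping by version makes the split obvious:

```shell
# Group nodes by reported HBA firmware version. The report variable
# stands in for output gathered per node, e.g. (hypothetical):
#   for n in $NODES; do
#     echo "$n $(ssh "$n" "systool -c fc_host -v" | awk '/firmware/ {print $NF; exit}')"
#   done
report='node01 12.8.350
node02 12.6.240
node03 12.8.350
node04 12.6.240'
groups=$(printf '%s\n' "$report" | awk \
  '{by[$2] = by[$2] " " $1} END {for (v in by) print v ":" by[v]}' | sort)
echo "$groups"
```

Two groups fall out immediately, and the affected set lines up exactly with the updated-firmware group.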

If you chose C:

[Result: Single-path I/O works but you have lost redundancy. If that one port fails, the node loses all storage access.]

If you chose D:

[Result: SAN switches show CRC errors on ports connected to updated nodes. Helpful clue but does not identify root cause directly. Adds 10 minutes.]

Round 2: First Triage Data

[Pressure cue: "Database team reports intermittent query timeouts. They are considering failing over to the DR site."]

What you see: Updated nodes are on HBA firmware 12.8.350. Non-updated nodes are on 12.6.240. The SAN switch firmware was updated 2 months ago and has a known incompatibility with HBA firmware 12.8.x when NPIV is enabled.
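Per the release notes, any 12.8.x HBA firmware paired with the current switch firmware and NPIV enabled is suspect, so the affected set can be confirmed mechanically. A minimal sketch of that check:

```shell
# Flag firmware versions in the known-bad 12.8.x range from the
# release notes. Pure string match; no hardware access needed.
is_affected() {
  case "$1" in
    12.8.*) echo yes ;;
    *)      echo no  ;;
  esac
}
is_affected 12.8.350   # yes
is_affected 12.6.240   # no
```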

Choose your action:

  • A) Roll back HBA firmware on the 6 updated nodes to 12.6.240
  • B) Disable NPIV on the affected nodes to work around the incompatibility
  • C) Update the remaining 6 nodes to 12.8.350 so everything matches
  • D) Update the SAN switch firmware to a version compatible with 12.8.x

If you chose A:

[Result: Firmware rollback via hbacmd restores stable paths. All 12 nodes are now on consistent firmware. Path flapping stops. Proceed to Round 3.]
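A hedged sketch of the rollback loop. The hbacmd subcommand syntax, the firmware file name, and the WWPNs below are all assumptions; they vary by HBA generation, so verify them against vendor documentation and the output of hbacmd ListHBAs before dropping the dry run.

```shell
# Dry-run wrapper around the rollback. With DRY_RUN=1 the commands
# are only printed; set DRY_RUN=0 only after verifying the hbacmd
# syntax and firmware file for your HBA model (both assumed here).
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "DRY: $*"; else "$@"; fi
}
# Placeholder WWPNs -- substitute real ones from `hbacmd ListHBAs`.
for wwpn in 10:00:00:00:c9:aa:bb:01 10:00:00:00:c9:aa:bb:02; do
  run hbacmd Download "$wwpn" fw-12.6.240.grp
done
```

The dry-run wrapper is worth keeping in any incident script: it lets you review the exact commands before they touch 6 production nodes.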

If you chose B:

[Result: Disabling NPIV breaks virtual HBA assignments for VMs. Not acceptable in this environment.]

If you chose C:

[Result: Updating the remaining nodes would make all 12 nodes affected. You would double the blast radius.]

If you chose D:

[Result: Correct long-term fix but SAN switch firmware updates require a maintenance window with full SAN redundancy verification. Cannot do this during an incident.]

Round 3: Root Cause Identification

[Pressure cue: "Paths are stable. Why did this happen?"]

What you see: The root cause is that the HBA firmware update was applied without checking the SAN switch firmware compatibility matrix. The Broadcom release notes for 12.8.350 explicitly list a known issue with the current SAN switch firmware version when NPIV is enabled.

Choose your action:

  • A) Create a firmware compatibility matrix and enforce pre-update checks
  • B) Schedule coordinated HBA + SAN switch firmware updates in the next maintenance window
  • C) Add automated compatibility validation to the firmware update pipeline
  • D) All of the above

If you chose D:

[Result: Comprehensive prevention: the compatibility matrix documents known issues, coordinated updates ensure matching versions, and automated checks prevent rollout of incompatible combinations. Proceed to Round 4.]
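A sketch of the automated gate (option C) built on top of the matrix (option A). The switch firmware versions in the matrix below are illustrative placeholders; the real entries come from the vendor compatibility matrix and release notes.

```shell
# Pre-flight gate: allow an update only if the (HBA fw, switch fw)
# pair appears in the compatibility matrix. Matrix entries below are
# illustrative; populate from vendor release notes.
compat_ok() {  # usage: compat_ok <hba_fw> <switch_fw>
  grep -qx "$1 $2" <<'EOF'
12.6.240 9.0.1
12.8.350 9.1.0
EOF
}
if compat_ok 12.8.350 9.0.1; then echo allow; else echo block; fi   # block
```

Wired into the update pipeline as a mandatory step, this check would have blocked last week's rollout outright instead of relying on a human to consult the matrix.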

If you chose A:

[Result: Manual matrix helps but relies on humans checking it. Partial fix.]

If you chose B:

[Result: Fixes this specific case but does not prevent future mismatches.]

If you chose C:

[Result: Automation is good but needs the matrix to validate against.]

Round 4: Remediation

[Pressure cue: "All nodes stable. Close the incident."]

Actions:

  1. Verify all nodes show healthy multipath: multipath -ll on each node
  2. Confirm consistent HBA firmware: systool -c fc_host -v | grep firmware
  3. Check that I/O latency has returned to baseline
  4. Document the firmware compatibility requirements
  5. Schedule the coordinated SAN + HBA update for the next maintenance window
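Steps 1-2 of the checklist reduce to a per-node predicate. Here the multipath output is stubbed with illustrative lines so the logic is visible and testable; during the actual closeout you would feed it the real command output from each node.

```shell
# A node passes the closeout check only if no path reports a
# failed/faulty state. Stub inputs stand in for `multipath -ll`.
node_healthy() {
  ! printf '%s\n' "$1" | grep -qE 'failed|faulty'
}
good='| `- 1:0:0:1 sdb 8:16 active ready running'
bad='  `- 2:0:0:1 sdc 8:32 failed faulty running'
node_healthy "$good" && echo "pass"   # pass
node_healthy "$bad"  || echo "fail"   # fail
```

Do not close the incident while any node fails the check, even intermittently: intermittent failure is exactly the signature this incident started with.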

Damage Report

  • Total downtime: 0 (degraded I/O but no full outage)
  • Blast radius: 6 nodes with intermittent storage path failures; database query latency increased 5x
  • Optimal resolution time: 20 minutes (correlate nodes -> identify firmware delta -> rollback)
  • If every wrong choice was made: 90+ minutes plus risk of total storage connectivity loss

Cross-References