# Incident Replay: HBA Firmware Mismatch

## Setup
- System context: Storage cluster with 12 nodes, each equipped with dual-port Broadcom HBAs connecting to a SAN fabric. A firmware update was applied to half the fleet.
- Time: Monday 11:30 UTC
- Your role: Storage engineer / on-call SRE
## Round 1: Alert Fires
[Pressure cue: "SAN monitoring fires — 6 nodes showing intermittent path failures to storage arrays. I/O latency spikes hitting production databases."]
What you see:
`multipath -ll` on 6 of the 12 storage nodes shows paths toggling between active and failed every few minutes. The other 6 nodes are stable.
Choose your action:

- A) Restart multipathd on the affected hosts
- B) Check which 6 nodes are affected and what they have in common
- C) Fail over all I/O to a single HBA port to stabilize
- D) Contact the SAN team to check switch port errors
### If you chose A:
[Result: multipathd restart briefly stabilizes paths but they start flapping again within 5 minutes. Not a real fix.]
### If you chose B (recommended):
[Result: The 6 affected nodes are the ones that received the HBA firmware update last week; the other 6 are still on the old firmware. `systool -c fc_host -v` shows the firmware version mismatch between the two groups. Proceed to Round 2.]
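The correlation step in option B can be done offline once you have a per-node inventory. A minimal sketch, assuming a `node firmware` inventory file gathered beforehand (e.g. over ssh with `systool -c fc_host -v`); node names and versions are illustrative:

```shell
# Build a sample inventory (in practice: collected from the fleet over ssh).
cat > /tmp/fw_inventory.txt <<'EOF'
node01 12.8.350
node02 12.6.240
node03 12.8.350
node04 12.6.240
EOF

# Group nodes by firmware version to expose the split fleet.
awk '{ groups[$2] = groups[$2] " " $1 }
     END { for (v in groups) print v ":" groups[v] }' /tmp/fw_inventory.txt | sort
```

With a real inventory, the affected/unaffected split falls out immediately: one version line lists exactly the flapping nodes.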
### If you chose C:
[Result: Single-path I/O works but you have lost redundancy. If that one port fails, the node loses all storage access.]
### If you chose D:
[Result: SAN switches show CRC errors on ports connected to updated nodes. Helpful clue but does not identify root cause directly. Adds 10 minutes.]
## Round 2: First Triage Data
[Pressure cue: "Database team reports intermittent query timeouts. They are considering failing over to the DR site."]
What you see: Updated nodes are on HBA firmware 12.8.350. Non-updated nodes are on 12.6.240. The SAN switch firmware was updated 2 months ago and has a known incompatibility with HBA firmware 12.8.x when NPIV is enabled.
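Because the incompatibility only bites when NPIV is enabled, checking NPIV usage per HBA port tells you which hosts are actually exposed. A runnable sketch: `npiv_vports_inuse` is a real attribute of the Linux `fc_host` sysfs class, but a fake tree stands in for `/sys/class/fc_host` here, and the values are made up:

```shell
# Fake sysfs tree so the sketch runs without FC hardware.
FC_ROOT=/tmp/fake_fc_host
mkdir -p "$FC_ROOT/host1" "$FC_ROOT/host2"
echo 2 > "$FC_ROOT/host1/npiv_vports_inuse"
echo 0 > "$FC_ROOT/host2/npiv_vports_inuse"

# Flag any fc_host with active NPIV virtual ports.
for h in "$FC_ROOT"/host*; do
  inuse=$(cat "$h/npiv_vports_inuse")
  [ "$inuse" -gt 0 ] && flag=" (exposed: NPIV active)" || flag=""
  echo "$(basename "$h"): npiv_vports_inuse=$inuse$flag"
done
```

On a live node you would point `FC_ROOT` at `/sys/class/fc_host` instead.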
Choose your action:

- A) Roll back HBA firmware on the 6 updated nodes to 12.6.240
- B) Disable NPIV on the affected nodes to work around the incompatibility
- C) Update the remaining 6 nodes to 12.8.350 so everything matches
- D) Update the SAN switch firmware to a version compatible with 12.8.x
### If you chose A (recommended):
[Result: Firmware rollback via `hbacmd` restores stable paths. All 12 nodes are now on consistent firmware and the path flapping stops. Proceed to Round 3.]
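A rollback along these lines is worth wrapping in a guard so the wrong image cannot be flashed in a hurry. `hbacmd` is the Broadcom/Emulex OneCommand CLI; the WWPN, firmware file path, and the exact `Download` invocation below are placeholders for illustration, and the flash step itself is left commented out:

```shell
TARGET_FW="12.6.240"                              # approved rollback version
FW_FILE="/opt/firmware/lpfc-${TARGET_FW}.grp"     # assumed path and naming scheme
WWPN="10:00:00:00:c9:aa:bb:cc"                    # placeholder WWPN

# Refuse to proceed unless the image name matches the approved version.
case "$FW_FILE" in
  *"$TARGET_FW"*) echo "rolling back $WWPN to $TARGET_FW" ;;
  *) echo "firmware image does not match approved version" >&2; exit 1 ;;
esac

# Actual flash (requires hardware; run per affected node, one node at a time):
# hbacmd Download "$WWPN" "$FW_FILE"
```

Rolling nodes one at a time keeps multipath redundancy on the rest of the fleet while each HBA is reflashed.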
### If you chose B:
[Result: Disabling NPIV breaks virtual HBA assignments for VMs. Not acceptable in this environment.]
### If you chose C:
[Result: Updating the remaining nodes would make all 12 nodes affected. You would double the blast radius.]
### If you chose D:
[Result: Correct long-term fix but SAN switch firmware updates require a maintenance window with full SAN redundancy verification. Cannot do this during an incident.]
## Round 3: Root Cause Identification
[Pressure cue: "Paths are stable. Why did this happen?"]
What you see: The HBA firmware update was applied without checking the SAN switch firmware compatibility matrix. The Broadcom release notes for 12.8.350 explicitly list a known issue with the current SAN switch firmware version when NPIV is enabled.
Choose your action:

- A) Create a firmware compatibility matrix and enforce pre-update checks
- B) Schedule coordinated HBA + SAN switch firmware updates in the next maintenance window
- C) Add automated compatibility validation to the firmware update pipeline
- D) All of the above
### If you chose D (recommended):
[Result: Comprehensive prevention: compatibility matrix documents known issues, coordinated updates ensure matching versions, automated checks prevent rollout of incompatible combinations. Proceed to Round 4.]
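The combination of a matrix (A) and automation (C) can be sketched as a flat file plus a gate function that the rollout pipeline calls before flashing anything. The file format, version pairs, and the switch firmware string `9.0.1` are illustrative assumptions:

```shell
# Hypothetical compatibility matrix: "<hba_fw> <switch_fw> <status>".
cat > /tmp/compat_matrix.txt <<'EOF'
12.6.240 9.0.1 ok
12.8.350 9.0.1 incompatible-npiv
12.8.350 9.1.0 ok
EOF

# Gate: succeed only if the pair is explicitly marked ok.
check_compat() {  # usage: check_compat <hba_fw> <switch_fw>
  status=$(awk -v h="$1" -v s="$2" '$1==h && $2==s { print $3 }' /tmp/compat_matrix.txt)
  [ "$status" = "ok" ]
}

check_compat 12.8.350 9.0.1 || echo "blocked: 12.8.350 not validated against switch 9.0.1"
check_compat 12.6.240 9.0.1 && echo "approved: 12.6.240 with switch 9.0.1"
```

Note the gate fails closed: an unknown combination (not listed at all) is rejected the same way as a known-bad one.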
### If you chose A:
[Result: Manual matrix helps but relies on humans checking it. Partial fix.]
### If you chose B:
[Result: Fixes this specific case but does not prevent future mismatches.]
### If you chose C:
[Result: Automation is good but needs the matrix to validate against.]
## Round 4: Remediation
[Pressure cue: "All nodes stable. Close the incident."]
Actions:
1. Verify all nodes show healthy multipath: `multipath -ll` on each node
2. Confirm consistent HBA firmware: `systool -c fc_host -v | grep firmware`
3. Check I/O latency has returned to baseline
4. Document the firmware compatibility requirements
5. Schedule the coordinated SAN + HBA update for next maintenance window
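Step 1 can be scripted for the close-out check. Here a captured sample of `multipath -ll` output stands in for a live node so the sketch runs anywhere; on real nodes you would feed the live command output in instead:

```shell
# Sample multipath -ll output for one healthy map (device names are made up).
cat > /tmp/multipath_sample.txt <<'EOF'
mpatha (3600508b4000156d700012000000b0000) dm-0 HP,HSV210
`-+- policy='round-robin 0' prio=50 status=active
  |- 3:0:0:1 sda 8:0  active ready running
  `- 4:0:0:1 sdc 8:32 active ready running
EOF

# A clean close-out requires zero failed paths on every node.
failed=$(grep -c 'failed' /tmp/multipath_sample.txt || true)
echo "failed paths: $failed"
```

Running the same count across all 12 nodes and requiring `failed paths: 0` everywhere makes the "all stable" claim in the incident close-out verifiable rather than anecdotal.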
## Damage Report
- Total downtime: 0 (degraded I/O but no full outage)
- Blast radius: 6 nodes with intermittent storage path failures; database query latency increased 5x
- Optimal resolution time: 20 minutes (correlate nodes -> identify firmware delta -> rollback)
- If every wrong choice was made: 90+ minutes plus risk of total storage connectivity loss
## Cross-References
- Primer: Datacenter & Server Hardware
- Primer: Storage Ops
- Primer: Firmware
- Footguns: Firmware