Solution: HBA Firmware Mismatch Causing I/O Errors

Triage

  1. Confirm the firmware mismatch:
    • On each host: cat /sys/class/scsi_host/host*/fwrev or systool -c fc_host -v | grep firmware
    • Build a table of hostname -> firmware version
  2. Correlate errors to firmware version:
    • Check /var/log/messages on affected hosts for lpfc driver errors
    • multipathd show paths (look for paths in "faulty" or "shaky" state)
    • multipath -ll (check path group states)
  3. Check the FC fabric for RSCN activity:
    • On Brocade switches: fabricshow, nsshow, errdump
    • Look for RSCN storms or frequent zone reconfigurations
  4. Review the firmware release notes:
    • Download from Broadcom/Emulex support site
    • Search for: FPIN, ALUA, RSCN, "minimum firmware", "mixed environment"
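
The hostname -> firmware table can be built with a small script. The following is a minimal sketch, not a vetted tool: it assumes passwordless SSH to every host and a hosts.txt file with one hostname per line, and the function names collect_fw and summarize_fw are made up for illustration. The full fwrev string is used as the grouping key.

```bash
# Print "hostname<TAB>fwrev" for every SCSI host adapter that exposes fwrev.
# Assumes passwordless SSH; hosts_file is a hypothetical one-hostname-per-line list.
collect_fw() {
    local hosts_file=$1 host rev
    while read -r host; do
        [ -n "$host" ] || continue
        # -n keeps ssh from consuming the rest of hosts_file on stdin
        ssh -n "$host" 'cat /sys/class/scsi_host/host*/fwrev' 2>/dev/null |
            while read -r rev; do printf '%s\t%s\n' "$host" "$rev"; done
    done < "$hosts_file"
}

# Read "hostname<TAB>fwrev" lines on stdin and print one line per distinct
# firmware string followed by the hosts that report it.
# Exit status is 1 when more than one firmware string is present (mixed fleet).
summarize_fw() {
    LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k1,1 -u | awk -F'\t' '
        NR == 1 || $2 != ver {
            if (NR > 1) printf "\n"
            printf "%s\t%s", $2, $1
            ver = $2; n++
            next
        }
        { printf ",%s", $1 }
        END { if (NR > 0) printf "\n"; exit (n > 1) }'
}
```

A possible invocation is collect_fw hosts.txt | summarize_fw. A non-zero exit status means the fleet reports more than one firmware string.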

Root Cause

The new firmware (14.2.579.6) enables Fabric Performance Impact Notifications (FPIN) by default. FPIN sends congestion and link integrity notifications via the FC fabric. These notifications generate RSCNs (Registered State Change Notifications) that are broadcast to all hosts in the same zone.

The older firmware (12.8.351.49) does not understand FPIN-related RSCNs and misinterprets them as target port state changes. This causes the lpfc driver on old-firmware hosts to temporarily mark paths as failed and attempt SCSI device recovery. The paths recover after a few seconds when the SCSI error handler resets the LUN, but the burst of I/O errors causes application-visible failures.

The updated hosts handle FPINs correctly and are unaffected.
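
One way to check this explanation against the logs is to count lpfc messages per minute on an old-firmware host and see whether they arrive in short bursts. The sketch below assumes the traditional syslog timestamp layout ("Mon DD HH:MM:SS host ..."); logs with ISO timestamps would need a different key. The function name lpfc_per_minute is illustrative only.

```bash
# Print "count<TAB>Mon DD HH:MM" for each minute that contains lpfc messages,
# in the order the minutes first appear in the log.
# Reads the files given as arguments, or stdin when none are given.
# Assumes traditional syslog timestamps (month, day, HH:MM:SS as fields 1-3).
lpfc_per_minute() {
    awk '/lpfc/ {
             key = $1 " " $2 " " substr($3, 1, 5)   # truncate HH:MM:SS to HH:MM
             if (!(key in count)) order[++n] = key
             count[key]++
         }
         END {
             for (i = 1; i <= n; i++) printf "%d\t%s\n", count[order[i]], order[i]
         }' "$@"
}
```

For example, lpfc_per_minute /var/log/messages. Bursts that line up in time across the old-firmware hosts and are absent on the updated hosts would be consistent with the root cause described above.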

Fix

  1. Immediate mitigation (if updates cannot be done right now):
    • Disable FPIN on the updated hosts temporarily:
      • echo 0 > /sys/class/scsi_host/host0/lpfc_enable_fpin
      • Repeat for all FC host adapters
    • This stops the FPIN-triggered RSCNs and stabilizes the old-firmware hosts
    • Note: this is a runtime change; it reverts on reboot
  2. Definitive fix: Complete the firmware update on the remaining 8 hosts:
    • Schedule a rolling maintenance window (one host at a time)
    • Update procedure per host:
      • Drain workload (evacuate VMs or stop applications)
      • hbacmd download <wwn> <firmware.grp> or use Emulex OneCommand Manager
      • Reboot the server (firmware activates on reboot)
      • Verify new firmware: systool -c fc_host -v | grep firmware
      • Verify multipath health: multipath -ll (all paths should be active)
      • Return workload
  3. Post-update validation:
    • Re-enable FPIN on all hosts (should be default with new firmware)
    • Monitor for 24 hours: no I/O errors, no path flapping
    • Verify fabric switch logs are clean (no RSCN storms)
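
The "repeat for all FC host adapters" part of the mitigation can be looped rather than typed per adapter. This is a sketch under assumptions: it uses the lpfc_enable_fpin attribute exactly as named in the mitigation step, skips any SCSI host that does not expose it, and must run as root. The function name set_fpin and the optional second argument (an alternate sysfs root, useful for a dry run against a copy) are illustrative.

```bash
# Write VALUE (0 or 1) to lpfc_enable_fpin on every SCSI host that exposes it.
# Usage: set_fpin 0|1 [sysfs_root]
# Returns 2 on bad usage, 1 if no adapter was changed, 0 otherwise.
set_fpin() {
    local value=$1 root=${2:-/sys/class/scsi_host} attr changed=0
    case $value in
        0|1) ;;
        *) echo "usage: set_fpin 0|1 [sysfs_root]" >&2; return 2 ;;
    esac
    for attr in "$root"/host*/lpfc_enable_fpin; do
        [ -e "$attr" ] || continue      # glob did not match, or attribute absent
        if echo "$value" > "$attr"; then
            # Read the value back so the operator sees what the adapter reports
            printf '%s -> %s\n' "${attr%/lpfc_enable_fpin}" "$(cat "$attr")"
            changed=$((changed + 1))
        else
            echo "failed to write $attr" >&2
        fi
    done
    [ "$changed" -gt 0 ]
}
```

Running set_fpin 0 on each updated host applies the mitigation, and set_fpin 1 undoes it during post-update validation. As with the manual command, the change does not survive a reboot.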

Rollback / Safety

  • If the firmware update itself causes issues, the old firmware can be re-flashed.
  • Disabling FPIN is a safe interim measure with no performance impact.
  • Always update one host at a time in a rolling fashion to avoid fleet-wide impact.
  • Ensure multipath has at least 2 active paths before starting any host update.
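
The "at least 2 active paths" check can be scripted as a gate before each host update. The sketch below parses multipath -ll output and assumes the usual layout: a map header line containing " dm-N " and path lines containing an H:C:T:L address with the state "active ready". The function name check_min_paths is illustrative, and the parsing should be verified against the output of the multipath-tools version actually in use.

```bash
# Read `multipath -ll` output on stdin and print "map<TAB>active_count" for
# every map with fewer than MIN paths in the "active ready" state.
# Usage: multipath -ll | check_min_paths [MIN]    (MIN defaults to 2)
# Exit status is 1 if any map is below the threshold.
check_min_paths() {
    local min=${1:-2}
    awk -v min="$min" '
        function report() {
            if (map != "" && active < min) { printf "%s\t%d\n", map, active; bad = 1 }
        }
        / dm-[0-9]+ / { report(); map = $1; active = 0; next }        # map header
        /[0-9]+:[0-9]+:[0-9]+:[0-9]+ / && /active ready/ { active++ } # healthy path
        END { report(); exit bad + 0 }'
}
```

A possible gate is multipath -ll | check_min_paths 2 || echo "do not start the update". An empty output with exit status 0 means every map met the threshold.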

Common Traps

  • Blaming the SAN or fabric: The storage and switches are fine; the issue is host-side firmware incompatibility.
  • Partial fleet updates: Never leave a mixed firmware state in production. If a maintenance window closes early, disable new features on updated hosts until the rest can be done.
  • Ignoring RSCN storms: Each FPIN generates RSCNs that fan out to all hosts in the zone. In a large fabric, this creates a cascade effect.
  • Not reading release notes: The FPIN behavioral change was documented in the firmware release notes as "enabled by default." Always review release notes before fleet-wide firmware updates.
  • Updating firmware without draining workload: HBA firmware updates require a reboot. Rebooting a server with active I/O to the SAN can cause data corruption if writes are in flight.