Solution: HBA Firmware Mismatch Causing I/O Errors

Triage

Confirm the firmware mismatch:
- On each host: cat /sys/class/scsi_host/host*/fwrev or systool -c fc_host -v | grep firmware
- Build a table of hostname -> firmware version
Correlate errors to firmware version:
- Check /var/log/messages on affected hosts for lpfc driver errors
- multipathd show paths: look for paths in the "faulty" or "shaky" state
- multipath -ll: check path group states
Check the FC fabric for RSCN activity:
- On Brocade switches: fabricshow, nsshow, errdump
- Look for RSCN storms or frequent zone reconfigurations
Review the firmware release notes:
- Download from the Broadcom/Emulex support site
- Search for: FPIN, ALUA, RSCN, "minimum firmware", "mixed environment"
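
The first triage step can be scripted. The following is a minimal shell sketch, not a vetted tool: it prints one tab-separated row per HBA with the node name, the SCSI host entry, and the raw contents of fwrev. The function name fw_inventory and the SYSFS_ROOT override are hypothetical, introduced only for illustration. Collecting rows from several hosts (for example over ssh) is left out.

```shell
# Sketch: list firmware revisions for every SCSI host that exposes fwrev.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) so the function
# can be tried against a copy of the sysfs tree.
fw_inventory() (
    root="${SYSFS_ROOT:-/sys}"
    node="$(uname -n)"
    for f in "$root"/class/scsi_host/host*/fwrev; do
        [ -r "$f" ] || continue   # skip if the glob matched nothing
        hba="$(basename "$(dirname "$f")")"
        printf '%s\t%s\t%s\n' "$node" "$hba" "$(cat "$f")"
    done
)
```

Running fw_inventory on each host and concatenating the output gives the hostname -> firmware table. Sorting on the third column groups hosts by firmware version.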

Root Cause

The new firmware (14.2.579.6) enables Fabric Performance Impact Notifications (FPIN) by default. FPIN sends congestion and link integrity notifications via the FC fabric. These notifications generate RSCNs (Registered State Change Notifications) that are broadcast to all hosts in the same zone.

The older firmware (12.8.351.49) does not understand FPIN-related RSCNs and misinterprets them as target port state changes. This causes the lpfc driver on old-firmware hosts to temporarily mark paths as failed and attempt SCSI device recovery. The paths recover after a few seconds when the SCSI error handler resets the LUN, but the burst of I/O errors causes application-visible failures.

The updated hosts handle FPINs correctly and are unaffected.
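
The short bursts described above should show up as clusters of lpfc messages in the log of an old-firmware host. As a rough sketch, the function below counts lpfc-related lines per minute. It assumes the traditional syslog prefix ("Mon DD HH:MM:SS host ..."); the function name lpfc_bursts is hypothetical, and a host logging in another timestamp format would need a different key.

```shell
# Sketch: count lpfc-related log lines per minute.
# Assumes the traditional syslog prefix "Mon DD HH:MM:SS host ...".
lpfc_bursts() {
    awk '
        tolower($0) ~ /lpfc/ {
            key = $1 " " $2 " " substr($3, 1, 5)   # month, day, HH:MM
            if (!(key in count)) order[++n] = key  # remember first-seen order
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) print count[order[i]], order[i]
        }
    '
}
```

Usage: lpfc_bursts < /var/log/messages. Minutes with a high count that line up across old-firmware hosts, and are absent on updated hosts, support the correlation to firmware version.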

Fix

Immediate mitigation (if updates cannot be done right now):
Disable FPIN on the updated hosts temporarily:
- echo 0 > /sys/class/scsi_host/host0/lpfc_enable_fpin
- Repeat for all FC host adapters
This stops the FPIN-triggered RSCNs and stabilizes the old-firmware hosts
Note: this is a runtime change; it reverts on reboot
Definitive fix: Complete the firmware update on the remaining 8 hosts:
Schedule a rolling maintenance window (one host at a time)
Update procedure per host:
- Drain workload (evacuate VMs or stop applications)
- hbacmd download <wwn> <firmware.grp> or use Emulex OneCommand Manager
- Reboot the server (firmware activates on reboot)
- Verify new firmware: systool -c fc_host -v | grep firmware
- Verify multipath health: multipath -ll should show all paths active
- Return workload
Post-update validation:
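Re-enable FPIN on any host where it was disabled as a mitigation (the runtime setting reverts on reboot, so the new-firmware default should already be back in effect)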
Monitor for 24 hours: no I/O errors, no path flapping
Verify fabric switch logs are clean (no RSCN storms)
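
The immediate mitigation above says to repeat the echo for all FC host adapters. A minimal sketch of that loop follows. It uses the lpfc_enable_fpin attribute path exactly as given in the mitigation step and skips hosts that do not expose it; whether a given driver build exposes this attribute has not been verified here. The function name set_fpin and the SYSFS_ROOT override are hypothetical.

```shell
# Sketch: write 0 or 1 to lpfc_enable_fpin on every SCSI host that has it.
# The attribute name is taken from the mitigation step above; SYSFS_ROOT is a
# hypothetical override (defaults to /sys) for trying this on a test tree.
set_fpin() (
    value="$1"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_fpin 0|1" >&2; exit 2 ;;
    esac
    root="${SYSFS_ROOT:-/sys}"
    for attr in "$root"/class/scsi_host/host*/lpfc_enable_fpin; do
        [ -w "$attr" ] || continue   # host has no such attribute
        printf '%s\n' "$value" > "$attr"
        printf '%s = %s\n' "$attr" "$(cat "$attr")"
    done
)
```

Usage: set_fpin 0 as root on each updated host for the mitigation, and set_fpin 1 to restore it. Like the single echo, this is a runtime change only.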

Rollback / Safety

If the firmware update itself causes issues, the old firmware can be re-flashed.
Disabling FPIN is a safe interim measure with no performance impact.
Always update one host at a time in a rolling fashion to avoid fleet-wide impact.
Ensure multipath has at least 2 active paths before starting any host update.
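
The two-active-paths check can be automated as a gate before each host update. The sketch below reads multipath -ll output on stdin and counts path lines in the "active ready" state per map, exiting non-zero if any map falls below the minimum. It assumes the usual multipath -ll layout (a map header line containing dm-N, followed by path lines with an H:C:T:L address); the function name check_min_paths is hypothetical, and the parsing should be checked against the output of the installed multipath-tools version.

```shell
# Sketch: verify every multipath map has at least N "active ready" paths.
# Reads `multipath -ll` output on stdin; N defaults to 2.
check_min_paths() {
    awk -v min="${1:-2}" '
        / dm-[0-9]+ / {            # map header line: first field is the map name
            map = $1
            healthy[map] += 0      # list the map even if it has no healthy path
            next
        }
        map != "" && /[0-9]+:[0-9]+:[0-9]+:[0-9]+/ && /active ready/ {
            healthy[map]++         # path line in the "active ready" state
        }
        END {
            for (m in healthy) {
                print m, healthy[m]
                if (healthy[m] < min) bad = 1
            }
            exit (bad ? 1 : 0)
        }
    '
}
```

Usage: multipath -ll | check_min_paths 2 || echo "do not start the update". Paths in any other state (for example ghost or faulty) are deliberately not counted.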

Common Traps

Blaming the SAN or fabric: The storage and switches are fine; the issue is host-side firmware incompatibility.
Partial fleet updates: Never leave a mixed firmware state in production. If a maintenance window closes early, disable new features on updated hosts until the rest can be done.
Ignoring RSCN storms: Each FPIN generates RSCNs that fan out to all hosts in the zone. In a large fabric, this creates a cascade effect.
Not reading release notes: The FPIN behavioral change was documented in the firmware release notes as "enabled by default." Always review release notes before fleet-wide firmware updates.
Updating firmware without draining workload: HBA firmware updates require a reboot. Rebooting a server with active I/O to the SAN can cause data corruption if writes are in flight.