Solution: NVMe Drive Disappeared After Reboot
Triage
- Confirm the drive is not visible at any layer:
- `lspci | grep -i nvme` -- check PCIe enumeration
- `lsblk` -- check block device presence
- `lsmod | grep nvme` -- confirm driver modules are loaded
- `dmesg | grep -iE 'nvme|pci.*slot'` -- look for enumeration or driver errors
- Check the BMC/iDRAC system event log for hardware alerts:
- `ipmitool sel list` or the iDRAC web UI under Hardware > Storage
- Verify no BIOS settings changed during the patching window:
- Check if BIOS was updated; review BIOS changelog for PCIe bifurcation changes
- Enter BIOS setup and confirm PCIe slot 3 is enabled and bifurcation is set correctly (x4 for NVMe)
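The visibility checks above can be rolled into one pass. A minimal sketch using the commands listed in the triage steps (run as root so `dmesg` is readable; each check prints a fallback line when nothing is found):

```shell
#!/bin/sh
# NVMe visibility triage sketch covering the layers from the runbook:
# PCIe enumeration, block device, driver modules, kernel messages.
echo "== PCIe enumeration =="
lspci 2>/dev/null | grep -i nvme || echo "no NVMe endpoint on the PCIe bus"

echo "== Block devices =="
lsblk -d -o NAME,SIZE,MODEL 2>/dev/null | grep -i nvme || echo "no nvme block device"

echo "== Driver modules =="
lsmod 2>/dev/null | grep -E '^nvme' || echo "nvme modules not listed (may be built-in)"

echo "== Kernel messages =="
dmesg 2>/dev/null | grep -iE 'nvme|pci.*slot' | tail -n 20
```

If the drive is missing at the `lspci` layer, the problem is below the OS (slot, riser, BIOS, or the drive itself); if `lspci` sees it but `lsblk` does not, suspect the driver or the namespace.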
Root Cause
The most common causes of an NVMe drive disappearing after a reboot are:
- PCIe slot/riser issue: The drive or riser card became partially unseated. Thermal cycling during reboot can cause subtle connector shifts.
- BIOS/firmware change: A BIOS update reset PCIe bifurcation settings, disabling the NVMe slot or changing it from x4 to x16 mode when the backplane expects bifurcated mode.
- Drive failure: The NVMe controller on the drive has failed. Check if the drive is warm to the touch (power present but controller hung) or cold (no power at all).
- Kernel/driver regression: A kernel update removed or broke NVMe support for this specific controller. Less likely but possible.
In this scenario, the root cause is a partially unseated drive after thermal cycling combined with a marginal connector that was previously making just enough contact.
Fix
- Immediate: Schedule a brief maintenance window (5 min) to physically reseat the NVMe drive:
- Power down the server gracefully
- Remove and firmly re-insert the NVMe drive in PCIe slot 3
- Inspect the PCIe connector and backplane for damage or debris
- Power on and verify the drive appears in BIOS POST, `lspci`, and `lsblk`
- Validate data integrity: Once the drive is back, check the filesystem:
- `nvme smart-log /dev/nvme0n1` -- check for media errors and unsafe shutdowns
- `xfs_repair -n /dev/nvme0n1p1` (or the appropriate fsck) -- dry-run check
- Rejoin the cluster: Follow the database procedure to rebuild/resync the local cache tier.
- If reseat does not work:
- Test the drive in another slot or another server
- Test the slot with a known-good NVMe drive
- If the drive is dead, initiate RMA with Samsung; if the slot is dead, RMA the riser/server
Rollback / Safety
- The database cluster is designed to tolerate one node's cache being offline; confirm cluster health before and after.
- Do not attempt hot-plug of NVMe unless the server and backplane explicitly support it.
- If the drive had unsaved data, do NOT run fsck with repair flags until a backup assessment is done.
- Keep the maintenance window short; the cluster is already degraded.
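Whether a slot supports hot-plug can be checked from the OS before anyone pulls a drive live. A hedged sketch: the bus address below is a placeholder (find the downstream port above the drive with `lspci -t`), and reading slot capabilities may require root:

```shell
#!/bin/sh
# Check whether a PCIe downstream port advertises hot-plug support.
# The default address is a placeholder; pass the real bridge address
# as the first argument.
ADDR="${1:-0000:03:00.0}"

# "HotPlug+" in the SltCap line means the port supports hot-plug;
# "HotPlug-" or no output means power down first.
lspci -vv -s "$ADDR" 2>/dev/null | grep -i 'hotplug' \
    || echo "no hot-plug capability reported for $ADDR -- power down first"
```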
Common Traps
- Assuming drive failure immediately: Many NVMe "disappearances" are connector/seating issues, not dead drives. Always reseat first.
- Ignoring BIOS bifurcation: After BIOS updates, PCIe bifurcation settings can silently reset to defaults, making NVMe slots invisible.
- Forgetting to check the riser card: On rack servers, NVMe drives often connect through a riser; a loose riser affects all devices on it.
- Not checking `dmesg` timestamps: Errors from the current boot can be confused with errors from a previous boot. Use `journalctl -b 0` to limit output to the current boot.
- Hot-plugging NVMe without support: Can cause data corruption or electrical damage if the backplane doesn't support it.
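The boot-scoping trap above is easy to avoid with journal boot selectors. A small sketch (assumes systemd-journald; `-b 0` selects the current boot, `-b -1` the previous one, `-k` restricts to kernel messages):

```shell
#!/bin/sh
# Search only the current boot's kernel messages, so stale NVMe errors
# from the previous boot are not mistaken for new ones.
journalctl -b 0 -k --no-pager 2>/dev/null | grep -iE 'nvme|pcieport' \
    || echo "no nvme/pcieport kernel messages in the current boot"
```

Running the same pipeline with `-b -1` shows what the drive logged on the way down, which is often where the first enumeration error appears.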