Solution: NVMe Drive Disappeared After Reboot

Triage

  1. Confirm the drive is not visible at any layer:
     • lspci | grep -i nvme -- check PCIe enumeration
     • lsblk -- check block device presence
     • lsmod | grep nvme -- confirm driver modules loaded
     • dmesg | grep -iE 'nvme|pci.*slot' -- look for enumeration or driver errors
  2. Check the BMC/iDRAC system event log for hardware alerts:
     • ipmitool sel list, or the iDRAC web UI under Hardware > Storage
  3. Verify no BIOS settings changed during the patching window:
     • Check whether the BIOS was updated; review the BIOS changelog for PCIe bifurcation changes
     • Enter BIOS setup and confirm PCIe slot 3 is enabled and bifurcation is set correctly (x4 for NVMe)
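The four visibility checks in step 1 can be wrapped in a small helper so a missing layer stands out at a glance. A minimal sketch; the search patterns are examples (an NVMe device typically appears in lspci as a "Non-Volatile memory controller"):

```shell
# Sketch of the layer-by-layer visibility check. Patterns are examples;
# adjust them for your hardware.
check_layer() {
  # $1 = layer name, $2 = text to search, $3 = case-insensitive pattern
  if printf '%s\n' "$2" | grep -qiE -- "$3"; then
    echo "OK:   $1"
  else
    echo "MISS: $1"
  fi
}

check_layer "PCIe enumeration" "$(lspci 2>/dev/null)" 'non-volatile|nvme'
check_layer "Block device"     "$(lsblk 2>/dev/null)" 'nvme'
check_layer "Driver modules"   "$(lsmod 2>/dev/null)" 'nvme'
check_layer "Kernel messages"  "$(dmesg 2>/dev/null)" 'nvme|pci.*slot'
```

A "MISS" at the PCIe layer points at hardware/BIOS; a miss only at the block layer points at the driver or the drive's controller.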

Root Cause

The most common causes of an NVMe drive disappearing after a reboot are:

  1. PCIe slot/riser issue: The drive or riser card became partially unseated. Thermal cycling during reboot can cause subtle connector shifts.
  2. BIOS/firmware change: A BIOS update reset PCIe bifurcation settings, disabling the NVMe slot or reverting it from bifurcated x4 mode to a single x16 link when the backplane expects bifurcation.
  3. Drive failure: The NVMe controller on the drive has failed. Check if the drive is warm to the touch (power present but controller hung) or cold (no power at all).
  4. Kernel/driver regression: A kernel update removed or broke NVMe support for this specific controller. Less likely but possible.

In this scenario, the root cause was a drive partially unseated by thermal cycling, combined with a marginal connector that had previously been making just enough contact.
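Causes 2 and 4 can be ruled in or out quickly by comparing firmware and kernel versions against values recorded before the patching window. A hedged sketch; the /var/tmp/*.before files are an assumed local convention, not a standard location:

```shell
# Sketch: compare BIOS and kernel versions against pre-reboot records.
# The /var/tmp/*.before files are an assumption; dmidecode needs root.
changed() {
  # $1 = label, $2 = previous value, $3 = current value
  if [ "$2" != "$3" ]; then
    echo "$1 changed: $2 -> $3"
  else
    echo "$1 unchanged ($3)"
  fi
}
changed "BIOS"   "$(cat /var/tmp/bios.before 2>/dev/null)"   "$(dmidecode -s bios-version 2>/dev/null)"
changed "Kernel" "$(cat /var/tmp/kernel.before 2>/dev/null)" "$(uname -r)"
```

If both are unchanged, hardware causes (1 and 3) become the leading suspects, which matches the reseat-first guidance below.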

Fix

  1. Immediate: Schedule a brief maintenance window (~5 min) to physically reseat the NVMe drive:
     • Power down the server gracefully
     • Remove and firmly re-insert the NVMe drive in PCIe slot 3
     • Inspect the PCIe connector and backplane for damage or debris
     • Power on and verify the drive appears in BIOS POST, lspci, and lsblk
  2. Validate data integrity: Once the drive is back, check the filesystem:
     • nvme smart-log /dev/nvme0n1 -- check for media errors and unsafe shutdowns
     • xfs_repair -n /dev/nvme0n1p1 (or the appropriate fsck) -- dry-run check only
  3. Rejoin the cluster: Follow the database procedure to rebuild/resync the local cache tier.
  4. If reseating does not work:
     • Test the drive in another slot or another server
     • Test the slot with a known-good NVMe drive
     • If the drive is dead, initiate an RMA with Samsung; if the slot is dead, RMA the riser/server
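The post-reseat checks in steps 1-2 can be scripted so the verification is repeatable. A hedged sketch, assuming the drive re-enumerates as /dev/nvme0n1 with an XFS partition; substitute your device and filesystem:

```shell
# Hedged sketch of post-reseat verification. /dev/nvme0n1 and the XFS
# partition name are assumptions.
check() {
  # $1 = description; remaining args = the command to attempt
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}
check "PCIe enumeration"     sh -c "lspci | grep -qi 'non-volatile'"
check "Block device present" lsblk /dev/nvme0n1
check "SMART log readable"   nvme smart-log /dev/nvme0n1
# Dry-run only (-n): never run a repairing fsck before backup assessment.
check "Filesystem dry-run"   xfs_repair -n /dev/nvme0n1p1
```

Any FAIL here means the drive should not rejoin the cluster yet.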

Rollback / Safety

  • The database cluster is designed to tolerate one node's cache being offline; confirm cluster health before and after.
  • Do not attempt hot-plug of NVMe unless the server and backplane explicitly support it.
  • If the drive had unsaved data, do NOT run fsck with repair flags until a backup assessment is done.
  • Keep the maintenance window short; the cluster is already degraded.

Common Traps

  • Assuming drive failure immediately: Many NVMe "disappearances" are connector/seating issues, not dead drives. Always reseat first.
  • Ignoring BIOS bifurcation: After BIOS updates, PCIe bifurcation settings can silently reset to defaults, making NVMe slots invisible.
  • Forgetting to check the riser card: On rack servers, NVMe drives often connect through a riser; a loose riser affects all devices on it.
  • Not checking dmesg timestamps: Errors from the current boot vs. previous boot can be confused. Use journalctl -b 0 for the current boot only.
  • Hot-plugging NVMe without support: Can cause data corruption or electrical damage if the backplane doesn't support it.
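The dmesg-timestamp trap above is easiest to avoid by scoping the journal to one boot at a time. A short sketch; the error patterns are illustrative, not exhaustive:

```shell
# Sketch: inspect kernel messages per boot so stale errors from the
# previous boot are not mistaken for live ones. Patterns are examples.
filter_nvme_errors() {
  grep -iE 'nvme|pcie bus error|link.*down' || true  # no match is not an error
}
echo "--- current boot ---"
journalctl -b 0 -k --no-pager 2>/dev/null | filter_nvme_errors
echo "--- previous boot ---"
journalctl -b -1 -k --no-pager 2>/dev/null | filter_nvme_errors
```

If the same error appears in both boots, the problem predates the reboot; if it appears only in the current boot, the reboot (or what changed during it) is the trigger.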