Solution: NVMe Drive Disappeared After Reboot

Triage

  1. Confirm the drive is not visible at any layer:
     • lspci | grep -i nvme -- check PCIe enumeration
     • lsblk -- check block device presence
     • lsmod | grep nvme -- confirm driver modules loaded
     • dmesg | grep -iE 'nvme|pci.*slot' -- look for enumeration or driver errors
  2. Check the BMC/iDRAC system event log for hardware alerts:
     • ipmitool sel list, or the iDRAC web UI under Hardware > Storage
  3. Verify no BIOS settings changed during the patching window:
     • Check whether the BIOS was updated; review the BIOS changelog for PCIe bifurcation changes
     • Enter BIOS setup and confirm PCIe slot 3 is enabled and bifurcation is set correctly (x4 for NVMe)
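The four visibility checks in step 1 can be wrapped in a small helper so a missing layer stands out at a glance. A minimal sketch; the search patterns are examples (an NVMe device typically appears in lspci as a "Non-Volatile memory controller"):

```shell
# Sketch of the layer-by-layer visibility check. Patterns are examples;
# adjust them for your hardware.
check_layer() {
  # $1 = layer name, $2 = text to search, $3 = case-insensitive pattern
  if printf '%s\n' "$2" | grep -qiE -- "$3"; then
    echo "OK:   $1"
  else
    echo "MISS: $1"
  fi
}

check_layer "PCIe enumeration" "$(lspci 2>/dev/null)" 'non-volatile|nvme'
check_layer "Block device"     "$(lsblk 2>/dev/null)" 'nvme'
check_layer "Driver modules"   "$(lsmod 2>/dev/null)" 'nvme'
check_layer "Kernel messages"  "$(dmesg 2>/dev/null)" 'nvme|pci.*slot'
```

A "MISS" at the PCIe layer points at hardware/BIOS; a miss only at the block layer points at the driver or the drive's controller.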

Root Cause

The most common causes of an NVMe drive disappearing after a reboot are:

  1. PCIe slot/riser issue: The drive or riser card became partially unseated. Thermal cycling during reboot can cause subtle connector shifts.
  2. BIOS/firmware change: A BIOS update reset PCIe bifurcation settings, disabling the NVMe slot or reverting it from bifurcated x4 mode to a single x16 link when the backplane expects bifurcation.
  3. Drive failure: The NVMe controller on the drive has failed. Check if the drive is warm to the touch (power present but controller hung) or cold (no power at all).
  4. Kernel/driver regression: A kernel update removed or broke NVMe support for this specific controller. Less likely but possible.

In this scenario, the root cause was a drive partially unseated by thermal cycling, combined with a marginal connector that had previously been making just enough contact.
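Causes 2 and 4 can be ruled in or out quickly by comparing firmware and kernel versions against values recorded before the patching window. A hedged sketch; the /var/tmp/*.before files are an assumed local convention, not a standard location:

```shell
# Sketch: compare BIOS and kernel versions against pre-reboot records.
# The /var/tmp/*.before files are an assumption; dmidecode needs root.
changed() {
  # $1 = label, $2 = previous value, $3 = current value
  if [ "$2" != "$3" ]; then
    echo "$1 changed: $2 -> $3"
  else
    echo "$1 unchanged ($3)"
  fi
}
changed "BIOS"   "$(cat /var/tmp/bios.before 2>/dev/null)"   "$(dmidecode -s bios-version 2>/dev/null)"
changed "Kernel" "$(cat /var/tmp/kernel.before 2>/dev/null)" "$(uname -r)"
```

If both are unchanged, hardware causes (1 and 3) become the leading suspects, which matches the reseat-first guidance below.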

Fix

  1. Immediate: Schedule a brief maintenance window (~5 min) to physically reseat the NVMe drive:
     • Power down the server gracefully
     • Remove and firmly re-insert the NVMe drive in PCIe slot 3
     • Inspect the PCIe connector and backplane for damage or debris
     • Power on and verify the drive appears in BIOS POST, lspci, and lsblk
  2. Validate data integrity: Once the drive is back, check the filesystem:
     • nvme smart-log /dev/nvme0n1 -- check for media errors and unsafe shutdowns
     • xfs_repair -n /dev/nvme0n1p1 (or the appropriate fsck) -- dry-run check only
  3. Rejoin the cluster: Follow the database procedure to rebuild/resync the local cache tier.
  4. If reseating does not work:
     • Test the drive in another slot or another server
     • Test the slot with a known-good NVMe drive
     • If the drive is dead, initiate an RMA with Samsung; if the slot is dead, RMA the riser/server
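The post-reseat checks in steps 1-2 can be scripted so the verification is repeatable. A hedged sketch, assuming the drive re-enumerates as /dev/nvme0n1 with an XFS partition; substitute your device and filesystem:

```shell
# Hedged sketch of post-reseat verification. /dev/nvme0n1 and the XFS
# partition name are assumptions.
check() {
  # $1 = description; remaining args = the command to attempt
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}
check "PCIe enumeration"     sh -c "lspci | grep -qi 'non-volatile'"
check "Block device present" lsblk /dev/nvme0n1
check "SMART log readable"   nvme smart-log /dev/nvme0n1
# Dry-run only (-n): never run a repairing fsck before backup assessment.
check "Filesystem dry-run"   xfs_repair -n /dev/nvme0n1p1
```

Any FAIL here means the drive should not rejoin the cluster yet.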

Rollback / Safety

  • The database cluster is designed to tolerate one node's cache being offline; confirm cluster health before and after.
  • Do not attempt hot-plug of NVMe unless the server and backplane explicitly support it.
  • If the drive had unsaved data, do NOT run fsck with repair flags until a backup assessment is done.
  • Keep the maintenance window short; the cluster is already degraded.

Common Traps

  • Assuming drive failure immediately: Many NVMe "disappearances" are connector/seating issues, not dead drives. Always reseat first.
  • Ignoring BIOS bifurcation: After BIOS updates, PCIe bifurcation settings can silently reset to defaults, making NVMe slots invisible.
  • Forgetting to check the riser card: On rack servers, NVMe drives often connect through a riser; a loose riser affects all devices on it.
  • Not checking dmesg timestamps: Errors from the current boot vs. previous boot can be confused. Use journalctl -b 0 for the current boot only.
  • Hot-plugging NVMe without support: Can cause data corruption or electrical damage if the backplane doesn't support it.
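The dmesg-timestamp trap above is easiest to avoid by scoping the journal to one boot at a time. A short sketch; the error patterns are illustrative, not exhaustive:

```shell
# Sketch: inspect kernel messages per boot so stale errors from the
# previous boot are not mistaken for live ones. Patterns are examples.
filter_nvme_errors() {
  grep -iE 'nvme|pcie bus error|link.*down' || true  # no match is not an error
}
echo "--- current boot ---"
journalctl -b 0 -k --no-pager 2>/dev/null | filter_nvme_errors
echo "--- previous boot ---"
journalctl -b -1 -k --no-pager 2>/dev/null | filter_nvme_errors
```

If the same error appears in both boots, the problem predates the reboot; if it appears only in the current boot, the reboot (or what changed during it) is the trigger.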