Incident Replay: NVMe Drive Disappeared

Setup

  • System context: High-performance compute node with 4x NVMe SSDs in a software RAID-10 array. The node runs real-time analytics workloads.
  • Time: Monday 05:30 UTC
  • Your role: On-call SRE / storage engineer

Round 1: Alert Fires

[Pressure cue: "RAID monitoring fires — /dev/md0 is degraded. One NVMe device has vanished from the kernel. Analytics pipeline is running slower."]

What you see: cat /proc/mdstat shows md0 is degraded with 3 of 4 devices active. lsblk shows only 3 NVMe devices — /dev/nvme2n1 is missing. dmesg shows "nvme nvme2: controller is down."
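The degraded state can be confirmed without touching hardware. A minimal sketch, run here against a sample of the `/proc/mdstat` output this incident would produce (device names and block counts are illustrative):

```shell
# Sample /proc/mdstat content for a 4-disk RAID-10 with one device missing;
# on the node itself you would read the file directly: cat /proc/mdstat
mdstat='md0 : active raid10 nvme3n1p1[3] nvme1n1p1[1] nvme0n1p1[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]'

# "[4/3]" means 4 slots, 3 active; the underscore in "[UU_U]" marks the failed slot
if echo "$mdstat" | grep -q '\[U*_U*\]'; then
  echo "DEGRADED"
fi
```

Pair this with `lsblk -o NAME,SIZE,MODEL` to see which namespace is gone and `dmesg | grep -i nvme` for the controller message.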

Choose your action:

- A) Immediately hot-swap the NVMe drive
- B) Check dmesg and NVMe SMART data for more context
- C) Reboot the server to re-detect the drive
- D) Start a RAID rebuild with a spare drive

If you chose A:

[Result: You pull the drive but it might have been a transient PCIe issue — the drive itself could be fine. You have committed to a replacement unnecessarily.]

If you chose B:

[Result: dmesg shows "PCIe Bus Error: severity=Corrected, type=Physical Layer" followed by "nvme2: controller reset failed." SMART data from the remaining drives is healthy. The issue is on the PCIe bus, not the drive media. Proceed to Round 2.]
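The distinction the result draws can be checked mechanically. A sketch against sample kernel-log lines matching this incident (on the node you would run `dmesg | grep -Ei 'pcie|nvme'`; the PCIe address is illustrative):

```shell
# Sample dmesg lines; a corrected PCIe-layer error immediately preceding an
# NVMe controller reset failure points at the bus, not the drive media.
log='pcieport 0000:3b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer
nvme nvme2: controller reset failed'

if echo "$log" | grep -q 'PCIe Bus Error' \
   && echo "$log" | grep -q 'controller reset failed'; then
  echo "suspect: PCIe link, not drive media"
fi
```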

If you chose C:

[Result: Risky — if the drive is truly dead, reboot may hang on storage initialization. Also causes full service downtime instead of just degraded RAID.]

If you chose D:

[Result: There is no spare NVMe in this server, and the original drive might be fine — the fault is on the PCIe bus, not the drive.]

Round 2: First Triage Data

[Pressure cue: "RAID is degraded but functional. Analytics pipeline is 40% slower. Product team wants an ETA."]

What you see: PCIe correctable errors preceded the NVMe controller reset. The NVMe drive is physically present but the PCIe link is down. This could be a loose drive in the M.2/U.2 slot, a PCIe lane failure, or a thermal issue.

Choose your action:

- A) Reseat the NVMe drive in its slot
- B) Try to reset the PCIe bus from the OS: echo 1 > /sys/bus/pci/devices/<addr>/reset
- C) Check the server's internal temperature sensors
- D) Replace the NVMe drive with a spare

If you chose B:

[Result: PCIe bus reset re-enumerates the device. nvme list shows /dev/nvme2n1 is back. mdadm --manage /dev/md0 --re-add /dev/nvme2n1p1 re-adds it to the RAID. Rebuild starts. Proceed to Round 3.]
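A sketch of the full recovery sequence, assuming the PCI function is still enumerated in sysfs (run as root; the sysfs paths are standard, the resolved address is illustrative):

```shell
# Resolve the PCIe address behind the dead controller
addr=$(basename "$(readlink -f /sys/class/nvme/nvme2/device)")  # e.g. 0000:3b:00.0

echo 1 > "/sys/bus/pci/devices/$addr/reset"      # function-level reset
nvme list                                        # /dev/nvme2n1 should reappear
mdadm --manage /dev/md0 --re-add /dev/nvme2n1p1  # re-add to the array
cat /proc/mdstat                                 # rebuild should now be running
```

A `--re-add` is preferable to `--add` here: if the write-intent bitmap is intact, only blocks written while the drive was absent are resynced.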

If you chose A:

[Result: Requires physical access and potentially a server shutdown. If the datacenter is remote, this takes hours. Try software reset first.]

If you chose C:

[Result: Temperatures are within spec — 42C ambient, 55C on NVMe drives. Thermal is not the issue. 5 minutes wasted.]
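For reference, the check itself is quick. A sketch assuming nvme-cli is installed, looping over the three surviving drives:

```shell
# Read controller temperature from SMART on each remaining drive
# (the incident's readings were around 55C)
for d in /dev/nvme0 /dev/nvme1 /dev/nvme3; do
  echo "== $d =="
  nvme smart-log "$d" | grep -i '^temperature'
done
```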

If you chose D:

[Result: Premature — the drive might be perfectly healthy. The PCIe bus is the suspect, not the drive.]

Round 3: Root Cause Identification

[Pressure cue: "Drive is back, RAID rebuilding. Why did the PCIe link drop?"]

What you see: Root cause: PCIe Active State Power Management (ASPM) caused the NVMe controller to enter a low-power state it could not recover from. This is a known issue with this NVMe model + kernel version combination. The kernel parameter pcie_aspm=off is the recommended workaround.

Choose your action:

- A) Add pcie_aspm=off to the kernel boot parameters
- B) Update the NVMe firmware to a version that handles ASPM correctly
- C) Disable ASPM only for this specific PCIe device
- D) Both A and B for defense in depth

If you chose D:

[Result: Kernel parameter added to GRUB for immediate protection. NVMe firmware update scheduled for next maintenance window to fix the root cause. Proceed to Round 4.]

If you chose A:

[Result: Quick fix but disables ASPM globally — slight increase in power consumption. Fine for servers.]
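Persisting the parameter might look like this on a GRUB2 Debian/Ubuntu-style system (a sketch; the config path and regeneration command vary by distro):

```shell
# Append pcie_aspm=off to the default kernel command line, then regenerate
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 pcie_aspm=off"/' /etc/default/grub
sudo update-grub    # RHEL/Fedora: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
# Takes effect on the next boot; verify with: grep pcie_aspm /proc/cmdline
```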

If you chose B:

[Result: Correct long-term fix but requires a reboot to apply. Schedule it.]

If you chose C:

[Result: Possible via sysfs but does not survive reboots without a udev rule. Fragile.]
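For completeness, a sketch of what option C involves, assuming a kernel new enough (5.5+) to expose the per-link `link/*_aspm` attributes; the PCIe address and rule filename are illustrative:

```shell
# One-shot: disable L1 ASPM for this device (lost on reboot)
echo 0 > /sys/bus/pci/devices/0000:3b:00.0/link/l1_aspm

# To reapply at boot, a udev rule can set the attribute on device add
cat > /etc/udev/rules.d/99-nvme2-aspm.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:3b:00.0", ATTR{link/l1_aspm}="0"
EOF
```

This is the fragility the result describes: the rule is tied to one PCIe address, which can change if cards are moved.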

Round 4: Remediation

[Pressure cue: "RAID rebuild complete. Verify and close."]

Actions:

1. Verify RAID is healthy: cat /proc/mdstat — all 4 devices active
2. Verify NVMe SMART: nvme smart-log /dev/nvme2n1 — no media errors
3. Verify kernel parameter: grep pcie_aspm /proc/cmdline
4. Check the fleet for the same NVMe model + kernel combination
5. Schedule firmware updates for affected drives across the fleet
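The kernel-parameter check is the one most often skipped; shown here against a sample command line so the logic is explicit (on the node, substitute the contents of /proc/cmdline):

```shell
# Sample boot command line; kernel version and root device are illustrative
cmdline='BOOT_IMAGE=/vmlinuz-5.15.0-91-generic root=/dev/md0 ro pcie_aspm=off'

# Match the whole token, padded with spaces, to avoid partial-word hits
case " $cmdline " in
  *" pcie_aspm=off "*) echo "OK: ASPM disabled at boot" ;;
  *)                   echo "WARNING: pcie_aspm=off missing" ;;
esac
```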

Damage Report

  • Total downtime: 0 (RAID was degraded, not down)
  • Blast radius: Single compute node; analytics pipeline ran at 60% throughput for 45 minutes
  • Optimal resolution time: 15 minutes (PCIe reset -> RAID re-add -> kernel parameter)
  • If every wrong choice was made: 4+ hours including unnecessary drive replacement and server downtime

Cross-References