Incident Replay: NVMe Drive Disappeared¶
Setup¶
- System context: High-performance compute node with 4x NVMe SSDs in a software RAID-10 array. The node runs real-time analytics workloads.
- Time: Monday 05:30 UTC
- Your role: On-call SRE / storage engineer
Round 1: Alert Fires¶
[Pressure cue: "RAID monitoring fires — /dev/md0 is degraded. One NVMe device has vanished from the kernel. Analytics pipeline is running slower."]
What you see:
cat /proc/mdstat shows md0 is degraded with 3 of 4 devices active. lsblk shows only 3 NVMe devices — /dev/nvme2n1 is missing. dmesg shows "nvme nvme2: controller is down."
Choose your action:
- A) Immediately hot-swap the NVMe drive
- B) Check dmesg and NVMe SMART data for more context
- C) Reboot the server to re-detect the drive
- D) Start a RAID rebuild with a spare drive
If you chose A:¶
[Result: You pull the drive but it might have been a transient PCIe issue — the drive itself could be fine. You have committed to a replacement unnecessarily.]
If you chose B (recommended):¶
[Result: dmesg shows "PCIe Bus Error: severity=Corrected, type=Physical Layer" followed by "nvme2: controller reset failed." SMART data from the remaining drives is healthy. The issue is on the PCIe bus, not the drive media. Proceed to Round 2.]
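The recommended triage can be sketched as a handful of read-only commands; the device names and the PCIe address below are placeholders for whatever lsblk and lspci report on the affected node.

```shell
# Read-only triage: nothing here modifies state.
dmesg -T | grep -iE 'nvme|pcie bus error' | tail -n 20   # correlate controller resets with AER messages
nvme list                                                # which controllers the kernel still sees
for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme3n1; do      # the three surviving members (placeholder names)
    nvme smart-log "$d" | grep -iE 'critical_warning|media_errors|temperature'
done
# AER capability details for the suspect slot (address is a placeholder; take it from lspci):
lspci -vvv -s 0000:05:00.0 | grep -iA6 'advanced error reporting'
```

Because every command here only reads state, this triage is safe to run on a degraded array before committing to any repair.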
If you chose C:¶
[Result: Risky — if the drive is truly dead, reboot may hang on storage initialization. Also causes full service downtime instead of just degraded RAID.]
If you chose D:¶
[Result: You do not have a spare NVMe in this server. And the original drive might be fine — it is a PCIe issue.]
Round 2: First Triage Data¶
[Pressure cue: "RAID is degraded but functional. Analytics pipeline is 40% slower. Product team wants an ETA."]
What you see: PCIe correctable errors preceded the NVMe controller reset. The NVMe drive is physically present but the PCIe link is down. This could be a loose drive in the M.2/U.2 slot, a PCIe lane failure, or a thermal issue.
Choose your action:
- A) Reseat the NVMe drive in its slot
- B) Try to reset the PCIe bus from the OS: echo 1 > /sys/bus/pci/devices/<addr>/reset
- C) Check the server's internal temperature sensors
- D) Replace the NVMe drive with a spare
If you chose B (recommended):¶
[Result: PCIe bus reset re-enumerates the device. nvme list shows /dev/nvme2n1 is back. mdadm --manage /dev/md0 --re-add /dev/nvme2n1p1 re-adds it to the RAID. Rebuild starts. Proceed to Round 3.]
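A minimal sketch of the reset-and-re-add sequence, assuming the failed controller is nvme2 and its PCIe address resolves to 0000:05:00.0 (both placeholders; confirm via sysfs before writing anything, and run as root):

```shell
# Resolve the PCIe address behind the dead controller (path may be gone if the device fully dropped):
readlink /sys/class/nvme/nvme2/device            # ends in something like 0000:05:00.0

# Function-level reset via sysfs; substitute the address found above:
echo 1 > /sys/bus/pci/devices/0000:05:00.0/reset

# Heavier fallback if the device never re-enumerates:
# echo 1 > /sys/bus/pci/devices/0000:05:00.0/remove && echo 1 > /sys/bus/pci/rescan

nvme list | grep nvme2                           # confirm /dev/nvme2n1 is back
mdadm --manage /dev/md0 --re-add /dev/nvme2n1p1  # re-add the member partition
watch -n5 cat /proc/mdstat                       # monitor the rebuild
```

--re-add (rather than --add) lets md use the write-intent bitmap, if one exists, to resync only the blocks that changed while the device was missing.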
If you chose A:¶
[Result: Requires physical access and potentially a server shutdown. If the datacenter is remote, this takes hours. Try software reset first.]
If you chose C:¶
[Result: Temperatures are within spec — 42C ambient, 55C on NVMe drives. Thermal is not the issue. 5 minutes wasted.]
If you chose D:¶
[Result: Premature — the drive might be perfectly healthy. The PCIe bus is the suspect, not the drive.]
Round 3: Root Cause Identification¶
[Pressure cue: "Drive is back, RAID rebuilding. Why did the PCIe link drop?"]
What you see:
Root cause: PCIe Active State Power Management (ASPM) caused the NVMe controller to enter a low-power state it could not recover from. This is a known issue with this NVMe model + kernel version combination. The kernel parameter pcie_aspm=off is the recommended workaround.
Choose your action:
- A) Add pcie_aspm=off to the kernel boot parameters
- B) Update the NVMe firmware to a version that handles ASPM correctly
- C) Disable ASPM only for this specific PCIe device
- D) Both A and B for defense in depth
If you chose D (recommended):¶
[Result: Kernel parameter added to GRUB for immediate protection. NVMe firmware update scheduled for next maintenance window to fix the root cause. Proceed to Round 4.]
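One way to apply both halves of the fix, sketched for a Debian/Ubuntu-style GRUB layout (the file paths, firmware image name, and slot/action values are assumptions; follow the drive vendor's documented firmware procedure):

```shell
# Immediate protection: persist pcie_aspm=off across reboots.
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 pcie_aspm=off"/' /etc/default/grub
update-grub        # on RHEL-style systems: grub2-mkconfig -o /boot/grub2/grub.cfg

# Verify after the next reboot:
grep -o 'pcie_aspm=off' /proc/cmdline

# Root-cause fix, during the maintenance window (image name and slot are placeholders):
# nvme fw-download /dev/nvme2 --fw=vendor_fw_image.bin
# nvme fw-commit /dev/nvme2 --slot=1 --action=1
```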
If you chose A:¶
[Result: Quick fix but disables ASPM globally — slight increase in power consumption. Fine for servers.]
If you chose B:¶
[Result: Correct long-term fix but requires a reboot to apply. Schedule it.]
If you chose C:¶
[Result: Possible via sysfs but does not survive reboots without a udev rule. Fragile.]
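For completeness, the udev rule mentioned above might look like this on kernels that expose the per-device ASPM sysfs attributes (the vendor/device IDs are placeholders for the affected model):

```
# /etc/udev/rules.d/99-nvme-aspm.rules
# Disable ASPM L1 only for the affected vendor:device pair (IDs are placeholders).
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x144d", ATTR{device}=="0xa808", ATTR{link/l1_aspm}="0"
```

Even with the rule in place, this remains the fragile option: the attribute path depends on kernel version, which is why the global parameter plus a firmware fix is the recommended combination.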
Round 4: Remediation¶
[Pressure cue: "RAID rebuild complete. Verify and close."]
Actions:
1. Verify RAID is healthy: cat /proc/mdstat — all 4 devices active
2. Verify NVMe SMART: nvme smart-log /dev/nvme2n1 — no media errors
3. Verify kernel parameter: grep pcie_aspm /proc/cmdline
4. Check fleet for same NVMe model + kernel combination
5. Schedule firmware updates for affected drives across the fleet
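The close-out checks above can be scripted roughly as follows; the fleet sweep at the end is illustrative only, since inventory tooling varies.

```shell
# 1. RAID healthy: a 4-device md array shows [4/4] [UUUU] when all members are active.
grep -q '\[UUUU\]' /proc/mdstat && echo "md0: all 4 members active"

# 2. No media errors on the recovered drive:
nvme smart-log /dev/nvme2n1 | grep -iE 'critical_warning|media_errors'

# 3. Workaround active on this boot:
grep -o 'pcie_aspm=off' /proc/cmdline || echo "WARNING: ASPM workaround not active"

# 4-5. Fleet sweep for the same model + kernel combination (placeholder inventory command):
# ansible all -m shell -a "uname -r; nvme list"
```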
Damage Report¶
- Total downtime: 0 (RAID was degraded, not down)
- Blast radius: Single compute node; analytics pipeline ran at 60% throughput for 45 minutes
- Optimal resolution time: 15 minutes (PCIe reset -> RAID re-add -> kernel parameter)
- If every wrong choice was made: 4+ hours including unnecessary drive replacement and server downtime
Cross-References¶
- Primer: Storage Ops
- Primer: Datacenter & Server Hardware
- Primer: Linux Kernel Tuning
- Footguns: Storage Ops