Grading Checklist - GrokDevOps Wiki

Checked lspci output to confirm whether the device is visible on the PCIe bus

Reviewed dmesg for PCIe enumeration errors, AER messages, or NVMe driver failures

Verified NVMe kernel modules are loaded (lsmod | grep nvme)

Checked BMC/iDRAC system event log for hardware fault entries

Considered physical layer issues: drive seating, riser card, PCIe slot failure

Investigated whether BIOS/firmware changes during the patch window affected PCIe bifurcation or slot enablement settings

Proposed a physical reseat of the drive as a diagnostic step

Identified the need to test the slot with another device or the drive in another slot to isolate the fault

Addressed the application impact (degraded database cluster) and any immediate mitigation

Mentioned checking NVMe drive health via nvme smart-log if the drive becomes visible again

Considered thermal or power delivery issues as potential causes

Documented the resolution path and whether a drive RMA is needed
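
The first three checks (lspci visibility, dmesg errors, loaded NVMe modules) can be sketched as a small triage script. This is a minimal illustration, not part of the original checklist: the function names and the dmesg keyword pattern are assumptions chosen for the example, and the parsing relies on the usual text layout of `lspci`, `dmesg`, and `lsmod` output.

```python
import re
import subprocess

# Hypothetical keyword heuristic for this sketch; tune it for your environment.
DMESG_PATTERN = re.compile(r"\bAER\b|nvme|PCIe Bus Error", re.IGNORECASE)


def nvme_devices_in_lspci(lspci_output: str) -> list[str]:
    """Return lspci lines whose class description is an NVMe controller."""
    return [
        line
        for line in lspci_output.splitlines()
        if "non-volatile memory controller" in line.lower()
    ]


def suspicious_dmesg_lines(dmesg_output: str) -> list[str]:
    """Return kernel log lines that mention AER, NVMe, or PCIe bus errors."""
    return [
        line for line in dmesg_output.splitlines() if DMESG_PATTERN.search(line)
    ]


def nvme_modules_loaded(lsmod_output: str) -> list[str]:
    """Return loaded module names starting with 'nvme' (header line skipped)."""
    names = [
        line.split()[0]
        for line in lsmod_output.splitlines()[1:]
        if line.strip()
    ]
    return [name for name in names if name.startswith("nvme")]


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, or '' if it cannot be started."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    devices = nvme_devices_in_lspci(run(["lspci"]))
    print(f"NVMe controllers on the PCIe bus: {len(devices)}")
    for line in devices:
        print(f"  {line}")

    modules = nvme_modules_loaded(run(["lsmod"]))
    print(f"NVMe modules loaded: {', '.join(modules) or 'none listed'}")

    # Reading the kernel log may require root on some systems.
    for line in suspicious_dmesg_lines(run(["dmesg"])):
        print(f"dmesg: {line}")
```

An empty module list does not by itself prove the driver is missing, since a kernel with the NVMe driver built in will not show it in `lsmod`. If the controller is absent from `lspci` entirely, the remaining items (BMC/iDRAC event log, BIOS bifurcation settings, physical reseat, slot swap) are the more likely path to the fault.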
