Questions to Determine
- Is the BMC event log showing the reboots as OS-initiated or hardware-initiated power cycles?
- Are there any thermal events (CPU, inlet, exhaust temperature warnings) preceding the reboots?
- Is the PSU event log clean, or are there power fault/loss events?
- Does `mcelog` show uncorrectable memory errors (UCEs), not just correctable ones?
- Is the BIOS configured to reboot on UCE, or does it halt?
- Could the CMOS battery be failing, causing BIOS settings or the real-time clock to reset intermittently?
- Are there any known firmware bugs for this iDRAC/BIOS version that cause spurious reboots?
- Is there a hardware watchdog timer that could be triggering the reboot?
- Have the power cables and PDU outlets been checked for intermittent contact?
- Is kdump configured and functional? If so, why are no crash dumps being generated?
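
The kdump and watchdog questions above can be partly answered from the running OS. The following is a minimal sketch, assuming a Linux host that exposes `/proc/cmdline`, `/sys/kernel/kexec_crash_loaded`, and `/sys/class/watchdog`. The function names and the report keys are hypothetical, chosen only for illustration, and the script does not replace the distribution's own kdump service status check.

```python
from pathlib import Path
from typing import Optional


def crashkernel_reserved(cmdline: str) -> bool:
    """Return True if the kernel command line carries a crashkernel= parameter."""
    return any(token.startswith("crashkernel=") for token in cmdline.split())


def crash_kernel_loaded(flag: str) -> bool:
    """Return True if the kexec_crash_loaded flag reads '1' (capture kernel loaded)."""
    return flag.strip() == "1"


def read_text(path: str) -> Optional[str]:
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def list_watchdogs(sysfs_dir: str = "/sys/class/watchdog") -> list:
    """List watchdog devices registered with the kernel (empty if none)."""
    root = Path(sysfs_dir)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


def kdump_report() -> dict:
    """Collect the checks into one dict; None means the value could not be read."""
    cmdline = read_text("/proc/cmdline")
    loaded = read_text("/sys/kernel/kexec_crash_loaded")
    return {
        "crashkernel_on_cmdline": None if cmdline is None else crashkernel_reserved(cmdline),
        "crash_kernel_loaded": None if loaded is None else crash_kernel_loaded(loaded),
        "watchdog_devices": list_watchdogs(),
    }


if __name__ == "__main__":
    for check, result in kdump_report().items():
        print(f"{check}: {result}")
```

If both kdump checks pass and there are still no crash dumps, that is consistent with a reset the kernel never had a chance to handle, such as a hardware-initiated power cycle or a watchdog reset, which points back to the BMC, PSU, and watchdog questions in the list.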