# Solution: Server Intermittent Reboots

## Triage
- Fail over the database: Before extensive diagnostics, migrate the primary database role to the standby replica to protect the application.
- Review the iDRAC System Event Log (SEL):
  - `ipmitool sel list` or iDRAC web UI > Lifecycle Controller > System Event Log
  - Filter for events in the last 48 hours
  - Look at the exact event type for each reboot: "Power Cycle" vs. "OS Stop" vs. "Watchdog Reset"
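The SEL review can be scripted. A minimal sketch, using a captured sample instead of live `ipmitool sel list` output (the timestamps, sensor names, and event wording below are illustrative -- real SEL phrasing varies by platform and BMC firmware):

```shell
#!/bin/sh
# Sketch: pull power-cycle events out of SEL output.
# In production you would feed live data instead of the sample:
#   ipmitool sel list | grep -i 'power cycle'
sel_sample='  1 | 06/01/2024 | 03:12:44 | Power Unit #0x01 | Power off/down | Asserted
  2 | 06/02/2024 | 09:55:10 | Memory #0x02 | Correctable ECC | Asserted
  3 | 06/03/2024 | 02:17:31 | Power Unit #0x01 | Power cycle | Asserted'

# Keep only the power-cycle events; each surviving line carries its timestamp,
# which you then compare against your 48-hour window by eye or with date math.
power_cycles=$(printf '%s\n' "$sel_sample" | grep -i 'power cycle')
printf '%s\n' "$power_cycles"
```

The same pipeline works against `ipmitool sel elist` if you want the decoded event descriptions instead of the raw list.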
- Review mcelog:
  - `mcelog --client` -- check for recent events
  - `grep -i uncorrect /var/log/mcelog` -- specifically look for UCEs
  - Note which DIMM slot and bank the errors are on
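Noting which DIMM the errors land on is easier with a per-slot tally. A sketch over illustrative one-line sample records (real mcelog output is multi-line per event and its exact wording depends on DMI decoding; the parsing here assumes the DIMM label is the last field):

```shell
#!/bin/sh
# Sketch: tally correctable-error counts per DIMM.
# With live data you would start from:  grep -i corrected /var/log/mcelog
mcelog_sample='corrected memory error on DIMM_A3
corrected memory error on DIMM_A3
corrected memory error on DIMM_A3
corrected memory error on DIMM_B1'

# Count occurrences of each DIMM label (assumed to be the last field).
per_dimm=$(printf '%s\n' "$mcelog_sample" \
    | awk '{count[$NF]++} END {for (d in count) print d, count[d]}' \
    | sort)
printf '%s\n' "$per_dimm"
```

A single DIMM dominating the tally, as A3 does here, is the pattern that points at a failing module rather than a channel-wide problem.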
- Check if kdump is configured:
  - `systemctl status kdump` -- is the service active?
  - `ls /var/crash/` -- any crash dumps?
  - If no dumps exist for a hardware power cycle, kdump cannot capture it (the CPU loses power instantly)
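The "dumps or no dumps" check itself is one conditional. A sketch, using an empty temporary directory as a stand-in for `/var/crash/` so it runs anywhere:

```shell
#!/bin/sh
# Sketch of the crash-dump check. On a real host, crash_dir would be
# /var/crash; an empty temp dir stands in here so the sketch is self-contained.
crash_dir=$(mktemp -d)

if [ -n "$(ls -A "$crash_dir" 2>/dev/null)" ]; then
    verdict='crash dumps present -- the kernel got far enough to panic (software path)'
else
    verdict='no crash dumps -- consistent with a hardware power cycle'
fi
printf '%s\n' "$verdict"
rmdir "$crash_dir"
```

An empty directory here does not prove kdump is working; pair it with `systemctl status kdump` before drawing the hardware conclusion.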
## Root Cause
The iDRAC SEL reveals "System Power Cycle" events that are not preceded by any OS shutdown sequence. This indicates a hardware-level power cycle, not an OS crash. The mcelog data shows increasing correctable errors on DIMM A3, and the iDRAC SEL also shows "Memory Correctable ECC" events escalating in frequency.
The root cause is a failing DIMM (A3) that is producing intermittent uncorrectable memory errors. The BIOS is configured with "Memory Error Action: Power Cycle" (Dell default), so when a UCE occurs, the server immediately power-cycles rather than halting. The UCE is too sudden for the OS to log anything or for kdump to trigger -- the hardware cuts power before the kernel can react.
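The "no OS shutdown sequence" observation can be checked from login records: `last -x` prints reboot and shutdown events newest-first, so a clean reboot line is immediately followed by its shutdown record, while an uncommanded power cycle leaves a reboot with no shutdown behind it. A sketch over illustrative sample output (field layout mimics `last -x`; real output varies):

```shell
#!/bin/sh
# Sketch: flag reboots that have no preceding shutdown record.
# `last -x` prints newest-first, so the line *after* a reboot entry is the
# event that happened just before it in time. Sample output is illustrative.
last_sample='reboot   system boot  5.14.0  Mon Jun  3 02:18
reboot   system boot  5.14.0  Sun Jun  2 01:05
shutdown system down  5.14.0  Sun Jun  2 01:03
reboot   system boot  5.14.0  Sat Jun  1 12:00'

verdicts=$(printf '%s\n' "$last_sample" | awk '
    # A reboot is pending: the current (older) line tells us how it started.
    pending { print ($1 == "shutdown" ? "clean:" : "UNCOMMANDED:"), saved; pending = 0 }
    /^reboot/ { saved = $0; pending = 1 }
    END { if (pending) print "unknown (no older record):", saved }')
printf '%s\n' "$verdicts"
```

In this sample the Jun 3 reboot is flagged as uncommanded -- exactly the signature the SEL "System Power Cycle" events correspond to.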
## Fix
- Immediate: Disable the failing DIMM in BIOS:
  - Enter BIOS Setup > Memory Settings > Memory Operating Mode
  - Or use iDRAC: `racadm set BIOS.MemSettings.MemOpMode OptimizerMode`
  - Alternatively, physically remove DIMM A3 to eliminate the error source
- Validate: Run the Dell embedded diagnostics (F10 at POST > Memory Test):
  - Run the extended memory test to confirm DIMM A3 is faulty
  - Check adjacent DIMMs in the same channel
- Replace the DIMM: Order a replacement and install during the next maintenance window.
- Change BIOS error behavior (recommended):
  - Set "Memory Error Action" to "Halt" instead of "Power Cycle"
  - This prevents silent reboots and makes hardware memory errors visible
- Verify resolution: After DIMM replacement, monitor for 48+ hours:
  - `mcelog --client` -- should show zero new errors
  - `ipmitool sel list` -- no new power cycle events
  - Restore the database primary role to this server
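The 48-hour verification is easiest against a baseline: note the SEL power-cycle count at replacement time, then check that it has not grown before restoring the primary role. A sketch with hypothetical placeholder counts (in production, `current_count` would come from `ipmitool sel list | grep -ic 'power cycle'`):

```shell
#!/bin/sh
# Sketch: compare the current SEL power-cycle count against a baseline
# recorded when the DIMM was replaced. Both counts below are hypothetical.
baseline_count=3        # hypothetical: count noted at replacement time
current_count=3         # hypothetical: count 48 hours later

new_events=$((current_count - baseline_count))
if [ "$new_events" -eq 0 ]; then
    printf 'OK: no new power-cycle events since replacement\n'
else
    printf 'FAIL: %d new power-cycle events -- do not restore primary role\n' \
        "$new_events"
fi
```

The same baseline-and-diff pattern applies to the mcelog error count; any growth in either counter blocks the failback.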
## Rollback / Safety
- The database is already failed over to the replica; this server is non-critical during diagnostics.
- Removing one DIMM reduces total RAM; verify the database can operate with reduced memory.
- If the memory test passes on DIMM A3, the issue may be the memory channel or CPU socket -- escalate to Dell support.
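The reduced-memory check above is simple arithmetic worth doing before pulling the module. A sketch in which every number is a hypothetical placeholder -- substitute the real DIMM layout and the database's configured memory footprint:

```shell
#!/bin/sh
# Sketch: will the database still fit after one DIMM is removed?
# All values are hypothetical placeholders.
dimm_count=8            # hypothetical: populated slots
dimm_gb=32              # hypothetical: capacity per module, GB
db_memory_gb=200        # hypothetical: DB buffers + connections + OS headroom

remaining_gb=$(( (dimm_count - 1) * dimm_gb ))
printf 'RAM after removing one DIMM: %d GB (database needs %d GB)\n' \
    "$remaining_gb" "$db_memory_gb"
if [ "$remaining_gb" -ge "$db_memory_gb" ]; then
    printf 'OK to run with the DIMM removed\n'
else
    printf 'Insufficient headroom -- shrink the DB memory budget first\n'
fi
```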
## Common Traps
- Assuming it's an OS crash: No kernel panic, no crash dump, no shutdown sequence = this is hardware, not software. Don't waste time analyzing OS logs.
- Ignoring correctable errors: Correctable errors are a leading indicator of imminent DIMM failure. A steady increase in correctable errors almost always precedes UCEs.
- Not checking BIOS error action: The "Power Cycle on UCE" default behavior masks the root cause. Change it to "Halt" so the error screen is visible.
- Running diagnostics under load: If the server is still serving production traffic during memory diagnostics, you risk data corruption from the bad DIMM.
- Replacing the wrong DIMM: DIMM slot naming varies by vendor. Verify A3 in the BIOS/iDRAC maps to the correct physical slot using the server's hardware manual.