Grading Checklist
- Reviewed BMC/iDRAC system event log for the exact event type (power cycle vs. reset vs. OS shutdown)
- Distinguished between OS-level crash and hardware-level power cycle
- Checked `mcelog` for uncorrectable memory errors and correlated with reboot timestamps
- Investigated PSU health and power event logs
- Verified kdump is configured and explained why it may not capture this type of failure
- Considered hardware causes: failing PSU, loose power cable, bad memory DIMM
- Checked for thermal throttling or shutdown events
- Reviewed BIOS settings for behavior on uncorrectable errors
- Proposed diagnostic steps: run memory diagnostics, swap suspect DIMM, check PSU redundancy
- Addressed the database impact and recommended failover while diagnosing
- Considered firmware/BIOS update as part of the resolution
- Mentioned physical inspection: power cable seating, PSU seating, DIMM seating
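
The "correlated with reboot timestamps" item can be illustrated with a minimal sketch. It assumes the error and reboot times have already been extracted from the logs into `datetime` objects. The function name `errors_before_reboots`, the five-minute window, and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def errors_before_reboots(error_times, reboot_times, window=timedelta(minutes=5)):
    """Map each reboot time to the logged error times within `window` before it."""
    matches = {}
    for reboot in reboot_times:
        # Keep only errors that happened at or before the reboot, inside the window.
        preceding = [e for e in error_times if timedelta(0) <= reboot - e <= window]
        matches[reboot] = sorted(preceding)
    return matches


# Hypothetical timestamps, not taken from any real log.
reboots = [datetime(2024, 1, 1, 3, 15, 0), datetime(2024, 1, 2, 11, 40, 0)]
errors = [datetime(2024, 1, 1, 3, 14, 20), datetime(2024, 1, 1, 9, 0, 0)]

result = errors_before_reboots(errors, reboots)
for reboot, errs in result.items():
    label = "error logged shortly before reboot" if errs else "no logged error in window"
    print(reboot.isoformat(), label, len(errs))
```

An empty match does not rule out a hardware cause: an abrupt power loss can stop the OS before it logs anything, which is why the BMC event log and physical inspection items remain on the checklist.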