Grading Checklist: Memory ECC Errors Increasing
A good response must include:
- Used `mcelog --client` or reviewed mcelog logs to identify the failing DIMM location
- Checked EDAC sysfs (`/sys/devices/system/edac/mc/`) for error counts per channel/DIMM
- Used `dmidecode -t memory` to identify the specific DIMM (slot, part number, serial)
- Assessed the error trend -- increasing CE rate is a strong predictor of imminent UE
- Confirmed no uncorrectable errors have occurred yet
- Planned DIMM replacement with a maintenance window (DIMMs are not hot-swappable on most servers)
- Proposed failing over the MySQL replica workload before the maintenance window
- Checked warranty status and initiated RMA/replacement process
- Considered running memtest86+ during the maintenance window to validate the replacement
- Documented the DIMM serial number and failure data for vendor RMA
- Recommended ongoing monitoring of ECC error rates across the fleet
- Assessed whether adjacent DIMMs on the same channel should be proactively tested
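
The EDAC and error-trend items above could be checked with something like the following minimal sketch. It assumes the kernel exposes per-DIMM counters as `dimm_ce_count`, `dimm_ue_count`, and `dimm_label` under `/sys/devices/system/edac/mc/mc*/dimm*/`; older kernels may use a `csrow*` layout instead, which this sketch does not handle. The function names `read_dimm_counts` and `ce_delta` are hypothetical helpers for illustration, not part of any tool.

```python
from pathlib import Path

# Root named in the checklist; the per-DIMM file names below are an
# assumption about the EDAC sysfs layout on the host being inspected.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read(path, default=""):
    """Return the stripped contents of a sysfs file, or default if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_dimm_counts(root=EDAC_ROOT):
    """Collect correctable (CE) and uncorrectable (UE) counts per DIMM."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": _read(dimm / "dimm_label"),
            "ce": int(_read(dimm / "dimm_ce_count", "0") or 0),
            "ue": int(_read(dimm / "dimm_ue_count", "0") or 0),
        })
    return rows


def ce_delta(before, after):
    """Return the CE increase per (mc, dimm) between two snapshots.

    Counters reset on reboot or driver reload, so a value that went
    down is treated as a fresh count rather than a negative delta.
    """
    old = {(r["mc"], r["dimm"]): r["ce"] for r in before}
    deltas = {}
    for r in after:
        key = (r["mc"], r["dimm"])
        prev = old.get(key, 0)
        deltas[key] = r["ce"] - prev if r["ce"] >= prev else r["ce"]
    return deltas
```

Taking two snapshots with `read_dimm_counts()` some interval apart and passing them to `ce_delta` gives a rough per-DIMM CE rate. Any nonzero `ue` value in a snapshot means the "no uncorrectable errors yet" item no longer holds.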