Grading Checklist: Memory ECC Errors Increasing

A good response must include:

  • Used mcelog --client or reviewed mcelog logs to identify the failing DIMM location
  • Checked EDAC sysfs (/sys/devices/system/edac/mc/) for error counts per channel/DIMM
  • Used dmidecode -t memory to identify the specific DIMM (slot, part number, serial)
  • Assessed the error trend -- an increasing correctable error (CE) rate is a strong predictor of an imminent uncorrectable error (UE)
  • Confirmed no uncorrectable errors have occurred yet
  • Planned DIMM replacement with a maintenance window (DIMMs are not hot-swappable on most servers)
  • Proposed failing over the MySQL replica workload before the maintenance window
  • Checked warranty status and initiated RMA/replacement process
  • Considered running memtest86+ during the maintenance window to validate the replacement
  • Documented the DIMM serial number and failure data for vendor RMA
  • Recommended ongoing monitoring of ECC error rates across the fleet
  • Assessed whether adjacent DIMMs on the same channel should be proactively tested
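
For graders who want a concrete picture of the EDAC sysfs check, the following is a minimal Python sketch rather than a required solution. It assumes the per-DIMM layout exposed by recent kernels (mcN/ce_count, mcN/ue_count, and mcN/dimmN/ files such as dimm_label, dimm_ce_count, dimm_ue_count); older kernels may expose csrowN/ directories instead, which this sketch does not read. The names read_edac_counts and _read_int are illustrative and not part of any existing tool.

```python
from pathlib import Path

# Standard EDAC location named in the checklist.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_int(path):
    """Return the integer stored in a sysfs file, or 0 if missing or unparsable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def read_edac_counts(root=EDAC_ROOT):
    """Collect CE/UE counters per memory controller and per DIMM.

    Assumes the dimmN/ layout; returns an empty list if root does not exist.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    report = []
    for mc in sorted(root.glob("mc[0-9]*")):
        dimms = []
        for dimm in sorted(mc.glob("dimm[0-9]*")):
            try:
                label = (dimm / "dimm_label").read_text().strip()
            except OSError:
                label = ""
            dimms.append({
                "dimm": dimm.name,
                "label": label or "(unlabeled)",
                "ce_count": _read_int(dimm / "dimm_ce_count"),
                "ue_count": _read_int(dimm / "dimm_ue_count"),
            })
        report.append({
            "controller": mc.name,
            "ce_count": _read_int(mc / "ce_count"),
            "ue_count": _read_int(mc / "ue_count"),
            "dimms": dimms,
        })
    return report


if __name__ == "__main__":
    for mc in read_edac_counts():
        print(f"{mc['controller']}: CE={mc['ce_count']} UE={mc['ue_count']}")
        for d in mc["dimms"]:
            print(f"  {d['dimm']} [{d['label']}]: CE={d['ce_count']} UE={d['ue_count']}")
```

Running this at intervals and comparing the CE counts gives a rough view of the error trend. The DIMM label can then be matched against dmidecode -t memory output, provided the platform populates labels.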