Solution: Memory ECC Errors Increasing
Triage
- Identify the failing DIMM and error details from the mcelog output.
- Map the error to a physical DIMM slot: cross-reference the mcelog socket/channel/DIMM-slot with dmidecode output.
- Check for any uncorrectable errors.
- Check the trend: compare correctable error counts over time to see whether the rate is rising.
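The counter checks in the triage steps can be sketched in Python. This is a minimal sketch, assuming the kernel's EDAC driver is loaded and exposes per-DIMM directories (`mc*/dimm*`) with `dimm_label`, `dimm_ce_count`, and `dimm_ue_count` files under sysfs; older kernels and some platforms lay this out differently. The function name `read_dimm_counters` is hypothetical.

```python
from pathlib import Path


def _read(path, default=""):
    """Return the stripped contents of a sysfs file, or a default if absent."""
    return path.read_text().strip() if path.is_file() else default


def read_dimm_counters(edac_root="/sys/devices/system/edac/mc"):
    """Return one dict per DIMM with its label and CE/UE counters.

    Assumes the per-DIMM EDAC sysfs layout (mcN/dimmM/...).
    """
    rows = []
    for dimm_dir in sorted(Path(edac_root).glob("mc*/dimm*")):
        if not dimm_dir.is_dir():
            continue
        rows.append({
            "controller": dimm_dir.parent.name,
            "dimm": dimm_dir.name,
            "label": _read(dimm_dir / "dimm_label"),
            "ce_count": int(_read(dimm_dir / "dimm_ce_count", "0")),
            "ue_count": int(_read(dimm_dir / "dimm_ue_count", "0")),
        })
    return rows


if __name__ == "__main__":
    # Print DIMMs with the most correctable errors first.
    for row in sorted(read_dimm_counters(), key=lambda r: r["ce_count"], reverse=True):
        print(row)
```

Any non-zero `ue_count` calls for immediate attention. The EDAC label may be empty or may not match the silkscreen, so the slot still needs to be confirmed against dmidecode.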
Root Cause
A single DIMM (32GB RDIMM in slot A3 on CPU 1) is developing a hardware fault. The increasing rate of correctable ECC errors indicates the DIMM is degrading -- likely a failing cell or row in the DRAM chip. The memory controller is correcting these errors transparently, but the trend strongly predicts an uncorrectable error (UE) will occur if the DIMM is not replaced.
An uncorrectable error on a production database server would cause either:
- A machine check exception (MCE) crash / kernel panic, or
- Silent data corruption (if the error is in non-critical data)
Both outcomes are unacceptable for a database server.
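The "increasing rate" described above can be made concrete by comparing the correctable-error rate across consecutive sampling intervals. The following is a rough sketch, not a vendor-defined threshold: the function names, the three-interval window, and the sample numbers are all illustrative assumptions.

```python
from datetime import datetime


def ce_rates_per_day(samples):
    """Compute errors/day for each consecutive pair of samples.

    samples: list of (datetime, cumulative_ce_count), oldest first.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        days = (t1 - t0).total_seconds() / 86400
        if days <= 0:
            raise ValueError("samples must be strictly increasing in time")
        # Counters restart from zero after a reboot; if the count dropped,
        # treat the new value as the number of errors since the reset.
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append(delta / days)
    return rates


def is_accelerating(rates, window=3):
    """True when the most recent `window` rates are strictly increasing."""
    recent = rates[-window:]
    return len(recent) == window and all(b > a for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Illustrative weekly samples, not taken from a real server.
    samples = [
        (datetime(2024, 1, 1), 2),
        (datetime(2024, 1, 8), 9),
        (datetime(2024, 1, 15), 30),
        (datetime(2024, 1, 22), 93),
    ]
    rates = ce_rates_per_day(samples)
    print(rates, is_accelerating(rates))
```

With these illustrative samples the weekly rate triples each interval, which is the pattern that distinguishes a degrading DIMM from isolated background errors.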
Fix
- Schedule a maintenance window (DIMM replacement requires shutdown):
  - Coordinate with the DBA team to fail over analytics queries to another replica.
  - Remove db-replica-04 from the load balancer or replica set.
  - Schedule a 30-minute maintenance window.
- Document the failing DIMM for RMA. Record: Slot (A3), Part Number, Serial Number, Manufacturer.
- Open a support case with Dell (if under warranty):
  - Provide the DIMM location, serial number, and mcelog output.
  - Request a replacement DIMM (Dell ProSupport ships next business day).
- During the maintenance window:
  - Replace the DIMM in slot A3.
  - Optionally run memtest86+ on the new DIMM (boot from USB).
- Power on and verify that the full memory capacity is detected and that no new ECC errors are reported.
- Post-replacement:
  - Start MySQL and verify replication catches up.
  - Add the server back to the replica set / load balancer.
  - Monitor ECC error counts for 48 hours to confirm the fix.
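The RMA details called for in the steps above can be pulled from `dmidecode -t memory` output. This is a minimal sketch, assuming the usual "Memory Device" blocks with `Key: Value` lines separated by blank lines; `parse_memory_devices` and `rma_record` are hypothetical helper names, and running dmidecode normally requires root.

```python
import subprocess

RMA_FIELDS = ("Locator", "Manufacturer", "Part Number", "Serial Number")


def parse_memory_devices(dmidecode_text):
    """Parse `dmidecode -t memory` text into one dict per Memory Device block."""
    devices = []
    for block in dmidecode_text.split("\n\n"):
        lines = [line.strip() for line in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the header and Physical Memory Array blocks
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        devices.append(fields)
    return devices


def rma_record(devices, locator):
    """Return the RMA fields for the DIMM in the given slot, or None."""
    for dev in devices:
        if dev.get("Locator") == locator:
            return {name: dev.get(name, "") for name in RMA_FIELDS}
    return None


if __name__ == "__main__":
    out = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout
    # "A3" is the slot named in this incident; adjust for other servers.
    print(rma_record(parse_memory_devices(out), "A3"))
```

The same parsed records help confirm the slot before pulling hardware, since the `Locator` value is what should be matched against the physical silkscreen label.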
Rollback / Safety
- If the replacement DIMM also shows errors, the fault may lie in the slot or in the memory channel on the CPU rather than in the DIMM. Try a different slot on the same channel: if the errors persist there, suspect the channel.
- If no replacement DIMM is available immediately, consider offlining the bad DIMM region using the kernel's memory hotplug or page-offlining feature (advanced, last resort).
- The MySQL replica can be rebuilt from scratch if there are concerns about data integrity due to memory errors. Check binary log position before and after.
- Keep the failed DIMM for RMA return -- do not discard it.
Common Traps
- Trap: Ignoring correctable errors because "they're corrected." An increasing CE rate is the #1 predictor of an imminent uncorrectable error.
- Trap: Replacing the wrong DIMM. The mcelog socket/channel/DIMM numbering does not always match the physical silkscreen labels. Always cross-reference with dmidecode.
- Trap: Not failing over the workload before maintenance. A surprise UE during the planning period could crash the server before the scheduled window.
- Trap: Assuming a single CE event is a problem. Isolated CEs at low rates (a few per month) are normal and expected. It is the increasing trend that indicates a failing DIMM.
- Trap: Not checking if other DIMMs from the same manufacturing batch are also showing early signs. If one fails at 26 months, siblings may follow.
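The sibling check in the last trap can be approximated by grouping DIMMs on manufacturer and part number. This is only a rough proxy, since a shared part number does not prove a shared manufacturing batch; serial number ranges or date codes would be needed for that. The sketch assumes each DIMM is a dict with `Locator`, `Manufacturer`, and `Part Number` keys (as dmidecode reports them), and `sibling_dimms` is a hypothetical name.

```python
def sibling_dimms(devices, failing_locator):
    """Return locators of DIMMs sharing manufacturer and part number
    with the failing DIMM (the failing DIMM itself is excluded)."""
    failing = next(
        (d for d in devices if d.get("Locator") == failing_locator), None
    )
    if failing is None:
        return []
    key = (failing.get("Manufacturer"), failing.get("Part Number"))
    return [
        d.get("Locator")
        for d in devices
        if d.get("Locator") != failing_locator
        and (d.get("Manufacturer"), d.get("Part Number")) == key
    ]


if __name__ == "__main__":
    # Illustrative inventory, not real hardware data.
    inventory = [
        {"Locator": "A3", "Manufacturer": "VendorX", "Part Number": "PN-1"},
        {"Locator": "A4", "Manufacturer": "VendorX", "Part Number": "PN-1"},
        {"Locator": "B1", "Manufacturer": "VendorX", "Part Number": "PN-2"},
    ]
    print(sibling_dimms(inventory, "A3"))
```

The resulting slots are the ones whose correctable-error counts deserve a closer look, on this server and on others bought in the same order.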