Solution: Memory ECC Errors Increasing

Triage

Identify the failing DIMM and error details:

```
sudo mcelog --client
# Or check the log
grep -i "corrected memory" /var/log/mcelog

# Check EDAC subsystem
edac-util -s    # driver status
edac-util -v    # per-DIMM error counts
```

Map the error to a physical DIMM slot:
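```
# -A 20 keeps the Serial Number and Part Number lines, which sit more than 10 lines into each block
sudo dmidecode -t memory | grep -A 20 "Memory Device" | grep -E "Locator|Serial|Part|Size"
```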
Cross-reference the mcelog socket/channel/DIMM-slot with dmidecode output.
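To make the cross-reference easier, the dmidecode output can be condensed to one line per DIMM. The following is a minimal sketch: the helper name `dimm_table` is hypothetical, and it assumes the usual dmidecode layout in which each `Memory Device` record is a block of `Key: Value` lines ended by a blank line.

```bash
# Condense `dmidecode -t memory` output (read from stdin) into one
# tab-separated line per DIMM: Locator, Size, Serial Number, Part Number.
# dimm_table is a hypothetical helper name, not part of dmidecode.
dimm_table() {
    awk -F': *' '
        function emit() {
            if (loc != "") printf "%s\t%s\t%s\t%s\n", loc, size, serial, part
        }
        # A new record starts; flush the previous one if it was never closed.
        /^Memory Device/ { if (in_dev) emit(); in_dev = 1; loc = size = serial = part = ""; next }
        # Anchored patterns skip "Bank Locator" and the other "... Size" fields.
        in_dev && /^[ \t]*Locator:/       { loc = $2 }
        in_dev && /^[ \t]*Size:/          { size = $2 }
        in_dev && /^[ \t]*Serial Number:/ { serial = $2 }
        in_dev && /^[ \t]*Part Number:/   { part = $2 }
        # A blank line ends the record.
        in_dev && /^[ \t]*$/              { emit(); in_dev = 0 }
        END { if (in_dev) emit() }
    '
}
```

Typical use would be `sudo dmidecode -t memory | dimm_table`. Empty slots still appear, with a size such as "No Module Installed", which helps confirm the slot numbering against the silkscreen.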

Check for any uncorrectable errors:

```
edac-util --report=ue    # total uncorrectable error count
grep -i "uncorrected\|uncorrectable" /var/log/mcelog
```

Check the trend:

```
# Match the same phrase as above so "uncorrected" lines are not counted
grep -ci "corrected memory" /var/log/mcelog
# Or review timestamps to confirm acceleration
```
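A line count gives a total, not a rate. One way to confirm acceleration is to sample the kernel's EDAC counters at a fixed interval and record the difference between samples. The sketch below assumes the EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`. The function names and the state file are hypothetical.

```bash
# Sum one EDAC counter (ce_count or ue_count) across all memory controllers.
# $1 = counter name, $2 = sysfs root (defaults to the standard EDAC path).
edac_total() {
    local counter="$1" root="${2:-/sys/devices/system/edac/mc}"
    local total=0 f
    for f in "$root"/mc*/"$counter"; do
        [ -r "$f" ] || continue          # skip if no controller matched
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}

# Print how many corrected errors were logged since the previous call.
# $1 = state file that stores the last total (hypothetical location).
ce_delta() {
    local state="$1" root="${2:-/sys/devices/system/edac/mc}"
    local now prev
    now=$(edac_total ce_count "$root")
    prev=$(cat "$state" 2>/dev/null || echo 0)
    echo "$now" > "$state"
    echo $(( now - ${prev:-0} ))
}
```

Run from cron once an hour, for example, `ce_delta /var/tmp/ce_total.state` yields an hourly CE count whose growth over several days shows the trend. The counters reset on reboot or driver reload, so a negative delta means a restart rather than an improvement.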

Root Cause

A single DIMM (32GB RDIMM in slot A3 on CPU 1) is developing a hardware fault. The increasing rate of correctable ECC errors indicates the DIMM is degrading -- likely a failing cell or row in the DRAM chip. The memory controller is correcting these errors transparently, but the trend strongly predicts an uncorrectable error (UE) will occur if the DIMM is not replaced.

An uncorrectable error on a production database server would cause either:

- A machine check exception (MCE) crash / kernel panic, or
- Silent data corruption (if the error is in non-critical data)

Both outcomes are unacceptable for a database server.

Fix

Schedule a maintenance window (DIMM replacement requires shutdown):

- Coordinate with the DBA team to fail over analytics queries to another replica.
- Remove db-replica-04 from the load balancer or replica set.
- Schedule a 30-minute maintenance window.

Document the failing DIMM for RMA:
```
sudo dmidecode -t memory | grep -B2 -A8 "Locator: A3"
```
Record: Slot (A3), Part Number, Serial Number, Manufacturer.

Open a support case with Dell (if under warranty):

- Provide the DIMM location, serial number, and mcelog output.
- Request a replacement DIMM (Dell ProSupport ships next business day).

During the maintenance window:

```
# Graceful shutdown
sudo systemctl stop mysql
sudo shutdown -h now
```

Replace the DIMM in slot A3.
Optionally run memtest86+ on the new DIMM (boot from USB).

Power on and verify:

```
sudo dmidecode -t memory | grep -A5 "Locator: A3"
free -h
edac-util -v    # All CE and UE counts should be 0
```

Post-replacement:

- Start MySQL and verify replication catches up.
- Add the server back to the replica set / load balancer.
- Monitor ECC error counts for 48 hours to confirm the fix.
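The "replication catches up" check can be scripted so the server is not returned to the pool too early. This is a sketch under assumptions: the helper name `replication_caught_up` is hypothetical, and it expects the vertical (`\G`) output of `SHOW SLAVE STATUS` or, on newer MySQL versions, `SHOW REPLICA STATUS`, whose field names differ slightly.

```bash
# Read `SHOW SLAVE STATUS\G` / `SHOW REPLICA STATUS\G` output on stdin.
# Succeed (exit 0) only if both replication threads are running and the
# reported lag is at most $1 seconds (default 0).
# replication_caught_up is a hypothetical helper name.
replication_caught_up() {
    awk -v max="${1:-0}" -F': *' '
        /^[ \t]*(Slave|Replica)_IO_Running:/     { io  = $2 }
        /^[ \t]*(Slave|Replica)_SQL_Running:/    { sql = $2 }
        /^[ \t]*Seconds_Behind_(Master|Source):/ { lag = $2 }
        END {
            ok = (io == "Yes" && sql == "Yes" && lag != "" && lag != "NULL" && lag + 0 <= max)
            exit !ok
        }
    '
}
```

For example, `mysql -e 'SHOW SLAVE STATUS\G' | replication_caught_up 30` succeeds once the replica is within 30 seconds of its source. A lag of `NULL` is treated as not caught up, since it usually means a replication thread is stopped.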

Rollback / Safety

- If the replacement DIMM also shows errors, the issue may be the memory channel on the CPU. Try a different slot on the same channel.
- If no replacement DIMM is available immediately, consider offlining the bad DIMM region using the kernel's memory hotplug or page-offlining feature (advanced, last resort).
- The MySQL replica can be rebuilt from scratch if there are concerns about data integrity due to memory errors. Check binary log position before and after.
- Keep the failed DIMM for RMA return -- do not discard it.
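For the page-offlining option, kernels built with memory-failure support expose `/sys/devices/system/memory/soft_offline_page`. Writing a physical address to it asks the kernel to migrate and retire the page containing that address. The sketch below is a thin wrapper that validates the address first. The function name is hypothetical, and it assumes the failing physical address has already been taken from the error report (for example, the address field of an mcelog record).

```bash
# Ask the kernel to soft-offline the page containing a physical address.
# $1 = physical address in hex (e.g. 0x1a2b3c000)
# $2 = sysfs file to write to (defaults to the kernel interface; needs root)
# soft_offline_page is a hypothetical wrapper name.
soft_offline_page() {
    local addr="$1" target="${2:-/sys/devices/system/memory/soft_offline_page}"
    local hex="${addr#0x}"
    # Require a 0x prefix followed by at least one hex digit and nothing else.
    case "$hex" in
        ''|*[!0-9a-fA-F]*) hex="" ;;
    esac
    if [ -z "$hex" ] || [ "$hex" = "$addr" ]; then
        echo "expected a hex physical address such as 0x1a2b3c000" >&2
        return 2
    fi
    printf '%s\n' "$addr" > "$target"
}
```

This retires one page at a time, so it only buys time when the errors are confined to a few addresses. It does not replace swapping the DIMM.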

Common Traps

- Trap: Ignoring correctable errors because "they're corrected." A rising CE rate is a strong predictor of an upcoming uncorrectable error.
- Trap: Replacing the wrong DIMM. The mcelog socket/channel/DIMM numbering does not always match the physical silkscreen labels. Always cross-reference with dmidecode.
- Trap: Not failing over the workload before maintenance. A surprise UE during the planning period could crash the server before the scheduled window.
- Trap: Assuming a single CE event is a problem. Isolated CEs at low rates (a few per month) are normal and expected. It is the increasing trend that indicates a failing DIMM.
- Trap: Not checking whether other DIMMs from the same manufacturing batch are also showing early signs. If one fails at 26 months, siblings may follow.
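To check sibling DIMMs for early signs, the per-DIMM counters in sysfs can be listed directly. This is a sketch that assumes a kernel whose EDAC driver exposes `dimm*/dimm_ce_count` and `dimm*/dimm_label` under each memory controller (older kernels use a `csrow*` layout instead). The function name is hypothetical.

```bash
# List every DIMM with a non-zero corrected-error count as
# "<controller/dimm> <label> <count>", tab-separated.
# $1 = sysfs root (defaults to the standard EDAC path).
# dimm_ce_report is a hypothetical helper name.
dimm_ce_report() {
    local root="${1:-/sys/devices/system/edac/mc}" d label ce
    for d in "$root"/mc*/dimm*; do
        [ -r "$d/dimm_ce_count" ] || continue
        ce=$(cat "$d/dimm_ce_count")
        label=$(cat "$d/dimm_label" 2>/dev/null || true)
        if [ "$ce" -gt 0 ]; then
            printf '%s\t%s\t%s\n' "${d#"$root"/}" "${label:-unlabeled}" "$ce"
        fi
    done
    return 0
}
```

Any DIMM other than the known-bad one that appears in this output is worth tracking, especially if its part number matches the failing module's.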