Questions: Memory ECC Errors Increasing
- Which specific DIMM slot is generating the errors?
- Are the errors on a single DIMM or spread across multiple DIMMs?
- Is the correctable error rate accelerating (a possible trend toward uncorrectable errors)?
- What is the DIMM part number, serial number, and manufacturer?
- Is the server under warranty and eligible for DIMM replacement?
- Can the workload be failed over to another replica before replacement?
- Does replacing the DIMM in that slot require a server shutdown (that is, the slot is not hot-swappable)?
- Are there any uncorrectable errors (UEs) in addition to correctable errors (CEs)?
- Could this be caused by a memory controller issue rather than the DIMM itself?
- What does the EDAC (Error Detection and Correction) subsystem report in sysfs?
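
Several of the questions above (which DIMM, CE versus UE counts, whether the rate is rising) can often be answered from the EDAC counters in sysfs. The following is a minimal sketch, not a definitive tool. It assumes a Linux host with an EDAC driver loaded and the per-DIMM layout under `/sys/devices/system/edac/mc/mc*/dimm*/`; older kernels or some drivers expose `csrow*` directories instead, which this sketch does not read. The helper names `read_dimm_counts` and `ce_delta` are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read(path, default=""):
    """Return the stripped contents of a sysfs attribute, or a default if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_dimm_counts(root=EDAC_ROOT):
    """Map 'mcX/dimmY' to its label, location, and CE/UE counters."""
    counts = {}
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        key = f"{dimm.parent.name}/{dimm.name}"
        counts[key] = {
            "label": _read(dimm / "dimm_label"),
            "location": _read(dimm / "dimm_location"),
            "ce": int(_read(dimm / "dimm_ce_count", "0") or 0),
            "ue": int(_read(dimm / "dimm_ue_count", "0") or 0),
        }
    return counts


def ce_delta(previous, current):
    """Return the per-DIMM increase in correctable errors between two snapshots."""
    return {
        key: entry["ce"] - previous.get(key, {}).get("ce", 0)
        for key, entry in current.items()
    }


if __name__ == "__main__":
    # Print one line per DIMM; an empty result suggests no EDAC driver is loaded.
    for key, entry in read_dimm_counts().items():
        print(key, entry)
```

Taking two snapshots some time apart and passing them to `ce_delta` gives a rough view of whether one DIMM is accumulating correctable errors faster than the others. The counters normally reset on reboot or driver reload, so a negative delta most likely means a reset rather than a recovery. Errors spread across all DIMMs on one controller may point toward the memory controller rather than a single module.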