Symptoms: Memory ECC Errors Increasing
- Server db-replica-04 (Dell PowerEdge R740, 384GB RAM) is logging correctable ECC memory errors.
- The error rate has been increasing: 12 errors last week, 47 errors yesterday, 89 errors today so far.
- No application-visible impact yet -- correctable errors are silently fixed by the memory controller.
- Monitoring picked this up via mcelog daemon alerts.
- The server runs a MySQL read replica handling analytics queries.
- The server has 12x 32GB DDR4 RDIMMs across two memory channels per CPU.
- The server has been in production for 26 months.
- No other servers in the cluster are showing elevated ECC error counts.
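
The reported counts cover periods of different lengths, so it can help to normalize them to a per-day rate before judging the trend. The following is a minimal sketch, not part of the original monitoring setup. It assumes "last week" means a 7-day window with errors spread evenly across it, and it treats today's count as a partial day. The names `observations` and `growth_factors` are illustrative.

```python
# Correctable ECC error counts taken from the symptom list above.
# Assumption: "last week" is a 7-day window with errors spread evenly;
# today's figure is a partial-day count, so its true daily rate is higher.
observations = [
    ("last week (daily average)", 12 / 7),
    ("yesterday", 47),
    ("today (so far)", 89),
]


def growth_factors(rates):
    """Return the ratio of each rate to the rate immediately before it."""
    return [later / earlier for earlier, later in zip(rates, rates[1:])]


rates = [rate for _, rate in observations]
factors = growth_factors(rates)

for label, rate in observations:
    print(f"{label:28s} {rate:6.1f} errors/day")
for factor in factors:
    print(f"growth vs. previous period: {factor:.1f}x")
```

Under these assumptions, yesterday's count is roughly 27 times last week's daily average, and today's partial count is already about 1.9 times yesterday's full-day total. The increase is therefore accelerating rather than drifting upward slowly.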