Skip to content

Symptoms: Memory ECC Errors Increasing

  • Server db-replica-04 (Dell PowerEdge R740, 384GB RAM) is logging correctable ECC memory errors.
  • The error rate has been increasing: 12 errors last week, 47 errors yesterday, 89 errors today so far.
  • No application-visible impact yet -- correctable errors are silently fixed by the memory controller.
  • Monitoring picked this up via mcelog daemon alerts.
  • The server runs a MySQL read replica handling analytics queries.
  • The server has 12x 32GB DDR4 RDIMMs across two memory channels per CPU.
  • The server has been in production for 26 months.
  • No other servers in the cluster are showing elevated ECC error counts.