
Incident Replay: Memory ECC Errors Increasing

Setup

  • System context: Production database server (Dell R640, 512GB RAM) with ECC memory. EDAC (Error Detection and Correction) counters have been incrementing for the past 48 hours.
  • Time: Thursday 08:00 UTC
  • Your role: Systems engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "Hardware monitoring alert — correctable ECC error count on prod-db-05 exceeded threshold (100 errors in 24 hours). No service impact yet, but hardware team wants triage before the rate accelerates."]

What you see: edac-util --status shows correctable errors on DIMM_A3 accumulating at ~5/hour. mcelog confirms single-bit correctable errors on the same DIMM. No uncorrectable errors yet.
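The per-DIMM rate can be quantified by sampling the EDAC counters twice and dividing by the interval. A minimal sketch, assuming the modern EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count`; older kernels expose `csrow*/ce_count` instead) — the helper names are illustrative:

```python
import glob


def read_ce_counts():
    """Read per-DIMM correctable-error counts from EDAC sysfs.

    Assumes the dimm_ce_count layout used by recent kernels; returns
    an empty dict on systems without EDAC sysfs entries.
    """
    counts = {}
    for path in glob.glob("/sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count"):
        with open(path) as f:
            counts[path] = int(f.read().strip())
    return counts


def hourly_rate(prev_count, curr_count, elapsed_seconds):
    """Correctable errors per hour between two counter samples."""
    return (curr_count - prev_count) * 3600.0 / elapsed_seconds
```

Sampling every 10 minutes for an hour gives enough points to tell a steady trickle from a climbing rate.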

Choose your action:

  - A) Immediately migrate workloads and shut down the server for DIMM replacement
  - B) Monitor the error rate trend and check if it is accelerating
  - C) Reboot the server to clear the ECC counters
  - D) Disable the DIMM in BIOS to prevent potential uncorrectable errors

If you chose A:

[Result: Workload migration takes 30 minutes. The DIMM may have been fine for weeks at this rate. Premature — you have not assessed urgency.]

If you chose B:

[Result: Checking the last 48 hours of mcelog data shows the error rate is doubling every 12 hours: 2/hr -> 5/hr -> trending to 10/hr. An accelerating error rate indicates a failing DIMM — not a cosmic ray event. Proceed to Round 2.]
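The doubling time can be estimated from any two rate samples; a sketch (function name is illustrative, and it assumes roughly exponential growth):

```python
import math


def doubling_time_hours(hours_between, rate_then, rate_now):
    """Estimate the doubling time of an exponentially growing error
    rate from two rate samples taken hours_between apart."""
    return hours_between * math.log(2) / math.log(rate_now / rate_then)
```

Going from 5/hr to 10/hr over 12 hours gives a 12-hour doubling time; a shrinking doubling time means the failure is accelerating.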

If you chose C:

[Result: Reboot clears the counters but the errors resume immediately. You have lost your trend data and gained nothing. 5 minutes of downtime wasted.]

If you chose D:

[Result: Disabling the DIMM removes 64GB of RAM (512GB -> 448GB). The database uses all available memory for buffer cache. Performance degradation is immediate and significant.]

Round 2: First Triage Data

[Pressure cue: "Error rate is accelerating. If this becomes an uncorrectable error, the server will crash with a machine check exception."]

What you see: Error rate doubling every 12 hours. All errors on the same DIMM (DIMM_A3, slot 3, channel A). iDRAC hardware log confirms "Correctable memory error rate exceeded" for this DIMM. A spare DIMM is available in datacenter stock.
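Extrapolating the observed doubling makes the urgency concrete. A sketch, assuming clean exponential growth (which real DIMM failures only approximate):

```python
def projected_rate(current_rate, doubling_hours, horizon_hours):
    """Project an exponentially growing error rate forward in time."""
    return current_rate * 2 ** (horizon_hours / doubling_hours)
```

At 10/hr doubling every 12 hours, the projection reaches 80/hr within 36 hours and 640/hr in 3 days, which frames the choice between waiting for the scheduled window and acting now.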

Choose your action:

  - A) Schedule DIMM replacement in the next maintenance window (3 days away)
  - B) Plan an emergency maintenance: migrate workload, replace DIMM, restore
  - C) Swap the DIMM hot — this server supports memory hot-plug
  - D) Run memtest86 to confirm the DIMM is actually faulty

If you chose B:

[Result: You schedule an emergency maintenance for tonight. Live-migrate the database to the replica, replace DIMM_A3, run memory diagnostics, restore. 2-hour window. Proceed to Round 3.]

If you chose A:

[Result: At the current doubling rate, the error rate will pass 80/hr within 36 hours and exceed 600/hr by the maintenance window. Risk of an uncorrectable error (MCE/kernel panic) in the interim is significant. Too long to wait.]

If you chose C:

[Result: This server model does not actually support memory hot-plug in this configuration. Attempting it could cause a crash.]

If you chose D:

[Result: memtest86 requires a reboot and takes 4+ hours for 512GB. Production downtime for a diagnostic when the evidence already points to a bad DIMM. Overkill for triage.]

Round 3: Root Cause Identification

[Pressure cue: "Emergency maintenance approved. Execute the plan."]

What you see: Root cause identified as physical DIMM degradation. The DIMM is 3 years old (within Dell's 5-year warranty). This is normal hardware aging, not a systemic issue. No other DIMMs on this server or across the fleet show elevated errors.

Choose your action:

  - A) Replace DIMM_A3 and run built-in diagnostics before restoring workload
  - B) Replace DIMM_A3 and immediately restore workload
  - C) Replace all DIMMs in channel A as a precaution
  - D) Replace DIMM_A3 and order spares for all servers of the same age

If you chose A:

[Result: DIMM replaced. Dell embedded diagnostics (F10 at boot) run memory test — passes clean. Server brought back online, workload migrated back. Proceed to Round 4.]

If you chose B:

[Result: This works, but if the replacement DIMM is dead on arrival you will need a second maintenance window. Always verify after hardware changes.]

If you chose C:

[Result: Replacing healthy DIMMs is wasteful and extends the maintenance window. Only the failing DIMM needs replacement.]

If you chose D:

[Result: Ordering spares is wise but replacing on suspicion without evidence is wasteful. Order spares, replace only on evidence.]

Round 4: Remediation

[Pressure cue: "Server is back. Verify everything."]

Actions:

  1. Verify zero ECC errors post-replacement: edac-util --status
  2. Verify full memory available: free -h shows 512GB
  3. Monitor for 24 hours to confirm stable
  4. File Dell warranty claim for the failed DIMM
  5. Review ECC monitoring thresholds — consider alerting at 10 errors/24hr for earlier detection
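The soak check can be scripted so verification is mechanical rather than eyeballed. A sketch with hypothetical names, operating on values you would sample from edac-util and free:

```python
def verify_post_replacement(ce_counts, mem_total_gb, expected_gb=512):
    """Return a list of verification failures (empty list = all clear).

    ce_counts: {dimm_label: correctable_error_count} sampled after the
    24-hour soak; mem_total_gb: total RAM as reported by the OS.
    """
    failures = []
    for dimm, count in ce_counts.items():
        if count > 0:
            failures.append(f"{dimm}: {count} correctable errors after replacement")
    if mem_total_gb < expected_gb:
        failures.append(f"memory shortfall: {mem_total_gb}GB < {expected_gb}GB")
    return failures
```

An empty list closes the incident; any entry reopens the maintenance.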

Damage Report

  • Total downtime: 25 minutes (planned maintenance, workload migrated)
  • Blast radius: Single server; database served from replica during maintenance
  • Optimal resolution time: 2 hours (trend analysis -> plan -> execute -> verify)
  • If every wrong choice was made: 6+ hours plus risk of unplanned kernel panic from uncorrectable error

Cross-References