Incident Replay: Server Intermittent Reboots¶
Setup¶
- System context: Production application server that has rebooted unexpectedly 3 times in the past week. No pattern in timing. Workload is a Java application with 64GB heap.
- Time: Monday 14:20 UTC
- Your role: On-call SRE / systems engineer
Round 1: Alert Fires¶
[Pressure cue: "Server app-prod-11 just rebooted again — third time this week. Application team is losing confidence. They want root cause before the next one."]
What you see:
last reboot shows 3 unplanned reboots in 7 days at irregular intervals — roughly 2 to 3 days apart. journalctl --list-boots shows no graceful shutdown entries — the reboots are hard resets.
Choose your action:
- A) Check system logs for kernel panics or MCE (Machine Check Exception) events
- B) Check the application logs for crash-before-reboot patterns
- C) Replace the server hardware preemptively
- D) Add a watchdog timer to capture more data on the next reboot
If you chose A (recommended):¶
[Result: journalctl -b -1 -p 0..3 from the last boot shows nothing — the log ends abruptly. But mcelog --client shows 3 Machine Check Exception events correlating with the reboot times. MCE type: "corrected memory error threshold exceeded, escalated to fatal." Proceed to Round 2.]
If you chose B:¶
[Result: Application logs show no errors before the reboots — the process was killed mid-operation. This is a hardware-level event, not an application crash.]
If you chose C:¶
[Result: You have not identified the component. Replacing the whole server is wasteful and takes hours of migration.]
If you chose D:¶
[Result: Watchdog timer is useful for future diagnosis but does not explain the 3 reboots that already happened. mcelog already has the data.]
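The correlation that closes Round 1 — hard resets lining up with MCE timestamps — can be checked mechanically. A minimal sketch using sample timestamps (illustrative stand-ins for last reboot and mcelog output, not real data; assumes GNU date for -d parsing):

```shell
#!/bin/sh
# Sample reboot times (as reported by `last reboot`) and MCE event times
# (as reported by mcelog). All values below are illustrative.
cat > /tmp/reboots.txt <<'EOF'
2024-03-04 02:13
2024-03-06 11:47
2024-03-09 05:02
EOF
cat > /tmp/mce.txt <<'EOF'
2024-03-04 02:12
2024-03-06 11:46
2024-03-09 05:01
EOF

# For each reboot, report any MCE logged in the preceding hour.
MATCHES=$(while read -r r; do
  rs=$(date -d "$r" +%s)
  while read -r m; do
    ms=$(date -d "$m" +%s)
    d=$((rs - ms))
    if [ "$d" -ge 0 ] && [ "$d" -le 3600 ]; then
      echo "reboot $r preceded by MCE at $m"
    fi
  done < /tmp/mce.txt
done < /tmp/reboots.txt)
echo "$MATCHES"
```

Three matches out of three reboots is the signal that this is a hardware event, not an application crash.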
Round 2: First Triage Data¶
[Pressure cue: "MCE events found. What hardware component is failing?"]
What you see:
MCE events point to memory bank 3, DIMM slot C2. edac-util -v shows 847 correctable errors on this DIMM. The iDRAC hardware log shows "Memory ECC threshold exceeded" entries matching the reboot times.
Choose your action:
- A) Schedule immediate DIMM replacement
- B) Run memtest86 to confirm the DIMM is faulty
- C) Disable the DIMM in BIOS to prevent further reboots until replacement
- D) Increase the MCE error threshold to prevent the escalation to fatal
If you chose A (recommended):¶
[Result: Plan the replacement: migrate workload to another server, replace DIMM C2, run diagnostics, restore. The MCE data is clear — the DIMM is failing. No need for additional testing. Proceed to Round 3.]
If you chose B:¶
[Result: memtest86 takes 6+ hours on a 256GB server and requires downtime. The MCE data already confirms the issue. Unnecessary.]
If you chose C:¶
[Result: Most server BIOSes cannot disable a single DIMM; the practical equivalent is pulling it, which removes 32GB, can violate the channel-population rules the memory controller expects, and requires the same downtime window as a replacement. Not a viable workaround.]
If you chose D:¶
[Result: Increasing the threshold masks the problem. The DIMM will continue to degrade and may produce uncorrectable errors that corrupt data silently.]
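Once EDAC data is in play, the failing module is simply the one with the outsized correctable-error count. A minimal sketch that ranks DIMMs by count, using sample lines shaped like edac-util -v output (the exact format varies by EDAC driver and platform):

```shell
#!/bin/sh
# Sample per-DIMM correctable-error counts; real output format varies by driver.
cat > /tmp/edac.txt <<'EOF'
mc0: csrow0: DIMM_A1: 0 Corrected Errors
mc0: csrow1: DIMM_B1: 2 Corrected Errors
mc0: csrow2: DIMM_C2: 847 Corrected Errors
mc0: csrow3: DIMM_D2: 1 Corrected Errors
EOF

# Pick the DIMM with the highest count (field 3 = label, field 4 = count).
WORST=$(awk -F': ' '{ n = $4 + 0; if (n > max) { max = n; dimm = $3 } }
                    END { print dimm, max }' /tmp/edac.txt)
echo "$WORST"
```

A count three orders of magnitude above its neighbors, as here, is why no further confirmation testing (option B) is needed.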
Round 3: Root Cause Identification¶
[Pressure cue: "DIMM identified. Execute the fix."]
What you see: Root cause: DIMM C2 is physically failing. Correctable ECC errors accumulated beyond the threshold, causing the kernel's MCE handler to trigger a fatal machine check and force a hardware reset. The server's error policy is set to "panic on threshold", which is correct but aggressive.
Choose your action:
- A) Replace the DIMM and adjust MCE policy to log-and-alert before panic
- B) Replace the DIMM only
- C) Replace the DIMM and add proactive DIMM health monitoring
- D) Replace all DIMMs in the same bank as a precaution
If you chose C (recommended):¶
[Result: DIMM replaced. Proactive monitoring added: alert at 50 correctable errors/24hr (well before the panic threshold of 1000). Early warning prevents future surprise reboots. Proceed to Round 4.]
If you chose A:¶
[Result: Adjusting MCE policy to not panic is dangerous — if the errors become uncorrectable, you get silent data corruption instead of a clean crash.]
If you chose B:¶
[Result: Fixes the immediate issue but does not improve early detection.]
If you chose D:¶
[Result: Replacing healthy DIMMs is wasteful. Only the failing one needs replacement.]
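The early-warning rule from option C reduces to counter arithmetic: diff the correctable-error counter over a 24-hour window and alert well before the panic threshold. A sketch with the thresholds from this scenario and sample counter readings (the counter values are illustrative):

```shell
#!/bin/sh
# Thresholds from this scenario: alert at 50 CE/24h; the kernel panics near 1000.
ALERT_THRESHOLD=50

ce_prev=12   # sample: correctable-error counter reading 24 hours ago
ce_now=75    # sample: counter reading now
delta=$((ce_now - ce_prev))

if [ "$delta" -ge "$ALERT_THRESHOLD" ]; then
  STATUS="ALERT: $delta correctable errors in 24h (threshold $ALERT_THRESHOLD)"
else
  STATUS="OK: $delta correctable errors in 24h"
fi
echo "$STATUS"
```

Alerting on the rate, not the absolute count, is the design choice: a slowly accumulating lifetime count is normal, while a burst inside one day is the degradation pattern that preceded these reboots.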
Round 4: Remediation¶
[Pressure cue: "Server restored. Verify stability."]
Actions:
1. Verify new DIMM is detected: dmidecode -t 17 | grep -A5 "Locator: C2"
2. Verify zero MCE events: mcelog --client
3. Verify zero ECC errors: edac-util -v
4. Monitor for 48 hours before declaring stable
5. File warranty claim for the failed DIMM
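Steps 1-3 above can be wired into a single pass/fail check. A sketch against sample data — the dmidecode excerpt and the zero error count below stand in for live output captured after the replacement:

```shell
#!/bin/sh
# Sample excerpt in the shape of `dmidecode -t 17` output for the replaced slot.
cat > /tmp/dmi.txt <<'EOF'
Memory Device
        Size: 32 GB
        Locator: C2
EOF

ce_count=0   # sample: correctable-error count read after replacement

# PASS requires both: the slot is populated and detected, and counters are clean.
if grep -q "Locator: C2" /tmp/dmi.txt && [ "$ce_count" -eq 0 ]; then
  VERDICT="PASS: DIMM C2 detected, zero correctable errors"
else
  VERDICT="FAIL: re-check DIMM seating and error counters"
fi
echo "$VERDICT"
```

A clean result here still only starts the clock on step 4 — the 48-hour observation window is what actually declares the server stable.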
Damage Report¶
- Total downtime: 3x ~5 minutes (3 unplanned reboots) + 20 minutes planned maintenance
- Blast radius: Application service interruptions during each reboot; potential data integrity concerns for in-flight transactions
- Optimal resolution time: 60 minutes (diagnose MCE -> identify DIMM -> replace)
- If every wrong choice was made: Days of continued intermittent reboots, possible data corruption
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Linux Memory Management
- Primer: Kernel Troubleshooting
- Footguns: Datacenter