Incident Replay: Server Intermittent Reboots¶
Setup¶
- System context: Production application server that has rebooted unexpectedly 3 times in the past week. No pattern in timing. Workload is a Java application with 64GB heap.
- Time: Monday 14:20 UTC
- Your role: On-call SRE / systems engineer
Round 1: Alert Fires¶
[Pressure cue: "Server app-prod-11 just rebooted again — third time this week. Application team is losing confidence. They want root cause before the next one."]
What you see:
last reboot shows 3 unplanned reboots in 7 days at irregular intervals — roughly 2 to 3 days apart. journalctl --list-boots shows no graceful shutdown entries — the reboots are hard resets.
Choose your action:
- A) Check system logs for kernel panics or MCE (Machine Check Exception) events
- B) Check the application logs for crash-before-reboot patterns
- C) Replace the server hardware preemptively
- D) Add a watchdog timer to capture more data on the next reboot
If you chose A (recommended):¶
[Result: journalctl -b -1 -p 0..3 from the last boot shows nothing — the log ends abruptly. But mcelog --client shows 3 Machine Check Exception events correlating with the reboot times. MCE type: "corrected memory error threshold exceeded, escalated to fatal." Proceed to Round 2.]
If you chose B:¶
[Result: Application logs show no errors before the reboots — the process was killed mid-operation. This is a hardware-level event, not an application crash.]
If you chose C:¶
[Result: You have not identified the component. Replacing the whole server is wasteful and takes hours of migration.]
If you chose D:¶
[Result: Watchdog timer is useful for future diagnosis but does not explain the 3 reboots that already happened. mcelog already has the data.]
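The correlation that closes Round 1 — hard resets lining up with MCE timestamps — can be checked mechanically. A minimal sketch using sample timestamps (illustrative stand-ins for last reboot and mcelog output, not real data; assumes GNU date for -d parsing):

```shell
#!/bin/sh
# Sample reboot times (as reported by `last reboot`) and MCE event times
# (as reported by mcelog). All values below are illustrative.
cat > /tmp/reboots.txt <<'EOF'
2024-03-04 02:13
2024-03-06 11:47
2024-03-09 05:02
EOF
cat > /tmp/mce.txt <<'EOF'
2024-03-04 02:12
2024-03-06 11:46
2024-03-09 05:01
EOF

# For each reboot, report any MCE logged in the preceding hour.
MATCHES=$(while read -r r; do
  rs=$(date -d "$r" +%s)
  while read -r m; do
    ms=$(date -d "$m" +%s)
    d=$((rs - ms))
    if [ "$d" -ge 0 ] && [ "$d" -le 3600 ]; then
      echo "reboot $r preceded by MCE at $m"
    fi
  done < /tmp/mce.txt
done < /tmp/reboots.txt)
echo "$MATCHES"
```

Three matches out of three reboots is the signal that this is a hardware event, not an application crash.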
Round 2: First Triage Data¶
[Pressure cue: "MCE events found. What hardware component is failing?"]
What you see:
MCE events point to memory bank 3, DIMM slot C2. edac-util -v shows 847 correctable errors on this DIMM. The iDRAC hardware log shows "Memory ECC threshold exceeded" entries matching the reboot times.
Choose your action:
- A) Schedule immediate DIMM replacement
- B) Run memtest86 to confirm the DIMM is faulty
- C) Disable the DIMM in BIOS to prevent further reboots until replacement
- D) Increase the MCE error threshold to prevent the escalation to fatal
If you chose A (recommended):¶
[Result: Plan the replacement: migrate workload to another server, replace DIMM C2, run diagnostics, restore. The MCE data is clear — the DIMM is failing. No need for additional testing. Proceed to Round 3.]
If you chose B:¶
[Result: memtest86 takes 6+ hours on a 256GB server and requires downtime. The MCE data already confirms the issue. Unnecessary.]
If you chose C:¶
[Result: Most server BIOSes cannot disable a single DIMM; the practical equivalent is pulling it, which removes 32GB, can violate the channel-population rules the memory controller expects, and requires the same downtime window as a replacement. Not a viable workaround.]
If you chose D:¶
[Result: Increasing the threshold masks the problem. The DIMM will continue to degrade and may produce uncorrectable errors that corrupt data silently.]
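Once EDAC data is in play, the failing module is simply the one with the outsized correctable-error count. A minimal sketch that ranks DIMMs by count, using sample lines shaped like edac-util -v output (the exact format varies by EDAC driver and platform):

```shell
#!/bin/sh
# Sample per-DIMM correctable-error counts; real output format varies by driver.
cat > /tmp/edac.txt <<'EOF'
mc0: csrow0: DIMM_A1: 0 Corrected Errors
mc0: csrow1: DIMM_B1: 2 Corrected Errors
mc0: csrow2: DIMM_C2: 847 Corrected Errors
mc0: csrow3: DIMM_D2: 1 Corrected Errors
EOF

# Pick the DIMM with the highest count (field 3 = label, field 4 = count).
WORST=$(awk -F': ' '{ n = $4 + 0; if (n > max) { max = n; dimm = $3 } }
                    END { print dimm, max }' /tmp/edac.txt)
echo "$WORST"
```

A count three orders of magnitude above its neighbors, as here, is why no further confirmation testing (option B) is needed.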
Round 3: Root Cause Identification¶
[Pressure cue: "DIMM identified. Execute the fix."]
What you see: Root cause: DIMM C2 is physically failing. Correctable ECC errors accumulated beyond the threshold, causing the kernel's MCE handler to trigger a fatal machine check and force a hardware reset. The server's error policy is set to "panic on threshold", which is correct but aggressive.
Choose your action:
- A) Replace the DIMM and adjust MCE policy to log-and-alert before panic
- B) Replace the DIMM only
- C) Replace the DIMM and add proactive DIMM health monitoring
- D) Replace all DIMMs in the same bank as a precaution
If you chose C (recommended):¶
[Result: DIMM replaced. Proactive monitoring added: alert at 50 correctable errors/24hr (well before the panic threshold of 1000). Early warning prevents future surprise reboots. Proceed to Round 4.]
If you chose A:¶
[Result: Adjusting MCE policy to not panic is dangerous — if the errors become uncorrectable, you get silent data corruption instead of a clean crash.]
If you chose B:¶
[Result: Fixes the immediate issue but does not improve early detection.]
If you chose D:¶
[Result: Replacing healthy DIMMs is wasteful. Only the failing one needs replacement.]
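The early-warning rule from option C reduces to counter arithmetic: diff the correctable-error counter over a 24-hour window and alert well before the panic threshold. A sketch with the thresholds from this scenario and sample counter readings (the counter values are illustrative):

```shell
#!/bin/sh
# Thresholds from this scenario: alert at 50 CE/24h; the kernel panics near 1000.
ALERT_THRESHOLD=50

ce_prev=12   # sample: correctable-error counter reading 24 hours ago
ce_now=75    # sample: counter reading now
delta=$((ce_now - ce_prev))

if [ "$delta" -ge "$ALERT_THRESHOLD" ]; then
  STATUS="ALERT: $delta correctable errors in 24h (threshold $ALERT_THRESHOLD)"
else
  STATUS="OK: $delta correctable errors in 24h"
fi
echo "$STATUS"
```

Alerting on the rate, not the absolute count, is the design choice: a slowly accumulating lifetime count is normal, while a burst inside one day is the degradation pattern that preceded these reboots.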
Round 4: Remediation¶
[Pressure cue: "Server restored. Verify stability."]
Actions:
1. Verify new DIMM is detected: dmidecode -t 17 | grep -A5 "Locator: C2"
2. Verify zero MCE events: mcelog --client
3. Verify zero ECC errors: edac-util -v
4. Monitor for 48 hours before declaring stable
5. File warranty claim for the failed DIMM
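Steps 1-3 above can be wired into a single pass/fail check. A sketch against sample data — the dmidecode excerpt and the zero error count below stand in for live output captured after the replacement:

```shell
#!/bin/sh
# Sample excerpt in the shape of `dmidecode -t 17` output for the replaced slot.
cat > /tmp/dmi.txt <<'EOF'
Memory Device
        Size: 32 GB
        Locator: C2
EOF

ce_count=0   # sample: correctable-error count read after replacement

# PASS requires both: the slot is populated and detected, and counters are clean.
if grep -q "Locator: C2" /tmp/dmi.txt && [ "$ce_count" -eq 0 ]; then
  VERDICT="PASS: DIMM C2 detected, zero correctable errors"
else
  VERDICT="FAIL: re-check DIMM seating and error counters"
fi
echo "$VERDICT"
```

A clean result here still only starts the clock on step 4 — the 48-hour observation window is what actually declares the server stable.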
Damage Report¶
- Total downtime: 3x ~5 minutes (3 unplanned reboots) + 20 minutes planned maintenance
- Blast radius: Application service interruptions during each reboot; potential data integrity concerns for in-flight transactions
- Optimal resolution time: 60 minutes (diagnose MCE -> identify DIMM -> replace)
- If every wrong choice was made: Days of continued intermittent reboots, possible data corruption
Cross-References¶
- Primer: Datacenter & Server Hardware
- Primer: Linux Memory Management
- Primer: Kernel Troubleshooting
- Footguns: Datacenter