Incident Replay: Server Intermittent Reboots

Setup

  • System context: Production application server that has rebooted unexpectedly 3 times in the past week. No pattern in timing. Workload is a Java application with 64GB heap.
  • Time: Monday 14:20 UTC
  • Your role: On-call SRE / systems engineer

Round 1: Alert Fires

[Pressure cue: "Server app-prod-11 just rebooted again — third time this week. Application team is losing confidence. They want root cause before the next one."]

What you see: last reboot shows 3 unplanned reboots in the past 7 days at irregular intervals (gaps of roughly 2 and 3 days). journalctl --list-boots shows no graceful shutdown entries; the reboots are hard resets.
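The interval check can be scripted. A minimal sketch using hypothetical reboot timestamps (on the real host these would come from last reboot or journalctl --list-boots):

```shell
# Sketch: measure gaps between unplanned reboots.
# The timestamps below are hypothetical sample data; on the real host
# they would be pulled from `last reboot` or `journalctl --list-boots`.
timestamps="2024-05-06T03:12 2024-05-08T11:47 2024-05-11T02:30"

prev=""
for t in $timestamps; do
  if [ -n "$prev" ]; then
    # GNU date: convert to epoch seconds and report the gap in whole hours
    gap_h=$(( ( $(date -d "$t" +%s) - $(date -d "$prev" +%s) ) / 3600 ))
    echo "gap before $t: ${gap_h}h"
  fi
  prev="$t"
done
```

Irregular gaps like these argue against a scheduled job (cron, watchdog misfire) and toward a load- or hardware-triggered cause.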

Choose your action:

  • A) Check system logs for kernel panics or MCE (Machine Check Exception) events
  • B) Check the application logs for crash-before-reboot patterns
  • C) Replace the server hardware preemptively
  • D) Add a watchdog timer to capture more data on the next reboot

If you chose A:

[Result: journalctl -b -1 -p 0..3 from the last boot shows nothing; the log ends abruptly. But mcelog --client shows 3 Machine Check Exception events correlating with the reboot times. MCE type: "corrected memory error threshold exceeded, escalated to fatal." Proceed to Round 2.]
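Counting the escalations can be done with a simple filter. A sketch against a hypothetical excerpt shaped like mcelog output (the log text below is illustrative, not real mcelog output):

```shell
# Sketch: count MCE records that escalated to fatal.
# mce_log is a hypothetical excerpt shaped like mcelog output;
# on the real host you would pipe `mcelog --client` instead.
mce_log="Hardware event. This is not a software error.
MCE 0
TIME 1715136720
corrected memory error threshold exceeded, escalated to fatal
MCE 1
TIME 1715340420
corrected memory error threshold exceeded, escalated to fatal"

fatal_count=$(printf '%s\n' "$mce_log" | grep -c 'escalated to fatal')
echo "threshold escalations: $fatal_count"
```

Matching the TIME fields against the reboot timestamps is what ties the MCEs to the resets.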

If you chose B:

[Result: Application logs show no errors before the reboots — the process was killed mid-operation. This is a hardware-level event, not an application crash.]

If you chose C:

[Result: You have not identified the component. Replacing the whole server is wasteful and takes hours of migration.]

If you chose D:

[Result: Watchdog timer is useful for future diagnosis but does not explain the 3 reboots that already happened. mcelog already has the data.]

Round 2: First Triage Data

[Pressure cue: "MCE events found. What hardware component is failing?"]

What you see: MCE events point to memory bank 3, DIMM slot C2. edac-util --status shows 847 correctable errors on this DIMM. iDRAC hardware log shows "Memory ECC threshold exceeded" entries matching the reboot times.
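Mapping an MCE bank number to a physical slot usually goes through dmidecode. A sketch against a hypothetical dmidecode -t 17 style excerpt (slot contents and part numbers below are made up):

```shell
# Sketch: pull the DIMM details for slot C2 out of `dmidecode -t 17`
# style output. dmi_out is a hypothetical excerpt; on the real host,
# substitute the live `dmidecode -t 17` output.
dmi_out="Memory Device
  Size: 32 GB
  Locator: C1
  Part Number: EXAMPLE-PN-0001
Memory Device
  Size: 32 GB
  Locator: C2
  Part Number: EXAMPLE-PN-0002"

part=$(printf '%s\n' "$dmi_out" | grep -A1 'Locator: C2' \
  | awk -F': ' '/Part Number/ {print $2}')
echo "DIMM in C2: $part"
```

The part number is what you quote on the warranty claim later.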

Choose your action:

  • A) Schedule immediate DIMM replacement
  • B) Run memtest86 to confirm the DIMM is faulty
  • C) Disable the DIMM in BIOS to prevent further reboots until replacement
  • D) Increase the MCE error threshold to prevent the escalation to fatal

If you chose A:

[Result: Plan the replacement: migrate the workload to another server, replace DIMM C2, run diagnostics, restore. The MCE data is clear: the DIMM is failing, and no additional testing is needed. Proceed to Round 3.]

If you chose B:

[Result: memtest86 takes 6+ hours on a 256GB server and requires downtime. The MCE data already confirms the issue. Unnecessary.]

If you chose C:

[Result: Disabling the DIMM removes at least 32GB of RAM, and on many platforms maps out the entire channel, costing even more. With that much less headroom, the 64GB Java heap risks OOM kills. Not viable without reducing the heap first.]

If you chose D:

[Result: Increasing the threshold masks the problem. The DIMM will continue to degrade and may produce uncorrectable errors that corrupt data silently.]

Round 3: Root Cause Identification

[Pressure cue: "DIMM identified. Execute the fix."]

What you see: Root cause: DIMM C2 is physically failing. Correctable ECC errors accumulated beyond the threshold, causing the kernel's MCE handler to trigger a fatal machine check and force a hardware reset. The server's error policy is set to "panic on threshold" which is correct but aggressive.

Choose your action:

  • A) Replace the DIMM and adjust MCE policy to log-and-alert before panic
  • B) Replace the DIMM only
  • C) Replace the DIMM and add proactive DIMM health monitoring
  • D) Replace all DIMMs in the same bank as a precaution

If you chose C:

[Result: DIMM replaced. Proactive monitoring added: alert at 50 correctable errors/24hr, well before the panic threshold of 1000. Early warning prevents future surprise reboots. Proceed to Round 4.]
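The early-warning rule can be a small cron check. A sketch with the error count hard-coded as sample data (a real check would parse it from edac-util --status):

```shell
# Sketch of the proposed early-warning check: alert when correctable
# ECC errors seen in the last 24h reach the alert threshold. ce_count
# is hard-coded sample data; a real check would parse `edac-util --status`.
ce_count=847
alert_threshold=50   # alert level from this runbook; panic threshold is 1000

if [ "$ce_count" -ge "$alert_threshold" ]; then
  status="ALERT: $ce_count correctable ECC errors in 24h (threshold $alert_threshold)"
else
  status="OK: $ce_count correctable ECC errors in 24h"
fi
echo "$status"
```

Wired into the existing alerting pipeline, this fires days before the kernel's panic threshold is reached.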

If you chose A:

[Result: Loosening the MCE policy so it no longer panics is dangerous: if the errors become uncorrectable, you get silent data corruption instead of a clean crash.]

If you chose B:

[Result: Fixes the immediate issue but does not improve early detection.]

If you chose D:

[Result: Replacing healthy DIMMs is wasteful. Only the failing one needs replacement.]

Round 4: Remediation

[Pressure cue: "Server restored. Verify stability."]

Actions:

  1. Verify the new DIMM is detected: dmidecode -t 17 | grep -A5 "Locator: C2"
  2. Verify zero MCE events: mcelog --client
  3. Verify zero ECC errors: edac-util --status
  4. Monitor for 48 hours before declaring the server stable
  5. File a warranty claim for the failed DIMM
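Steps 2 and 3 can be wrapped in a pass/fail helper. A sketch run against hypothetical edac-util-style sample output (the real script would capture the live command instead):

```shell
# Sketch: post-replacement verification. edac_out is hypothetical sample
# output shaped like `edac-util --status` reporting; on the real host,
# capture the live command output instead.
edac_out="mc0: 0 Uncorrected Errors
mc0: 0 Corrected Errors"

# Sum the error counts (field 2 on each line) and report a verdict
total=$(printf '%s\n' "$edac_out" | awk '{sum += $2} END {print sum}')
if [ "$total" -eq 0 ]; then
  verdict="PASS: zero ECC errors since DIMM replacement"
else
  verdict="FAIL: $total ECC errors logged; investigate before declaring stable"
fi
echo "$verdict"
```

Running this daily through the 48-hour observation window gives an auditable record for closing the incident.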

Damage Report

  • Total downtime: 3x ~5 minutes (3 unplanned reboots) + 20 minutes planned maintenance
  • Blast radius: Application service interruptions during each reboot; potential data integrity concerns for in-flight transactions
  • Optimal resolution time: 60 minutes (diagnose MCE -> identify DIMM -> replace)
  • If every wrong choice was made: Days of continued intermittent reboots, possible data corruption

Cross-References