Incident Replay: Serial Console Output Garbled
Setup
- System context: Headless server in a remote datacenter. Primary access is via serial-over-LAN (SOL) through iDRAC. The server is not responding to SSH, and the SOL serial console shows garbled text.
- Time: Saturday 01:30 UTC
- Your role: On-call SRE (remote, no datacenter access)
Round 1: Alert Fires
[Pressure cue: "Server app-remote-03 is not responding to health checks. SSH times out. The only access is the iDRAC serial console, and the output is garbage characters."]
What you see:
Connecting with ipmitool sol activate shows streams of garbled characters: misaligned text, wrong symbols, and line breaks in the wrong places. You cannot read any output or type commands.
Choose your action:
- A) Reboot the server via iDRAC power control
- B) Check the serial console baud rate settings
- C) Try the iDRAC virtual console (graphical KVM) instead
- D) Reset the iDRAC BMC
If you chose B (recommended):
[Result: ipmitool sol info 1 shows the SOL bit rate as 19.2 kbps (19200 baud), but the server's GRUB and kernel are configured for 115200 baud. The mismatch is garbling the output. Proceed to Round 2.]
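To spot this mismatch from a script, a minimal POSIX shell sketch can parse the ipmitool sol info output and convert the reported kbps value to baud. It assumes the output contains lines such as "Volatile Bit Rate (kbps) : 19.2", as common ipmitool builds print; the function names kbps_to_baud and sol_bit_rate_baud are hypothetical, not part of any tool.

```shell
# Sketch (assumed output format): lines like
#   Volatile Bit Rate (kbps)        : 19.2
#   Non-Volatile Bit Rate (kbps)    : 19.2

# Convert an ipmitool kbps token to a baud rate; "unknown" otherwise.
kbps_to_baud() {
  case "$1" in
    9.6)   echo 9600 ;;
    19.2)  echo 19200 ;;
    38.4)  echo 38400 ;;
    57.6)  echo 57600 ;;
    115.2) echo 115200 ;;
    *)     echo unknown ;;
  esac
}

# Read `ipmitool sol info` output on stdin and print the baud rate for one
# field: "Volatile" (default) or "Non-Volatile".
sol_bit_rate_baud() {
  local field="${1:-Volatile}" kbps
  kbps=$(awk -F: -v f="$field Bit Rate" '
    { sub(/^[ \t]+/, "", $1) }                 # tolerate leading indentation
    index($1, f) == 1 { gsub(/[ \t]/, "", $2); print $2; exit }')
  kbps_to_baud "$kbps"
}
```

For example, ipmitool sol info 1 | sol_bit_rate_baud Non-Volatile (with your usual -I lanplus -H/-U/-P connection options) prints the persisted rate, which is the value that survives a BMC reset.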
If you chose A:
[Result: Server reboots but the serial console is still garbled — the baud rate mismatch persists. You cannot even see POST output correctly.]
If you chose C:
[Result: iDRAC graphical console shows the server is at a kernel panic screen. Useful — you can now see the actual problem. But you still need serial console for future headless access. Partial win.]
If you chose D:
[Result: iDRAC reset takes 3 minutes. Serial console is still garbled after reset because the baud rate setting is persistent.]
Round 2: First Triage Data
[Pressure cue: "You can see via KVM that the server had a kernel panic. But you need serial console working for ongoing management."]
What you see:
The SOL baud rate was reset to 19200 (the default) by a recent iDRAC firmware update. The OS is configured for 115200: the kernel command line in the GRUB configuration includes console=ttyS0,115200, and /etc/default/grub carries the same setting.
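The OS side of the comparison lives on the kernel command line. This sketch (the function name kernel_serial_baud is hypothetical) pulls the baud rate from the last console=ttyS entry on a command line string, assuming the common console=ttyS0,115200 or console=ttyS0,115200n8 form.

```shell
# Print the baud rate from the last console=ttyS<N>,<baud>[parity/bits]
# entry in "$1", or "none" if there is no serial console entry.
# Runs in a subshell so `set -f` does not leak into the caller.
kernel_serial_baud() (
  set -f                        # no pathname globbing while splitting words
  baud=""
  for arg in $1; do
    case "$arg" in
      console=ttyS*,*)
        baud=${arg#*,}          # "console=ttyS0,115200n8" -> "115200n8"
        baud=${baud%%[!0-9]*}   # "115200n8" -> "115200"
        ;;
    esac
  done
  echo "${baud:-none}"
)
```

Once the OS is up, kernel_serial_baud "$(cat /proc/cmdline)" can be compared against the BMC's reported rate; a difference between the two reproduces the condition behind this incident.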
Choose your action:
- A) Change the SOL baud rate to 115200 via ipmitool
- B) Change the OS serial console config to 19200 to match the BMC
- C) Set both sides to 9600 for maximum compatibility
- D) Disable serial console and rely on iDRAC KVM only
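If you chose A (recommended):
[Result: Run ipmitool sol set volatile-bit-rate 115.2 1 and ipmitool sol set non-volatile-bit-rate 115.2 1 (ipmitool takes the rate in kbps, so 115200 baud is written as 115.2). Setting the non-volatile rate as well keeps the change across BMC resets. Serial console output is now readable, and you can see the kernel panic message. Proceed to Round 3.]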
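Because ipmitool expects the SOL bit rate as a kbps token rather than a baud number, a small sketch can do the translation and print, without executing, the pair of commands for channel 1 as used in this scenario. The helper names baud_to_ipmi_rate and sol_rate_commands are hypothetical.

```shell
# Map a baud rate to the kbps token used by `ipmitool sol set *-bit-rate`.
baud_to_ipmi_rate() {
  case "$1" in
    9600)   echo 9.6 ;;
    19200)  echo 19.2 ;;
    38400)  echo 38.4 ;;
    57600)  echo 57.6 ;;
    115200) echo 115.2 ;;
    *)      return 1 ;;   # not one of the SOL bit rates ipmitool accepts
  esac
}

# Print both commands (current and persisted rate); channel defaults to 1.
sol_rate_commands() {
  local rate channel
  rate=$(baud_to_ipmi_rate "$1") || return 1
  channel=${2:-1}
  echo "ipmitool sol set volatile-bit-rate $rate $channel"
  echo "ipmitool sol set non-volatile-bit-rate $rate $channel"
}
```

Printing rather than running is a deliberate choice: the output can be reviewed, pasted into a runbook, or piped to sh once the connection options are added.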
If you chose B:
[Result: Requires booting the OS to change the config, but the OS has panicked. Chicken-and-egg problem.]
If you chose C:
[Result: 9600 is slow and unnecessary. 115200 is the standard for modern servers.]
If you chose D:
[Result: KVM works but serial console is needed for boot-level debugging and automation. Do not give up a management tool.]
Round 3: Root Cause Identification
[Pressure cue: "Serial console is fixed. Now deal with the kernel panic."]
What you see: The kernel panic message reads: "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)". The kernel found no block device to mount as the root filesystem. This happened after a kernel update whose initramfs was not rebuilt to include the RAID controller driver.
Choose your action:
- A) Boot from the previous kernel version via GRUB
- B) Boot from a rescue image and rebuild initramfs
- C) Reinstall the OS from PXE
- D) Check if the RAID array is degraded via iDRAC hardware logs
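If you chose A (recommended):
[Result: Over the serial console, interrupt GRUB and select the previous kernel. The server boots successfully. Then rebuild the initramfs for the new kernel, passing its version explicitly, because uname -r now reports the previous kernel you just booted:
dracut --force /boot/initramfs-<new-kernel-version>.img <new-kernel-version>. Proceed to Round 4.]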
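Since the server is now running the previous kernel, uname -r cannot name the rebuild target. One way to find it, assuming the newly installed kernel is the highest installed version and that versions can be listed from /lib/modules, is a GNU sort -V over the version strings. The helper names are hypothetical, and the sketch only prints the dracut command.

```shell
# Read kernel versions (one per line) on stdin; print the highest one.
newest_kver() {
  sort -V | tail -n 1
}

# Print (do not run) the dracut rebuild command for kernel version "$1".
dracut_rebuild_cmd() {
  echo "dracut --force /boot/initramfs-$1.img $1"
}
```

Usage might look like kver=$(ls /lib/modules | newest_kver), then a check that "$kver" differs from "$(uname -r)" before running the printed command.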
If you chose B:
[Result: Works but requires PXE or USB rescue boot. More complex when you only have remote access.]
If you chose C:
[Result: Overkill — the server is fine, just needs the right kernel or initramfs fix.]
If you chose D:
[Result: RAID is healthy — the issue is the initramfs missing the RAID driver for the new kernel. Hardware is fine.]
Round 4: Remediation
[Pressure cue: "Server is back. Prevent recurrence."]
Actions:
1. Verify server is running and services are healthy
2. Rebuild initramfs for the new kernel and reboot to verify
3. Add SOL baud rate to the post-firmware-update verification checklist
4. Add initramfs rebuild verification to the kernel update automation
5. Document the serial console configuration standard (115200 baud) for all servers
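The initramfs verification in action 4 could be automated by inspecting the rebuilt image. This sketch reads a file listing, such as the output of dracut's lsinitrd tool, and checks that a given module file is present, compressed or not. The function name is hypothetical, and megaraid_sas below is only an example RAID driver; substitute the module your controller actually needs.

```shell
# Succeed if the listing on stdin contains <module>.ko, optionally
# compressed as .ko.xz, .ko.zst or .ko.gz. Usage: initramfs_has_module NAME
initramfs_has_module() {
  grep -Eq "/$1\.ko(\.(xz|zst|gz))?\$"
}
```

In automation this might read: lsinitrd /boot/initramfs-<new-kernel-version>.img | initramfs_has_module megaraid_sas || echo "RAID driver missing from initramfs", failing the kernel update before the next reboot rather than after it.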
Damage Report
- Total downtime: 45 minutes (kernel panic + diagnosis + recovery)
- Blast radius: Single application server; traffic rerouted to other instances
- Optimal resolution time: 15 minutes (fix baud rate -> read panic -> boot old kernel)
- If every wrong choice was made: 90+ minutes with blind troubleshooting and unnecessary reinstalls
Cross-References
- Primer: Datacenter & Server Hardware
- Primer: IPMI & ipmitool
- Primer: Linux Boot Process
- Footguns: Datacenter