Kernel Troubleshooting Footguns
- Ignoring dmesg warnings because "the server is running fine." dmesg shows EXT4-fs error messages and SMART warnings about reallocated sectors. The server is still serving traffic, so nobody investigates. Two weeks later the disk fails, the filesystem is corrupted, and recovery takes a full day because the backups were on the same volume.
Fix: Monitor dmesg for errors as part of your standard health checks.
Alert on any message at level err or above. Treat disk warnings
(reallocated sectors, pending sectors, I/O errors) as urgent — they mean
the disk is actively failing. Replace the disk, don't wait for full failure.
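The health check above can be sketched as a log filter. scan_kernel_log and its patterns are illustrative, not exhaustive; in production you would feed it the output of dmesg --level=emerg,alert,crit,err (or journalctl -k -p err) rather than the sample lines shown here:

```shell
#!/bin/sh
# Health-check sketch: flag kernel-log lines that indicate failing disks
# or filesystem errors. Patterns are illustrative, not exhaustive.
scan_kernel_log() {
    grep -E 'EXT4-fs error|I/O error|critical medium error'
}

# In production: dmesg --level=emerg,alert,crit,err | scan_kernel_log
sample='EXT4-fs error (device sda1): ext4_find_entry:1455: inode #2: comm ls: reading directory lblock 0
usb 1-1: new high-speed USB device number 2'

# Prints only the EXT4-fs error line; the benign USB line is filtered out.
echo "$sample" | scan_kernel_log
```

Pair this with periodic smartctl -A checks so nonzero reallocated or pending sector counts page someone before the disk dies.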
- No kdump configured — discovering this during a panic. The kernel panics. You reboot. The only evidence is a vague "machine restarted unexpectedly" and whatever the console captured (often nothing if IPMI isn't configured). Without a crash dump, root cause analysis is pure guesswork.
Fix: Configure kdump on every production system as part of your base
image. Verify it's running with kdumpctl status. Test it during initial
provisioning by triggering a test panic in a maintenance window. A crash
dump you can't capture is forensic evidence destroyed.
Under the hood: kdump works by reserving a separate chunk of memory at boot time (the crashkernel= parameter). When the kernel panics, the main kernel's memory is frozen and a second "crash kernel" boots from the reserved memory. This crash kernel has access to the dead kernel's memory and writes it to disk as a vmcore. The crash kernel needs its own initramfs, its own drivers, and enough reserved memory to boot. This is why kdump must be configured and tested in advance — it's a second kernel standing by for emergencies.
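A minimal sketch of verifying that the reservation is actually armed; has_crashkernel is a hypothetical helper that reads the kernel command line on stdin so the logic can be exercised against samples, and the commented kdumpctl line applies to RHEL-style systems:

```shell
#!/bin/sh
# Sketch: confirm a crash-kernel memory reservation is present.
# has_crashkernel is an illustrative helper; feed it /proc/cmdline in production.
has_crashkernel() {
    grep -q 'crashkernel='
}

# Production checks (RHEL-style; adjust per distro):
#   has_crashkernel < /proc/cmdline || echo "no crashkernel= reservation"
#   kdumpctl status    # should report that kdump is operational

echo 'BOOT_IMAGE=/vmlinuz-5.14.0 root=/dev/sda2 crashkernel=512M' | has_crashkernel \
    && echo "crashkernel reserved"
```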
- Panicking at panics — rebooting without reading the message. The system panics, you see scary text on the console, and you immediately power cycle. The panic message told you exactly which kernel module caused the crash and what memory address was involved. That information is now gone because you didn't read it or capture it.
Fix: If the system has halted (panic=0), read the entire panic output on the console. Photograph it if you have to. If kdump captured a vmcore, analyze it before rebooting again. The panic message is your best diagnostic data — don't throw it away.
- Wrong crash kernel memory reservation. kdump is installed but crashkernel=64M on a 256GB server. When the panic occurs, the crash kernel can't boot because 64M isn't enough memory. The crash dump isn't captured. You get the overhead of reserving memory plus zero benefit.
Fix: Size crashkernel appropriately: 256M for most servers (4-64GB
RAM), 512M for large systems (64GB-1TB), 1G for very large systems. After
configuring, test with echo c > /proc/sysrq-trigger during a maintenance
window to verify the dump is actually captured.
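The sizing rule above can be encoded as a small helper. crashkernel_size is a hypothetical name; the thresholds follow the guidance in the text and should be tuned for your fleet:

```shell
#!/bin/sh
# Sketch: pick a crashkernel reservation from total RAM in GB,
# following the rule of thumb above (256M up to 64GB, 512M up to 1TB,
# 1G beyond that).
crashkernel_size() {
    ram_gb=$1
    if [ "$ram_gb" -le 64 ]; then
        echo 256M
    elif [ "$ram_gb" -le 1024 ]; then
        echo 512M
    else
        echo 1G
    fi
}

crashkernel_size 256   # prints 512M for a 256GB server
```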
- SysRq disabled in production. The system hangs. SSH is dead. You need to dump task states to diagnose the hang, or do a clean reboot. But kernel.sysrq=0 because someone thought it was a "security risk." Now your only option is a hard power cycle via IPMI, losing all diagnostic data.
Fix: Enable SysRq in production: kernel.sysrq=1 (or a bitmask for
specific functions). SysRq requires physical console access or root-level
access to /proc/sysrq-trigger. If an attacker has either, SysRq is the
least of your problems.
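A sketch of the fix as a sysctl drop-in; the file path is the usual convention and the bitmask values come from the kernel's SysRq documentation:

```
# /etc/sysctl.d/99-sysrq.conf: example drop-in enabling SysRq.
# kernel.sysrq = 1 enables everything; a bitmask narrows it, e.g.
# 176 = 16 (sync) + 32 (remount read-only) + 128 (reboot/poweroff).
kernel.sysrq = 1
```

Apply with sysctl --system. During a hang, Alt-SysRq-t on the console, or echo t > /proc/sysrq-trigger as root, dumps task states to the kernel log.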
- Treating OOM kills as random events instead of investigating root cause. The OOM killer kills your Java process. You restart it. It gets killed again next week. You add a cron job to restart it. The actual problem is a memory leak, or the JVM heap is sized at 80% of system RAM leaving nothing for the page cache.
Fix: When an OOM kill occurs, investigate: What was the process's RSS? Is it growing over time (leak)? Are resource limits correct? Is the system right-sized for its workload? Fix the memory pressure, don't just restart the victim.
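The first investigation step can be sketched as a log parser. oom_victims is an illustrative helper; feed it real dmesg output in production instead of the sample line shown:

```shell
#!/bin/sh
# Sketch: extract OOM-kill victims and their resident set size from the
# kernel log. The kernel reports anon-rss in kB on the "Killed process" line.
oom_victims() {
    grep -oE 'Killed process [0-9]+ \([^)]+\).*anon-rss:[0-9]+kB'
}

sample='Out of memory: Killed process 4321 (java) total-vm:9000000kB, anon-rss:7340032kB, file-rss:0kB'
echo "$sample" | oom_victims
```

Track anon-rss across successive kills: if the number grows each time, you are looking at a leak, not bad luck.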
- Running fsck on a mounted filesystem. You see filesystem errors in dmesg and run fsck /dev/sda1 while it's still mounted. This can cause catastrophic data corruption — fsck rewrites filesystem metadata while the kernel, which still has the filesystem mounted, is modifying the same structures.
Fix: Never run fsck on a mounted filesystem. Unmount first, or boot
into rescue mode. For root filesystem checks, use touch /forcefsck && reboot
or add fsck.mode=force to the kernel command line.
Gotcha: Some fsck implementations (particularly e2fsck on ext4) will print a warning that the filesystem is mounted and refuse to run — but only in interactive mode. When scripted with -y (auto-yes), older versions would proceed anyway. Modern e2fsck is safer, but xfs_repair will simply refuse if the filesystem is mounted. Never assume the tool will protect you — always check mount output first.
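The "check mount output first" advice can be wrapped in a guard. is_mounted is a hypothetical helper; it takes the mount-table path as an argument only so the check can be tested against sample data (use /proc/mounts for real checks):

```shell
#!/bin/sh
# Sketch: refuse to fsck a device that appears in the mount table.
is_mounted() {
    dev=$1
    mounts=$2
    grep -q "^$dev " "$mounts"
}

# Production usage (commented out; destructive if misused):
#   if is_mounted /dev/sda1 /proc/mounts; then
#       echo "refusing: /dev/sda1 is mounted" >&2
#   else
#       fsck -y /dev/sda1
#   fi
```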
- Ignoring soft lockup warnings. dmesg shows "BUG: soft lockup - CPU#3 stuck for 22s!" repeatedly. The system seems to recover each time, so it's filed under "weird but not critical." Soft lockups indicate a CPU was stuck in kernel code without yielding — often a precursor to a full hang.
Fix: Investigate soft lockups immediately. Check which function the CPU was stuck in (dmesg shows the backtrace). Common causes: buggy NIC/storage drivers, heavy interrupt load, hypervisor CPU steal on overcommitted hosts. Report to your vendor with the backtrace.
Debug clue: The soft lockup backtrace in dmesg shows the function where the CPU was stuck. If it's in a network driver (e.g., ixgbe_xmit_frame), the NIC or driver is the culprit. If it's in a filesystem function (e.g., ext4_readdir), check disk health. If it's in native_queued_spin_lock_slowpath, a spinlock is contended — likely an overloaded system or a kernel bug. On VMs, check steal time in top — soft lockups on VMs often mean the hypervisor is starving the guest.
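The debug clue above can be sketched as a classifier; classify_lockup and its patterns are illustrative, not exhaustive:

```shell
#!/bin/sh
# Sketch: classify a soft-lockup backtrace frame by subsystem,
# following the triage rules in the debug clue.
classify_lockup() {
    case "$1" in
        *ixgbe*|*e1000*|*mlx*)    echo "network driver" ;;
        *ext4_*|*xfs_*|*btrfs_*)  echo "filesystem (check disk health)" ;;
        *queued_spin_lock*)       echo "spinlock contention" ;;
        *)                        echo "unknown (send backtrace to vendor)" ;;
    esac
}

classify_lockup native_queued_spin_lock_slowpath   # prints: spinlock contention
```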
- Upgrading the kernel without testing kdump compatibility. You update the kernel. kdump was working before. After the reboot, the new kernel runs but kdump fails to start because the crash kernel image is incompatible or the debug symbols package wasn't updated.
Fix: After every kernel update, verify that kdumpctl status reports
operational and that the crash tool can open the matching vmlinux debug
symbols. Include kdump verification in your kernel update runbook.
Install the kernel-debuginfo package that matches the new kernel.
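A sketch of the post-upgrade check. debuginfo_matches is hypothetical, the path follows the common RHEL debuginfo layout (adjust per distro), and the optional root argument exists only to make the check testable against a scratch tree:

```shell
#!/bin/sh
# Sketch: the vmlinux with debug symbols must match the running kernel,
# or `crash` cannot analyze the next vmcore.
debuginfo_matches() {
    kver=$1
    root=${2:-}
    [ -e "$root/usr/lib/debug/lib/modules/$kver/vmlinux" ]
}

# In a real runbook:
#   debuginfo_matches "$(uname -r)" || echo "install kernel-debuginfo for $(uname -r)"
#   kdumpctl status    # should report that kdump is operational
```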
- Using echo c > /proc/sysrq-trigger outside a maintenance window. You're testing kdump and trigger a real kernel panic on a production box serving traffic. The system goes down. If kdump fails, you've caused an outage with no diagnostic benefit whatsoever.
Fix: Only test kdump crashes during scheduled maintenance windows with traffic drained. Use a staging or pre-production system for initial kdump testing. When testing on production, ensure the system is removed from the load balancer first.
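The fix can be enforced with a small pre-flight gate; crash_test_allowed and the DRAINED variable are stand-ins for a real load-balancer drain check, and the trigger itself stays commented out:

```shell
#!/bin/sh
# Sketch: gate the kdump crash test behind an explicit drain confirmation.
crash_test_allowed() {
    [ "$1" = "yes" ]
}

if crash_test_allowed "${DRAINED:-no}"; then
    echo "host drained; safe to trigger the test panic"
    # echo c > /proc/sysrq-trigger    # the point of no return
else
    echo "refusing: drain the host from the load balancer first" >&2
fi
```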