Kernel Troubleshooting Footguns

  1. Ignoring dmesg warnings because "the server is running fine." dmesg shows EXT4-fs error messages and SMART warnings about reallocated sectors. The server is still serving traffic so nobody investigates. Two weeks later the disk fails, the filesystem is corrupted, and recovery takes a full day because the backups were on the same volume.

Fix: Monitor dmesg for errors as part of your standard health checks. Alert on any message at level err or above. Treat disk warnings (reallocated sectors, pending sectors, I/O errors) as urgent — they mean the disk is actively failing. Replace the disk, don't wait for full failure.
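The alerting rule above can be sketched as a small filter. This assumes `dmesg --decode` output, where util-linux prefixes each line with its facility and level (e.g. `kern  :err   :`); the `page_oncall` hook in the usage line is hypothetical.

```shell
#!/bin/sh
# check_kernel_log: scan decoded kernel log text (the format emitted
# by `dmesg --decode`) for messages at level err or above.
# Prints the matching lines; exit status 0 means problems were found
# (grep semantics), 1 means the log is clean.
check_kernel_log() {
    printf '%s\n' "$1" | grep -E ':(emerg|alert|crit|err)[[:space:]]*:'
}

# Hypothetical cron usage:
#   if bad=$(check_kernel_log "$(dmesg --decode)"); then
#       page_oncall "$bad"
#   fi
```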

  1. No kdump configured — discovering this during a panic. The kernel panics. You reboot. The only evidence is a vague "machine restarted unexpectedly" and whatever the console captured (often nothing if IPMI isn't configured). Without a crash dump, root cause analysis is pure guesswork.

Fix: Configure kdump on every production system as part of your base image. Verify it's running with kdumpctl status. Test it during initial provisioning by triggering a test panic in a maintenance window. A crash dump you can't capture is forensic evidence destroyed.

Under the hood: kdump works by reserving a separate chunk of memory at boot time (crashkernel= parameter). When the kernel panics, the main kernel's memory is frozen and a second "crash kernel" boots from the reserved memory. This crash kernel has access to the dead kernel's memory and writes it to disk as a vmcore. The crash kernel needs its own initramfs, its own drivers, and enough reserved memory to boot. This is why kdump must be configured and tested in advance — it's a second kernel standing by for emergencies.
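A provisioning fragment to confirm the pieces above are in place, assuming a RHEL-family host (package and tool names are assumptions; Debian-family systems use kdump-tools instead):

```shell
sudo dnf install -y kexec-tools      # provides kdump and kdumpctl on RHEL-family
sudo systemctl enable --now kdump
kdumpctl status                      # expect something like "Kdump is operational"
grep -o 'crashkernel=[^ ]*' /proc/cmdline   # was memory actually reserved at boot?
```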

  1. Panicking at panics — rebooting without reading the message. The system panics, you see scary text on the console, and you immediately power cycle. The panic message told you exactly which kernel module caused the crash and what memory address was involved. That information is now gone because you didn't read it or capture it.

Fix: If the system has halted (panic=0), read the entire panic output on the console. Photograph it if you have to. If kdump captured a vmcore, analyze it before rebooting again. The panic message is your best diagnostic data — don't throw it away.

  1. Wrong crash kernel memory reservation. kdump is installed but crashkernel=64M on a 256GB server. When the panic occurs, the crash kernel can't boot because 64M isn't enough memory. The crash dump isn't captured. You get the overhead of reserving memory plus zero benefit.

Fix: Size crashkernel appropriately: 256M for most servers (4-64GB RAM), 512M for large systems (64GB-1TB), 1G for very large systems. After configuring, test with echo c > /proc/sysrq-trigger during a maintenance window to verify the dump is actually captured.
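One way to (re)size the reservation, assuming `grubby` on a RHEL-family host (Debian-family systems edit GRUB_CMDLINE_LINUX in /etc/default/grub instead). The reboot and test panic belong in a maintenance window with traffic drained:

```shell
sudo grubby --update-kernel=ALL --args="crashkernel=512M"
sudo reboot
# After reboot, maintenance window only: trigger a real panic.
echo c | sudo tee /proc/sysrq-trigger
# When the machine comes back, confirm the dump was captured:
ls /var/crash/
```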

  1. SysRq disabled in production. The system hangs. SSH is dead. You need to dump task states to diagnose the hang, or do a clean reboot. But kernel.sysrq=0 because someone thought it was a "security risk." Now your only option is a hard power cycle via IPMI, losing all diagnostic data.

Fix: Enable SysRq in production: kernel.sysrq=1 (or a bitmask for specific functions). SysRq requires physical console access or root-level access to /proc/sysrq-trigger. If an attacker has either, SysRq is the least of your problems.
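A persistent config fragment for the fix above; the bitmask values come from the kernel's SysRq documentation (Documentation/admin-guide/sysrq.rst):

```shell
cat <<'EOF' | sudo tee /etc/sysctl.d/90-sysrq.conf
# 1 = enable all SysRq functions.
# A restricted alternative: 176 = 16 (sync) + 32 (remount ro) + 128 (reboot)
kernel.sysrq = 1
EOF
sudo sysctl --system
```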

  1. Treating OOM kills as random events instead of investigating root cause. The OOM killer kills your Java process. You restart it. It gets killed again next week. You add a cron job to restart it. The actual problem is a memory leak, or the JVM heap is sized at 80% of system RAM leaving nothing for the page cache.

Fix: When an OOM kill occurs, investigate: What was the process's RSS? Is it growing over time (leak)? Are resource limits correct? Is the system right-sized for its workload? Fix the memory pressure, don't just restart the victim.
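A sketch of the RSS question above: pull the victim's anonymous RSS (in kB) out of an OOM-kill log line so it can be tracked over time. The log format varies across kernel versions; this targets the modern "Killed process" line, and the sample in the test is illustrative, not from a real host.

```shell
#!/bin/sh
# oom_rss_kb: extract the anon-rss value (kB) from a kernel OOM-kill
# log line. Prints nothing if the line doesn't match.
oom_rss_kb() {
    printf '%s\n' "$1" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p'
}

# Hypothetical usage against the journal:
#   journalctl -k | grep 'Out of memory' | while read -r l; do oom_rss_kb "$l"; done
```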

  1. Running fsck on a mounted filesystem. You see filesystem errors in dmesg and run fsck /dev/sda1 while the filesystem is still mounted. This can cause catastrophic data corruption: fsck rewrites on-disk metadata while the kernel is simultaneously modifying the same structures through the live mount.

Fix: Never run fsck on a mounted filesystem. Unmount first, or boot into rescue mode. For root filesystem checks, use touch /forcefsck && reboot or add fsck.mode=force to the kernel command line.

Gotcha: Some fsck implementations handle a mounted filesystem differently. e2fsck (on ext4) prints a prominent warning that the filesystem is mounted and prompts before continuing, but when scripted with -y (auto-yes), older versions would proceed anyway. Modern e2fsck is safer, and xfs_repair simply refuses to run on a mounted filesystem. Never assume the tool will protect you; always check mount output first.
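A guard sketch for the check above: refuse to fsck a device that appears in a mounts table. It takes the device and a mounts file (normally /proc/mounts); parameterizing the file path is an assumption made here to keep the check testable.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]: exit 0 if DEVICE appears as a
# mounted source in the mounts table, 1 otherwise.
is_mounted() {
    dev=$1; mounts=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# Hypothetical usage:
#   if is_mounted /dev/sda1; then
#       echo "refusing: /dev/sda1 is mounted" >&2
#   else
#       fsck /dev/sda1
#   fi
```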

  1. Ignoring soft lockup warnings. dmesg shows "BUG: soft lockup - CPU#3 stuck for 22s!" repeatedly. The system seems to recover each time, so it's filed under "weird but not critical." Soft lockups indicate a CPU was stuck in kernel code without yielding — often a precursor to a full hang.

Fix: Investigate soft lockups immediately. Check which function the CPU was stuck in (dmesg shows the backtrace). Common causes: buggy NIC/storage drivers, heavy interrupt load, hypervisor CPU steal on overcommitted hosts. Report to your vendor with the backtrace.

Debug clue: The soft lockup backtrace in dmesg shows the function where the CPU was stuck. If it's in a network driver (e.g., ixgbe_xmit_frame), the NIC or driver is the culprit. If it's in a filesystem function (e.g., ext4_readdir), check disk health. If it's in native_queued_spin_lock_slowpath, a spinlock is contended — likely an overloaded system or a kernel bug. On VMs, check steal time in top — soft lockups on VMs often mean the hypervisor is starving the guest.
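The triage table above can be mechanized as a first-pass hint. The function takes the symbol from the backtrace's RIP line; the driver and function prefixes here are illustrative examples, not an exhaustive list.

```shell
#!/bin/sh
# lockup_hint SYMBOL: given the function name where the CPU was stuck
# (from the soft-lockup backtrace in dmesg), suggest what to check first.
lockup_hint() {
    case "$1" in
        *spin_lock*)            echo "lock contention: overload or kernel bug" ;;
        ixgbe_*|e1000_*|mlx5_*) echo "network driver: check NIC firmware/driver" ;;
        ext4_*|xfs_*)           echo "filesystem: check disk health (SMART)" ;;
        *)                      echo "unknown: send full backtrace to vendor" ;;
    esac
}
```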

  1. Upgrading the kernel without testing kdump compatibility. You update the kernel. kdump was working before. After the reboot, the new kernel runs but kdump fails to start, often because the kdump initramfs wasn't rebuilt for the new kernel or the crashkernel reservation no longer fits. Even when a vmcore is captured, analysis stalls if the matching debug symbols package wasn't installed.

Fix: After every kernel update, verify that kdumpctl status reports operational and that the crash tool can open the matching vmlinux debug symbols. Include kdump verification in your kernel update runbook, and install the kernel-debuginfo package that matches the new kernel.
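A sketch of the debug-symbols half of that check. The `vmlinux` path below is the usual kernel-debuginfo layout on RHEL-family systems; the root-directory parameter and the kernel release strings in the test are assumptions added to keep the helper testable.

```shell
#!/bin/sh
# debuginfo_present KERNEL_RELEASE [ROOT]: exit 0 if the vmlinux with
# debug symbols for that kernel release exists under ROOT (default /).
debuginfo_present() {
    krel=$1; root=${2:-/}
    [ -e "${root%/}/usr/lib/debug/lib/modules/$krel/vmlinux" ]
}

# Hypothetical runbook usage:
#   debuginfo_present "$(uname -r)" || echo "install kernel-debuginfo-$(uname -r)"
```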

  1. Using echo c > /proc/sysrq-trigger outside a maintenance window. You're testing kdump and trigger a real kernel panic on a production box serving traffic. The system goes down. If kdump fails, you've caused an outage with no diagnostic benefit whatsoever.

Fix: Only test kdump crashes during scheduled maintenance windows with traffic drained. Use a staging or pre-production system for initial kdump testing. When testing on production, ensure the system is removed from the load balancer first.