
Kernel Troubleshooting


11 cards — 🟢 3 easy | 🟡 5 medium | 🔴 3 hard

🟢 Easy (3)

1. How do you view kernel messages with human-readable timestamps and filter for errors?

Show answer Use dmesg -T for human-readable timestamps. Filter for errors: dmesg -l err,crit,alert,emerg. Combine: dmesg -T -l err,crit,alert,emerg. Use dmesg -w to follow new messages in real-time. This is your first diagnostic command for system-level issues.

2. What patterns should you grep for in dmesg when troubleshooting hardware or system issues?

Show answer Hardware: "error|fault|fail|warn". Memory: "oom|out of memory|page allocation failure". Disk: "i/o error|medium error|sector|ata|scsi". Network: "link down|link up|carrier|dropped|reset". CPU: "mce|machine check|thermal|throttl". Filesystem: "ext4|xfs|corrupt|mount|remount".
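The pattern groups above can be wrapped in a small helper so they do not have to be retyped. This is a minimal sketch: the function name `dmesg_filter` and the category labels are made up for illustration, and the regular expressions are copied from the card.

```shell
# Print the lines from stdin that match one category of kernel-log patterns.
# The category names (hardware, memory, ...) are labels chosen for this sketch.
dmesg_filter() {
    case "$1" in
        hardware)   pattern='error|fault|fail|warn' ;;
        memory)     pattern='oom|out of memory|page allocation failure' ;;
        disk)       pattern='i/o error|medium error|sector|ata|scsi' ;;
        network)    pattern='link down|link up|carrier|dropped|reset' ;;
        cpu)        pattern='mce|machine check|thermal|throttl' ;;
        filesystem) pattern='ext4|xfs|corrupt|mount|remount' ;;
        *) echo "unknown category: $1" >&2; return 2 ;;
    esac
    # -E enables alternation with |, -i makes the match case-insensitive.
    grep -Ei "$pattern"
}

# Usage (needs permission to read the kernel log):
#   dmesg -T | dmesg_filter disk
```

The patterns are deliberately broad (for example, `ata` also matches `data`), so treat the output as a starting point for reading, not as a list of confirmed faults.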

3. What is the difference between a kernel oops and a kernel panic?

Show answer An oops is the kernel's report of a bug it detected in its own code, such as an invalid memory access. The kernel kills the offending process and usually keeps running, although in a degraded state and marked tainted. A panic is fatal: the kernel cannot continue, and the system halts or reboots. An oops escalates to a panic if the kernel.panic_on_oops sysctl is set to 1.

🟡 Medium (5)

1. What is kdump and how does it capture crash dumps during a kernel panic?

Show answer kdump reserves a small amount of memory at boot for a second (crash) kernel. During a panic, the system boots into the crash kernel via kexec, and the crash kernel writes the memory image of the failed kernel (vmcore) to disk, by default under /var/crash/. Setup: install kexec-tools (the package name on RHEL-family systems), enable the kdump service, and ensure the kernel command line contains a crashkernel= reservation such as crashkernel=256M. Without kdump, crash forensic evidence is lost.
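A quick way to confirm the reservation is to look for `crashkernel=` on the kernel command line. The helper below is a sketch; `crashkernel_value` is a hypothetical name, and it accepts an optional command-line string so it can be tried without touching the running system.

```shell
# Print the value of the crashkernel= parameter from a kernel command line.
# With no argument, read the running kernel's command line from /proc/cmdline.
crashkernel_value() {
    if [ $# -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    # Split on spaces, keep the first crashkernel= entry, strip the key.
    value=$(printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | head -n 1)
    if [ -n "$value" ]; then
        echo "$value"
    else
        return 1   # no reservation found
    fi
}

# Usage:
#   crashkernel_value || echo "no crashkernel= reservation on this boot"
```

On a running system, `cat /sys/kernel/kexec_crash_loaded` prints 1 once a crash kernel has actually been loaded, which is a useful second check after the service starts.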

2. What is the SysRq REISUB sequence, and when would you use it?

Show answer REISUB is a safe emergency reboot sequence when the system is hung: R (un-Raw keyboard), E (tErminate all, SIGTERM), I (kIll all, SIGKILL), S (Sync disks), U (Unmount/remount read-only), B (reBoot). Wait 2-5 seconds between each key. This is the cleanest reboot when nothing else works. Enable with kernel.sysrq=1 in /etc/sysctl.d/.

3. What does it mean when a kernel is "tainted," and why does it matter?

Show answer A tainted kernel is one that has recorded an event which makes its state less trustworthy for debugging, such as loading a proprietary module. Common taint flags: P (proprietary module such as nvidia), F (module force-loaded), W (kernel warning occurred), D (oops occurred), E (unsigned module). Check with cat /proc/sys/kernel/tainted, which prints a bitmask (0=clean). A tainted kernel may affect vendor support and whether bug reports are accepted.
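Because /proc/sys/kernel/tainted holds a bitmask, a non-zero value has to be decoded bit by bit. The sketch below covers only a subset of flags, using the bit positions from the kernel's tainted-kernels documentation (P=0, F=1, D=7, W=9, O=12, E=13); the function name `decode_taint` is made up for this example.

```shell
# Decode a taint bitmask into flag letters (subset of flags only).
# Prints "clean" for 0 and "other" if only undecoded bits are set.
decode_taint() {
    value=$1
    flags=""
    # bit:letter pairs; O (out-of-tree module) is included although the
    # card does not list it, because it is very common in practice.
    for pair in 0:P 1:F 7:D 9:W 12:O 13:E; do
        bit=${pair%%:*}
        letter=${pair##*:}
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            flags="$flags$letter"
        fi
    done
    if [ "$value" -eq 0 ]; then
        echo clean
    else
        echo "${flags:-other}"
    fi
}

# Usage:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```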

4. How does the OOM killer work, and how do you investigate OOM kill events?

Show answer When the system runs out of memory, the OOM killer selects and kills a process to free memory. Check for OOM kills: dmesg | grep -i "out of memory\|oom-killer\|killed process". Adjust OOM priority: echo -1000 > /proc/<pid>/oom_score_adj (never kill this process) or echo 1000 > /proc/<pid>/oom_score_adj (kill it first). The OOM killer is a symptom: fix the memory pressure, do not just restart the killed process.
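To see which processes were killed, the log lines can be reduced to PID and command name. This sketch assumes the common "Killed process <pid> (<name>)" wording; the exact message differs between kernel versions, so the pattern may need adjusting. `oom_victims` is a hypothetical helper name.

```shell
# Read kernel log text on stdin and print "PID NAME" for each OOM-killed process.
# Assumes log lines that contain: Killed process <pid> (<name>)
oom_victims() {
    sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Usage:
#   dmesg | oom_victims
```

A process that appears repeatedly in this output is a candidate for a memory limit or a leak investigation, not just a restart.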

5. What SysRq keys help debug hung systems without rebooting?

Show answer t = dump task states (debug hung processes), m = dump memory info (debug memory issues), w = dump blocked D-state tasks (debug I/O hangs), s = sync all filesystems, e = send SIGTERM to all processes. Access via Alt+SysRq+<key> on the console, or as root over a remote shell with echo <key> > /proc/sysrq-trigger. The output goes to the kernel log, so read it with dmesg. These work even when the system is mostly unresponsive.
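When triggering SysRq from a shell, a single mistyped letter can reboot or crash the machine, so it can help to restrict a wrapper to the read-only dump keys. The following is a sketch only: `sysrq_dump` and the `SYSRQ_TRIGGER` override variable are hypothetical names introduced here, and writing to the real trigger file requires root.

```shell
# Send one diagnostic SysRq key. Only the dump keys t, m and w are allowed;
# s and e are left out because they change system state.
sysrq_dump() {
    # SYSRQ_TRIGGER is an override for dry runs (an assumption of this sketch).
    target=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    case "$1" in
        t|m|w) printf '%s' "$1" > "$target" ;;
        *) echo "refusing SysRq key: $1" >&2; return 1 ;;
    esac
}

# Usage (as root), then read the result from the kernel log:
#   sysrq_dump w && dmesg | tail -n 50
```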

🔴 Hard (3)

1. How do you analyze a kernel crash dump using the crash utility, and how do you read a backtrace?

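Show answer Install crash and the kernel-debuginfo package that matches the kernel that crashed. Open dump: crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore (this path assumes the running kernel is the same version as the crashed one). Key commands: bt (backtrace), log (kernel log at crash time), ps (process list), sys (system info). Read backtraces bottom-up to follow the call order: the bottom frames show how the kernel was entered (system call or interrupt), the top frames are the panic and oops handling itself, and the function that actually faulted is usually in the frames between them.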

2. What are Machine Check Exceptions (MCEs), and how do you diagnose them?

Show answer MCEs are hardware errors reported by the CPU. Common causes: faulty RAM (diagnose with memtest86+), overheating CPU (check thermal sensors), or failing CPU (needs replacement). Check with dmesg | grep -i "machine check\|mce". Install mcelog for detailed analysis, or rasdaemon on newer distributions where mcelog is deprecated. MCEs indicate real hardware problems that cannot be fixed with software.

3. How do you configure kdump for remote crash dump storage, and what does the dump level (-d flag) control?

Show answer Configure in /etc/kdump.conf. For NFS: nfs server:/crash-dumps. For SSH: ssh user@server, plus an sshkey line giving the path to the private key. The -d flag in core_collector makedumpfile -d 31 is a bitmask of page types to exclude from the dump: 1=zero pages, 2=cache pages, 4=cache private, 8=user pages, 16=free pages. -d 31 is the sum of all five and excludes them all (smallest dump). Values with fewer bits set produce larger but more complete dumps.
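Since the dump level is a sum of bit values, any level can be broken back down into the page types it excludes. The helper below is a small sketch of that arithmetic; `dump_level_excludes` and the short page-type labels are names chosen for this example.

```shell
# List the page types that a makedumpfile dump level (-d N) excludes.
dump_level_excludes() {
    level=$1
    out=""
    # value:label pairs, matching the bit values listed on the card.
    for pair in 1:zero 2:cache 4:cache-private 8:user 16:free; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( level & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "${out:-nothing}"
}

# Usage:
#   dump_level_excludes 17
```

For example, level 17 (16 + 1) drops only free and zero pages, which keeps user and cache pages available for analysis at the cost of a larger vmcore.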