
Portal | Level: L3: Advanced | Topics: Kernel Troubleshooting, Linux Fundamentals, Filesystems & Storage | Domain: Linux

Kernel Troubleshooting - Primer

Why This Matters

The kernel is the layer between your applications and your hardware. When it misbehaves, everything above it — every process, every container, every database — is affected. Kernel issues manifest as mysterious hangs, random process deaths, hardware-related crashes, and "the machine is just weird" symptoms that application-level debugging can never explain.

Most engineers never look at dmesg until something is already on fire. This primer teaches you to read kernel messages, understand panics, capture crash dumps, and recover systems when the kernel itself is the problem.


The Kernel Ring Buffer (dmesg)

The kernel maintains an in-memory ring buffer of messages. This is your first stop for hardware issues, driver problems, and kernel-level errors.

Name origin: "dmesg" stands for "display message" (or "diagnostic message"). The ring buffer is literally a circular buffer in kernel memory -- when it fills up, new messages overwrite the oldest ones. Default size is typically 256 KB (configurable via log_buf_len boot parameter). On a busy system, early boot messages can be lost if you do not check quickly.
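Because messages are lost once the buffer wraps, it is worth knowing where else they live. On systemd machines, journald archives kernel messages (across boots, when persistent storage is enabled), and the boot command line shows whether a larger buffer was requested; a quick sketch:

```shell
# Kernel messages beyond the ring buffer, via journald (if present)
journalctl -k --no-pager | tail -5         # current boot
journalctl -k -b -1 --no-pager | tail -5   # previous boot (needs Storage=persistent)

# Was a larger ring buffer requested at boot?
grep -o 'log_buf_len=[^ ]*' /proc/cmdline || echo "log_buf_len not set (kernel default)"
```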

# Read the full ring buffer
dmesg

# With human-readable timestamps
dmesg -T

# Only errors and above
dmesg -l err,crit,alert,emerg

# Follow new messages in real-time
dmesg -w

# Clear and read (see only new messages)
dmesg -C && dmesg -w

Log Levels

 Level  Name      Meaning
 ─────  ────────  ──────────────────────────────────
 0      EMERG     System is unusable
 1      ALERT     Action must be taken immediately
 2      CRIT      Critical conditions
 3      ERR       Error conditions
 4      WARNING   Warning conditions
 5      NOTICE    Normal but significant
 6      INFO      Informational
 7      DEBUG     Debug-level messages
# Show only specific levels
dmesg -l warn         # Warnings only
dmesg -l err,crit     # Errors and criticals

What to Look For in dmesg

# Hardware errors
dmesg | grep -i -E "error|fault|fail|warn|oom|panic|bug"

# Memory issues
dmesg | grep -i -E "oom|out of memory|page allocation failure"

# Disk issues
dmesg | grep -i -E "i/o error|medium error|sector|ata|scsi"

# Network issues
dmesg | grep -i -E "link down|link up|carrier|dropped|reset"

# CPU/thermal
dmesg | grep -i -E "mce|machine check|thermal|throttl"

# Filesystem
dmesg | grep -i -E "ext4|xfs|corrupt|mount|remount"

Kernel Panic vs Oops

These are not the same thing, and the distinction matters:

 Aspect        Oops                              Panic
 ────────────  ────────────────────────────────  ──────────────────────────────────
 Severity      Error — kernel detected a bug     Fatal — kernel cannot continue
 System state  Usually keeps running (degraded)  System halts or reboots
 Recovery      May self-recover, process killed  Requires reboot
 Data risk     Low (affected process dies)       Medium (dirty buffers may be lost)
 Action        Investigate, may need reboot      Capture dump, investigate, reboot

 ┌─────────────────────────────────────────────┐
 │                Kernel Oops                    │
 │                                               │
 │  "BUG: unable to handle page fault at ..."   │
 │  - Stack trace printed to dmesg/console       │
 │  - Offending process killed                   │
 │  - Kernel continues (but may be unstable)     │
 │  - Kernel marked "tainted"                    │
 │                                               │
 │  IF panic_on_oops=1 → escalates to panic      │
 └─────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────┐
 │               Kernel Panic                    │
 │                                               │
 │  "Kernel panic - not syncing: ..."           │
 │  - System halts completely                    │
 │  - If kdump configured → crash dump captured  │
 │  - If panic=N set → auto-reboots after N sec  │
 │  - Console shows stack trace + register dump  │
 └─────────────────────────────────────────────┘
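Whether an oops escalates to a panic, and whether a panic auto-reboots, are runtime policies you can inspect; a quick check (the sysctl.d snippet below is illustrative, not a shipped default):

```shell
# 1 = every oops escalates to a full panic (common where kdump is configured)
cat /proc/sys/kernel/panic_on_oops

# N > 0 = auto-reboot N seconds after a panic; 0 = hang at the panic screen
cat /proc/sys/kernel/panic

# Illustrative persistent policy: panic on oops, reboot 10s later so
# kdump has a chance to write the vmcore first:
#   # /etc/sysctl.d/90-panic.conf
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```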

Tainted Kernels

When certain events occur, the kernel marks itself "tainted" to indicate it's no longer in a pristine state:

# Check taint status
cat /proc/sys/kernel/tainted
# 0 = clean
# Non-zero = tainted (each bit means something)

# Decode taint flags
dmesg | grep -i tainted

Common taint flags:

Bit  Flag  Meaning
───  ────  ───────────────────────────────────
 0    P    Proprietary module loaded (e.g., nvidia)
 1    F    Module force-loaded
 2    S    SMP kernel on officially non-SMP hardware
 6    U    User-requested taint (testing)
 7    D    Kernel died (oops or BUG occurred)
 8    A    ACPI table overridden
 9    W    Kernel warning (WARN) issued
12    O    Out-of-tree module loaded
13    E    Unsigned module loaded
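Since the taint value is a plain bitmask, you can decode it by hand. A minimal sketch (decode_taint is a hypothetical helper; the letters follow the kernel's documented bit order):

```shell
# Decode a taint bitmask into the kernel's single-letter flags (bits 0-15)
decode_taint() {
    value=$1
    bit=0
    out=""
    for f in P F S R M B U D A W C I O E L K; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="$out$f"
        fi
        bit=$((bit + 1))
    done
    echo "${out:-clean}"
}

decode_taint 4097    # → PO (proprietary module + out-of-tree module)
# Live system: decode_taint "$(cat /proc/sys/kernel/tainted)"
```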

Why it matters: When filing a bug report or getting vendor support, a tainted kernel may affect whether you get help. Proprietary drivers (nvidia, vmware) are the most common taint source.


kdump — Capturing Crash Dumps

When the kernel panics, everything in memory is lost — unless you've configured kdump. kdump reserves a small amount of memory for a second kernel that activates during a panic and writes the crash dump to disk.

 Normal Operation               Panic Occurs
 ┌──────────────┐              ┌──────────────┐
 │ Main Kernel   │              │ Main Kernel   │
 │               │   panic!     │   (dead)      │
 │ ┌──────────┐ │  ────────▶   │ ┌──────────┐ │
 │ │ Reserved  │ │              │ │ Crash     │ │
 │ │ Memory    │ │              │ │ Kernel    │ │
 │ │ (kdump)   │ │              │ │ (active!) │ │
 │ └──────────┘ │              │ └──────┬───┘ │
 └──────────────┘              └────────┼─────┘
                               Writes vmcore to
                               /var/crash/

Setting Up kdump

# RHEL/CentOS
yum install kexec-tools
systemctl enable kdump
systemctl start kdump

# Verify it's running
systemctl status kdump
kdumpctl status

# Check reserved memory (should show crashkernel=XXM in cmdline)
grep crashkernel /proc/cmdline
# crashkernel=256M  ← typical for systems with 4-64GB RAM

# If not set, add to GRUB:
grubby --update-kernel=ALL --args="crashkernel=256M"
# Reboot required for this to take effect

kdump Configuration

# /etc/kdump.conf
path /var/crash              # Where to save dumps
core_collector makedumpfile -l --message-level 7 -d 31
# -d 31 = dump level (what to exclude)
#   1  = zero pages
#   2  = cache pages
#   4  = cache private
#   8  = user pages
#  16  = free pages
#  31  = exclude all of the above (smallest dump)

Dump destinations:

# Local disk (default)
path /var/crash

# NFS
nfs nfs-server.example.com:/crash-dumps

# SSH
ssh user@crash-server.example.com
sshkey /root/.ssh/kdump_id_rsa
path /var/crash


crash — Analyzing Crash Dumps

The crash utility opens a vmcore and lets you examine the state of the kernel at the time of the panic:

# Install
yum install crash kernel-debuginfo

# Open a crash dump
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/127.0.0.1-*/vmcore

# Inside crash:
crash> bt           # Backtrace of the panicking task
crash> log          # Kernel log buffer at time of crash
crash> ps           # Process list at time of crash
crash> files <pid>  # Open files for a process
crash> vm <pid>     # Virtual memory layout
crash> sys          # System info (uptime, load, kernel version)
crash> mod          # Loaded modules
crash> kmem -i      # Memory usage summary
crash> exit

Reading a Backtrace

crash> bt
PID: 12345  TASK: ffff8881a2c3d000  CPU: 3  COMMAND: "nginx"
 #0 [ffff888198abfc50] machine_kexec at ffffffff8105c06a
 #1 [ffff888198abfca8] __crash_kexec at ffffffff8112b562
 #2 [ffff888198abfd70] panic at ffffffff81069e97
 #3 [ffff888198abfdf0] out_of_memory.cold at ffffffff812d6a22  ← root cause
 #4 [ffff888198abfe58] __alloc_pages_slowpath at ffffffff812a1b67
 #5 [ffff888198abff10] __alloc_pages at ffffffff812a1f2a

Read bottom-up: the highest-numbered frame (#5) is the earliest call, where the kernel was when the problem started. The top frames (#0 through #2) are just the panic machinery; the root cause is usually in the middle frames (here, out_of_memory).


SysRq Magic Keys

SysRq is a kernel-level escape hatch that works even when the system is mostly unresponsive:

# Enable SysRq (check current)
cat /proc/sys/kernel/sysrq
# 0 = disabled, 1 = all enabled, or a bitmask

# Enable all SysRq functions
echo 1 > /proc/sys/kernel/sysrq

# Persistent:
# /etc/sysctl.d/90-sysrq.conf
kernel.sysrq = 1

Key Combinations

On a physical console:  Alt + SysRq + <key>
Via /proc:              echo <key> > /proc/sysrq-trigger
Via IPMI/iLO/iDRAC:     send the keyboard sequence through the remote console

 Key  Action                                       Use Case
 ───  ───────────────────────────────────────────  ─────────────────────────
 b    Immediately reboot (no sync, no unmount)     Last resort
 s    Sync all filesystems                         Before emergency reboot
 u    Remount all filesystems read-only            After sync, before reboot
 e    Send SIGTERM to all processes (except init)  Graceful process shutdown
 i    Send SIGKILL to all processes (except init)  Force process shutdown
 o    Power off                                    When shutdown hangs
 t    Dump task states to console                  Debug hung processes
 m    Dump memory info to console                  Debug memory issues
 w    Dump blocked (D-state) tasks                 Debug I/O hangs

Remember: the mnemonic for the safe reboot sequence is REISUB -- "BUSIER" spelled backwards. Some people remember it as "Reboot Even If System Utterly Broken."

The REISUB Sequence (Safe Reboot)

When a system is completely hung, the safe reboot sequence is:

R — un-Raw (take keyboard back from X)
E — tErminate (SIGTERM all processes)
I — kIll (SIGKILL all processes)
S — Sync (flush disk buffers)
U — Unmount (remount read-only)
B — reBoot

Wait 2-5 seconds between each key. This is the cleanest possible reboot when nothing else works.

# Remote equivalent (via SSH to a still-responsive shell):
echo s > /proc/sysrq-trigger
sleep 2
echo u > /proc/sysrq-trigger
sleep 2
echo b > /proc/sysrq-trigger

Common Kernel Issues

OOM Killer

When the system runs out of memory, the OOM killer selects and kills a process:

# Check for OOM kills
dmesg | grep -i "out of memory\|oom-killer\|killed process"

# See which process was killed
dmesg | grep "Killed process" | tail -5
# Killed process 12345 (java) total-vm:8388608kB, anon-rss:7340032kB

# Adjust OOM score for a process (lower = less likely to be killed)
echo -1000 > /proc/<pid>/oom_score_adj   # Never kill (use carefully)
echo 1000 > /proc/<pid>/oom_score_adj    # Kill first

Under the hood: The OOM killer scores each process primarily by its memory footprint (RSS, swap, and page-table usage), adjusted by oom_score_adj (and, on older kernels, a discount for root-owned processes). The adjustment in /proc/<pid>/oom_score_adj ranges from -1000 (never kill) to 1000 (kill first), and the resulting live score is visible in /proc/<pid>/oom_score. The process with the highest score dies first. Setting oom_score_adj to -1000 makes a process immune (used for critical daemons like sshd so you can still log in during an OOM event).
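You can rank the current OOM-kill candidates directly from /proc; a rough sketch (the output format is my own):

```shell
# Print the five processes the OOM killer is most likely to pick,
# highest live oom_score first
for pid in /proc/[0-9]*; do
    score=$(cat "$pid/oom_score" 2>/dev/null) || continue
    printf '%s\t%s\t%s\n' "$score" "${pid##*/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -5
```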

Debug clue: After an OOM kill, check dmesg for the full report. It shows every candidate process's RSS and oom_score_adj at the moment of the kill, which tells you exactly what was consuming memory. The "Killed process" line names the victim. If you see order=0, the system needed a single page (4 KB on x86) and could not find one -- the system was truly exhausted.
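To pull the most recent report back out of the ring buffer, grep for the line that starts it with generous context (the context widths here are rough guesses, sized to cover the per-process table):

```shell
# The report starts at "<comm> invoked oom-killer" and ends after "Killed process"
dmesg | grep -B 2 -A 30 "invoked oom-killer" | tail -40
```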


Machine Check Exceptions (MCE)

Hardware errors reported by the CPU:

# Check for MCE events
dmesg | grep -i "machine check\|mce"

# Install mcelog for detailed analysis
yum install mcelog
mcelog --client

# Common MCE causes:
# - Faulty RAM (run memtest86+)
# - Overheating CPU (check thermal sensors)
# - Failing CPU (hardware replacement needed)

Key Takeaways

  1. dmesg -T -l err,crit,alert,emerg is your first diagnostic command for system-level issues.
  2. Oops = kernel bug, process dies, system continues (degraded). Panic = kernel dies, system halts.
  3. Configure kdump before you need it. A crash dump without kdump is lost forensic evidence.
  4. SysRq (REISUB) is your emergency reboot when nothing else works. Enable it in production.
  5. Read backtraces bottom-up. The root cause is in the middle, not at the top.
  6. The OOM killer is a symptom, not a cause. Fix the memory pressure, don't just restart the killed process.
  7. Tainted kernels may affect vendor support. Know your taint flags.
