
Portal | Level: L3: Advanced | Topics: Kernel Troubleshooting, Linux Fundamentals, Filesystems & Storage | Domain: Linux

Kernel Troubleshooting - Primer

Why This Matters

The kernel is the layer between your applications and your hardware. When it misbehaves, everything above it — every process, every container, every database — is affected. Kernel issues manifest as mysterious hangs, random process deaths, hardware-related crashes, and "the machine is just weird" symptoms that application-level debugging can never explain.

Most engineers never look at dmesg until something is already on fire. This primer teaches you to read kernel messages, understand panics, capture crash dumps, and recover systems when the kernel itself is the problem.


The Kernel Ring Buffer (dmesg)

The kernel maintains an in-memory ring buffer of messages. This is your first stop for hardware issues, driver problems, and kernel-level errors.

Name origin: "dmesg" stands for "display message" (or "diagnostic message"). The ring buffer is literally a circular buffer in kernel memory -- when it fills up, new messages overwrite the oldest ones. Default size is typically 256 KB (configurable via log_buf_len boot parameter). On a busy system, early boot messages can be lost if you do not check quickly.
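Because messages are lost once the buffer wraps, it is worth knowing where else they live. On systemd machines, journald archives kernel messages (across boots, when persistent storage is enabled), and the boot command line shows whether a larger buffer was requested; a quick sketch:

```shell
# Kernel messages beyond the ring buffer, via journald (if present)
journalctl -k --no-pager | tail -5         # current boot
journalctl -k -b -1 --no-pager | tail -5   # previous boot (needs Storage=persistent)

# Was a larger ring buffer requested at boot?
grep -o 'log_buf_len=[^ ]*' /proc/cmdline || echo "log_buf_len not set (kernel default)"
```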

# Read the full ring buffer
dmesg

# With human-readable timestamps
dmesg -T

# Only errors and above
dmesg -l err,crit,alert,emerg

# Follow new messages in real-time
dmesg -w

# Clear and read (see only new messages)
dmesg -C && dmesg -w

Log Levels

 Level  Name      Meaning
 ─────  ────────  ──────────────────────────────────
 0      EMERG     System is unusable
 1      ALERT     Action must be taken immediately
 2      CRIT      Critical conditions
 3      ERR       Error conditions
 4      WARNING   Warning conditions
 5      NOTICE    Normal but significant
 6      INFO      Informational
 7      DEBUG     Debug-level messages
# Show only specific levels
dmesg -l warn         # Warnings only
dmesg -l err,crit     # Errors and criticals

What to Look For in dmesg

# Hardware errors
dmesg | grep -i -E "error|fault|fail|warn|oom|panic|bug"

# Memory issues
dmesg | grep -i -E "oom|out of memory|page allocation failure"

# Disk issues
dmesg | grep -i -E "i/o error|medium error|sector|ata|scsi"

# Network issues
dmesg | grep -i -E "link down|link up|carrier|dropped|reset"

# CPU/thermal
dmesg | grep -i -E "mce|machine check|thermal|throttl"

# Filesystem
dmesg | grep -i -E "ext4|xfs|corrupt|mount|remount"

Kernel Panic vs Oops

These are not the same thing, and the distinction matters:

 Aspect        Oops                              Panic
 ────────────  ────────────────────────────────  ──────────────────────────────────
 Severity      Error — kernel detected a bug     Fatal — kernel cannot continue
 System state  Usually keeps running (degraded)  System halts or reboots
 Recovery      May self-recover, process killed  Requires reboot
 Data risk     Low (affected process dies)       Medium (dirty buffers may be lost)
 Action        Investigate, may need reboot      Capture dump, investigate, reboot

 ┌─────────────────────────────────────────────┐
 │                Kernel Oops                    │
 │                                               │
 │  "BUG: unable to handle page fault at ..."   │
 │  - Stack trace printed to dmesg/console       │
 │  - Offending process killed                   │
 │  - Kernel continues (but may be unstable)     │
 │  - Kernel marked "tainted"                    │
 │                                               │
 │  IF panic_on_oops=1 → escalates to panic      │
 └─────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────┐
 │               Kernel Panic                    │
 │                                               │
 │  "Kernel panic - not syncing: ..."           │
 │  - System halts completely                    │
 │  - If kdump configured → crash dump captured  │
 │  - If panic=N set → auto-reboots after N sec  │
 │  - Console shows stack trace + register dump  │
 └─────────────────────────────────────────────┘
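Whether an oops escalates to a panic, and whether a panic auto-reboots, are runtime policies you can inspect; a quick check (the sysctl.d snippet below is illustrative, not a shipped default):

```shell
# 1 = every oops escalates to a full panic (common where kdump is configured)
cat /proc/sys/kernel/panic_on_oops

# N > 0 = auto-reboot N seconds after a panic; 0 = hang at the panic screen
cat /proc/sys/kernel/panic

# Illustrative persistent policy: panic on oops, reboot 10s later so
# kdump has a chance to write the vmcore first:
#   # /etc/sysctl.d/90-panic.conf
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```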

Tainted Kernels

When certain events occur, the kernel marks itself "tainted" to indicate it's no longer in a pristine state:

# Check taint status
cat /proc/sys/kernel/tainted
# 0 = clean
# Non-zero = tainted (each bit means something)

# Decode taint flags
dmesg | grep -i tainted

Common taint flags:

Bit  Flag  Meaning
───  ────  ───────────────────────────────────
 0    P    Proprietary module loaded (e.g., nvidia)
 1    F    Module force-loaded
 2    S    SMP kernel on officially non-SMP hardware
 6    U    User-requested taint (testing)
 7    D    Kernel died (oops or BUG occurred)
 8    A    ACPI table overridden
 9    W    Kernel warning (WARN) issued
12    O    Out-of-tree module loaded
13    E    Unsigned module loaded
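Since the taint value is a plain bitmask, you can decode it by hand. A minimal sketch (decode_taint is a hypothetical helper; the letters follow the kernel's documented bit order):

```shell
# Decode a taint bitmask into the kernel's single-letter flags (bits 0-15)
decode_taint() {
    value=$1
    bit=0
    out=""
    for f in P F S R M B U D A W C I O E L K; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="$out$f"
        fi
        bit=$((bit + 1))
    done
    echo "${out:-clean}"
}

decode_taint 4097    # → PO (proprietary module + out-of-tree module)
# Live system: decode_taint "$(cat /proc/sys/kernel/tainted)"
```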

Why it matters: When filing a bug report or getting vendor support, a tainted kernel may affect whether you get help. Proprietary drivers (nvidia, vmware) are the most common taint source.


kdump — Capturing Crash Dumps

When the kernel panics, everything in memory is lost — unless you've configured kdump. kdump reserves a small amount of memory for a second kernel that activates during a panic and writes the crash dump to disk.

 Normal Operation               Panic Occurs
 ┌──────────────┐              ┌──────────────┐
 │ Main Kernel   │              │ Main Kernel   │
 │               │   panic!     │   (dead)      │
 │ ┌──────────┐ │  ────────▶   │ ┌──────────┐ │
 │ │ Reserved  │ │              │ │ Crash     │ │
 │ │ Memory    │ │              │ │ Kernel    │ │
 │ │ (kdump)   │ │              │ │ (active!) │ │
 │ └──────────┘ │              │ └──────┬───┘ │
 └──────────────┘              └────────┼─────┘
                               Writes vmcore to
                               /var/crash/

Setting Up kdump

# RHEL/CentOS
yum install kexec-tools
systemctl enable kdump
systemctl start kdump

# Verify it's running
systemctl status kdump
kdumpctl status

# Check reserved memory (should show crashkernel=XXM in cmdline)
grep crashkernel /proc/cmdline
# crashkernel=256M  ← typical for systems with 4-64GB RAM

# If not set, add to GRUB:
grubby --update-kernel=ALL --args="crashkernel=256M"
# Reboot required for this to take effect

kdump Configuration

# /etc/kdump.conf
path /var/crash              # Where to save dumps
core_collector makedumpfile -l --message-level 7 -d 31
# -d 31 = dump level (what to exclude)
#   1  = zero pages
#   2  = cache pages
#   4  = cache private
#   8  = user pages
#  16  = free pages
#  31  = exclude all of the above (smallest dump)

Dump destinations:

# Local disk (default)
path /var/crash

# NFS
nfs nfs-server.example.com:/crash-dumps

# SSH
ssh user@crash-server.example.com
sshkey /root/.ssh/kdump_id_rsa
path /var/crash


crash — Analyzing Crash Dumps

The crash utility opens a vmcore and lets you examine the state of the kernel at the time of the panic:

# Install
yum install crash kernel-debuginfo

# Open a crash dump
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/127.0.0.1-*/vmcore

# Inside crash:
crash> bt           # Backtrace of the panicking task
crash> log          # Kernel log buffer at time of crash
crash> ps           # Process list at time of crash
crash> files <pid>  # Open files for a process
crash> vm <pid>     # Virtual memory layout
crash> sys          # System info (uptime, load, kernel version)
crash> mod          # Loaded modules
crash> kmem -i      # Memory usage summary
crash> exit

Reading a Backtrace

crash> bt
PID: 12345  TASK: ffff8881a2c3d000  CPU: 3  COMMAND: "nginx"
 #0 [ffff888198abfc50] machine_kexec at ffffffff8105c06a
 #1 [ffff888198abfca8] __crash_kexec at ffffffff8112b562
 #2 [ffff888198abfd70] panic at ffffffff81069e97
 #3 [ffff888198abfdf0] out_of_memory.cold at ffffffff812d6a22  ← root cause
 #4 [ffff888198abfe58] __alloc_pages_slowpath at ffffffff812a1b67
 #5 [ffff888198abff10] __alloc_pages at ffffffff812a1f2a

Read bottom-up: the highest-numbered frame (#5) is the earliest call, where the kernel was when the problem started. The top frames (#0 through #2) are just the panic machinery; the root cause is usually in the middle frames (here, out_of_memory).


SysRq Magic Keys

SysRq is a kernel-level escape hatch that works even when the system is mostly unresponsive:

# Enable SysRq (check current)
cat /proc/sys/kernel/sysrq
# 0 = disabled, 1 = all enabled, or a bitmask

# Enable all SysRq functions
echo 1 > /proc/sys/kernel/sysrq

# Persistent:
# /etc/sysctl.d/90-sysrq.conf
kernel.sysrq = 1

Key Combinations

On a physical console:  Alt + SysRq + <key>
Via /proc:              echo <key> > /proc/sysrq-trigger
Via IPMI/iLO/iDRAC:     send the keyboard sequence through the remote console

 Key  Action                                       Use Case
 ───  ───────────────────────────────────────────  ─────────────────────────
 b    Immediately reboot (no sync, no unmount)     Last resort
 s    Sync all filesystems                         Before emergency reboot
 u    Remount all filesystems read-only            After sync, before reboot
 e    Send SIGTERM to all processes (except init)  Graceful process shutdown
 i    Send SIGKILL to all processes (except init)  Force process shutdown
 o    Power off                                    When shutdown hangs
 t    Dump task states to console                  Debug hung processes
 m    Dump memory info to console                  Debug memory issues
 w    Dump blocked (D-state) tasks                 Debug I/O hangs

Remember: the mnemonic for the safe reboot sequence is REISUB -- "BUSIER" spelled backwards. Some people remember it as "Reboot Even If System Utterly Broken."

The REISUB Sequence (Safe Reboot)

When a system is completely hung, the safe reboot sequence is:

R — un-Raw (take keyboard back from X)
E — tErminate (SIGTERM all processes)
I — kIll (SIGKILL all processes)
S — Sync (flush disk buffers)
U — Unmount (remount read-only)
B — reBoot

Wait 2-5 seconds between each key. This is the cleanest possible reboot when nothing else works.

# Remote equivalent (via SSH to a still-responsive shell):
echo s > /proc/sysrq-trigger
sleep 2
echo u > /proc/sysrq-trigger
sleep 2
echo b > /proc/sysrq-trigger

Common Kernel Issues

OOM Killer

When the system runs out of memory, the OOM killer selects and kills a process:

# Check for OOM kills
dmesg | grep -i "out of memory\|oom-killer\|killed process"

# See which process was killed
dmesg | grep "Killed process" | tail -5
# Killed process 12345 (java) total-vm:8388608kB, anon-rss:7340032kB

# Adjust OOM score for a process (lower = less likely to be killed)
echo -1000 > /proc/<pid>/oom_score_adj   # Never kill (use carefully)
echo 1000 > /proc/<pid>/oom_score_adj    # Kill first

Under the hood: The OOM killer scores each process primarily by its memory footprint (RSS, swap, and page-table usage), adjusted by oom_score_adj (and, on older kernels, a discount for root-owned processes). The adjustment in /proc/<pid>/oom_score_adj ranges from -1000 (never kill) to 1000 (kill first), and the resulting live score is visible in /proc/<pid>/oom_score. The process with the highest score dies first. Setting oom_score_adj to -1000 makes a process immune (used for critical daemons like sshd so you can still log in during an OOM event).
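You can rank the current OOM-kill candidates directly from /proc; a rough sketch (the output format is my own):

```shell
# Print the five processes the OOM killer is most likely to pick,
# highest live oom_score first
for pid in /proc/[0-9]*; do
    score=$(cat "$pid/oom_score" 2>/dev/null) || continue
    printf '%s\t%s\t%s\n' "$score" "${pid##*/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -5
```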

Debug clue: After an OOM kill, check dmesg for the full report. It shows every candidate process's RSS and oom_score_adj at the moment of the kill, which tells you exactly what was consuming memory. The "Killed process" line names the victim. If you see order=0, the system needed a single page (4 KB on x86) and could not find one -- the system was truly exhausted.
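To pull the most recent report back out of the ring buffer, grep for the line that starts it with generous context (the context widths here are rough guesses, sized to cover the per-process table):

```shell
# The report starts at "<comm> invoked oom-killer" and ends after "Killed process"
dmesg | grep -B 2 -A 30 "invoked oom-killer" | tail -40
```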


Machine Check Exceptions (MCE)

Hardware errors reported by the CPU:

# Check for MCE events
dmesg | grep -i "machine check\|mce"

# Install mcelog for detailed analysis
yum install mcelog
mcelog --client

# Common MCE causes:
# - Faulty RAM (run memtest86+)
# - Overheating CPU (check thermal sensors)
# - Failing CPU (hardware replacement needed)

Key Takeaways

  1. dmesg -T -l err,crit,alert,emerg is your first diagnostic command for system-level issues.
  2. Oops = kernel bug, process dies, system continues (degraded). Panic = kernel dies, system halts.
  3. Configure kdump before you need it. A crash dump without kdump is lost forensic evidence.
  4. SysRq (REISUB) is your emergency reboot when nothing else works. Enable it in production.
  5. Read backtraces bottom-up. The root cause is in the middle, not at the top.
  6. The OOM killer is a symptom, not a cause. Fix the memory pressure, don't just restart the killed process.
  7. Tainted kernels may affect vendor support. Know your taint flags.
