
Kernel Troubleshooting - Street-Level Ops

Quick Diagnosis Commands

# ── Kernel Version and Status ──
uname -r                                  # Running kernel version
cat /proc/version                         # Full kernel version string
cat /proc/sys/kernel/tainted              # Taint status (0 = clean)
uptime                                    # System uptime + load averages

# ── Recent Kernel Messages ──
dmesg -T -l err,crit,alert,emerg | tail -30    # Recent errors
dmesg -T | tail -50                             # Last 50 messages (all levels)
journalctl -k --since "1 hour ago" --no-pager   # Kernel journal (systemd)

# ── Memory State ──
free -h                                   # Memory summary
cat /proc/meminfo | head -20              # Detailed memory info
cat /proc/buddyinfo                       # Memory fragmentation
slabtop -o -s c | head -20                # Top kernel slab consumers

# ── Process State ──
ps aux --sort=-%mem | head -10            # Top memory consumers
ps aux --sort=-%cpu | head -10            # Top CPU consumers
cat /proc/loadavg                         # Load averages + running tasks

# ── Blocked Processes (D-state) ──
ps aux | awk '$8 ~ /D/ {print}'           # Processes in uninterruptible sleep
cat /proc/<pid>/stack                     # Kernel stack of a stuck process

# ── Hardware Errors ──
dmesg | grep -i -E "mce|machine check|hardware error"
mcelog --client 2>/dev/null               # MCE log (if mcelog installed)
ras-mc-ctl --errors 2>/dev/null           # Hardware errors recorded by rasdaemon

# ── kdump Status ──
systemctl status kdump                    # kdump service status
kdumpctl status                           # Detailed kdump status
ls /var/crash/                            # Previous crash dumps
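The quick checks above can be wrapped into a single paste-into-terminal triage helper. A minimal sketch (`kernel_triage` is a hypothetical name; it tolerates missing tools and the unprivileged `dmesg` failure you get when `kernel.dmesg_restrict=1`):

```shell
# One-shot kernel health triage (hypothetical helper).
# Falls back gracefully where tools are missing or dmesg is restricted.
kernel_triage() {
  echo "== Kernel ==";  uname -r
  echo "== Tainted =="; cat /proc/sys/kernel/tainted 2>/dev/null || echo "n/a"
  echo "== Load ==";    cat /proc/loadavg 2>/dev/null || uptime
  echo "== Memory ==";  free -h 2>/dev/null || head -3 /proc/meminfo
  echo "== Recent kernel errors =="
  # dmesg exits non-zero when kernel.dmesg_restrict=1 and we're not root
  if kmsg=$(dmesg -T -l err,crit,alert,emerg 2>/dev/null); then
    printf '%s\n' "$kmsg" | tail -10
  else
    echo "(dmesg restricted - run as root)"
  fi
}
kernel_triage
```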

Pattern: dmesg Pattern Matching for Common Issues

# OOM kills
dmesg -T | grep -A5 "Out of memory"
# Look for: which process was killed, how much memory it was using

# Disk I/O errors (bad disk, failing controller)
dmesg -T | grep -i -E "i/o error|medium error|sector|ata.*error|scsi.*error"
# Look for: device names (sda, sdb), sector numbers (repeated = bad block)

# Filesystem errors
dmesg -T | grep -i -E "ext4.*error|xfs.*error|corrupt|read-only"
# Look for: "Remounting filesystem read-only" = filesystem detected corruption

# Network driver issues
dmesg -T | grep -i -E "eth0|ens|link is down|carrier|watchdog|reset"
# Look for: repeated "link down/up" = cable or switch issue

# USB/hardware changes
dmesg -T | grep -i -E "usb|new device|disconnect"

# Thermal issues
dmesg -T | grep -i -E "thermal|throttl|temperature"
# CPU throttling = overheating = check fans, airflow, ambient temp

Debug clue: If dmesg shows soft lockup messages, the CPU was stuck in kernel code for more than the watchdog threshold (default 20 seconds). In VMs this is often a hypervisor scheduling stall, not a real kernel bug. Check if the host is oversubscribed before blaming the kernel module.

# Soft lockups (CPU stuck in kernel code)
dmesg -T | grep -i "soft lockup"
# Usually indicates driver bug or heavy I/O under load

# RCU stalls (kernel synchronization delays)
dmesg -T | grep -i "rcu.*stall"
# Often caused by: high IRQ load, buggy kernel module, hypervisor issues
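The soft-lockup threshold mentioned above is derived from `kernel.watchdog_thresh`: the soft lockup warning fires at twice that value (default 10s, hence 20s). A quick check, with a fallback to the usual default since the file may be absent in containers or with the watchdog disabled:

```shell
# Soft lockup warning threshold = 2 * watchdog_thresh (default 10s -> 20s)
thresh=$(cat /proc/sys/kernel/watchdog_thresh 2>/dev/null || echo 10)
echo "soft lockup fires after ~$((thresh * 2))s stuck in kernel mode"
```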

Gotcha: "Remounting Filesystem Read-Only"

You see this in dmesg and suddenly applications can't write:

# Diagnose
dmesg -T | grep -i "remount"
# EXT4-fs (sda1): Remounting filesystem read-only

# This means the filesystem detected corruption and protected itself
# Check the disk
smartctl -a /dev/sda | grep -E "Reallocated|Current_Pending|Offline_Uncorrectable"

# If SMART is OK, it may be a filesystem-level issue
# For ext4:
# DO NOT run fsck on a mounted filesystem
# Schedule a check on next reboot:
touch /forcefsck        # sysvinit; systemd also honors this for compatibility
# On systemd you can instead add fsck.mode=force to the kernel command line
reboot

# For XFS (xfs_repair requires the filesystem to be unmounted, even for -n):
umount /dev/sda1
xfs_repair -n /dev/sda1   # -n = dry run (check only, no modifications)
# If errors are found:
xfs_repair /dev/sda1
mount /dev/sda1
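To find out quickly whether any filesystem has already flipped to read-only, scan /proc/mounts. A sketch; it restricts itself to /dev-backed devices because containers routinely mount pseudo-filesystems read-only on purpose:

```shell
# List /dev-backed mounts whose mount options include "ro"
ro_mounts() {
  awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ {print $2 " (" $1 ")"}' /proc/mounts
}
ro_mounts
echo "ro /dev-backed mounts: $(ro_mounts | wc -l)"
```

A non-zero count on a box that should be all read-write means some filesystem has protected itself and needs the SMART/fsck workflow above.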

Pattern: kdump Capture and Analysis Workflow

# === After a panic, if kdump was configured ===

# 1. Find the crash dump
ls -la /var/crash/
# 127.0.0.1-2026-03-15-03:42:17/
#   vmcore          ← the crash dump
#   vmcore-dmesg.txt ← kernel log at time of crash

# 2. Read the dmesg from the crash (no crash tool needed)
cat /var/crash/127.0.0.1-*/vmcore-dmesg.txt | tail -50
# Often the root cause is visible here

# 3. For deeper analysis, use crash
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
  /var/crash/127.0.0.1-*/vmcore

# Inside crash:
crash> log                # Full kernel log
crash> bt                 # Backtrace of panicking task
crash> bt -a              # Backtrace of ALL CPUs
crash> ps -m              # Process list with memory usage
crash> sys                # System info
crash> mod                # Loaded modules at crash time

# 4. Common patterns to check
crash> log | grep -i panic   # What triggered the panic
crash> log | grep -i oom     # OOM kill involved?
crash> log | grep -i bug     # Kernel BUG() hit?
crash> bt | grep -i "module_name"  # Was a specific module involved?
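Steps 1-2 above can be folded into a small helper that finds the newest dump directory and greps its log for the usual suspects. A sketch (`latest_crash_summary` is a hypothetical name; the layout follows default kdump naming, so adjust the base path if your kdump.conf writes elsewhere):

```shell
# Summarize the newest crash dump under a base directory (default /var/crash)
latest_crash_summary() {
  base=${1:-/var/crash}
  dir=$(ls -1dt "$base"/*/ 2>/dev/null | head -1)
  if [ -z "$dir" ]; then
    echo "no crash dumps under $base"
    return 1
  fi
  echo "Latest dump: $dir"
  # vmcore-dmesg.txt is written by kdump alongside the vmcore itself
  grep -i -E "panic|oops|bug:" "$dir/vmcore-dmesg.txt" 2>/dev/null | tail -5 \
    || echo "(no panic/oops/BUG lines found)"
}
latest_crash_summary || true
```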

Gotcha: D-State Processes (Uninterruptible Sleep)

Processes in D-state can't be killed (not even with SIGKILL). They're waiting for I/O that isn't completing.

# Find D-state processes
ps aux | awk '$8 ~ /D/'

# See what they're waiting on
cat /proc/<pid>/stack
# Typical output:
# [<0>] nfs4_wait_clnt_recover+0x32/0x50 [nfsv4]
# ← this tells you it's waiting on NFS recovery

# Common D-state causes:
# - NFS server unreachable (most common)
# - Disk I/O failure (bad disk, controller issue)
# - Kernel driver bug
# - iSCSI target unreachable

# For NFS-related D-state:
# Check NFS server connectivity
showmount -e nfs-server
# Check mount status
mount | grep nfs
# If NFS server is down, you may need to force-unmount
umount -f /mnt/nfs-share    # Force unmount
umount -l /mnt/nfs-share    # Lazy unmount (detach now, clean up later)
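D-state is only a problem when a process *stays* there, so sample it a few times; entries that persist across snapshots are the real stalls. A sketch assuming procps-style `ps` (the `wchan` column shows the kernel function the task is waiting in, a cheaper first look than /proc/<pid>/stack):

```shell
# Sample D-state processes a few times; persistent entries are the real stalls
sample_dstate() {
  rounds=${1:-3}
  i=0
  while [ "$i" -lt "$rounds" ]; do
    echo "--- sample $((i + 1)) ---"
    # stat beginning with D = uninterruptible sleep; wchan = kernel wait point
    ps -eo pid,stat,wchan:30,comm --no-headers 2>/dev/null | awk '$2 ~ /^D/'
    i=$((i + 1))
    if [ "$i" -lt "$rounds" ]; then sleep 1; fi
  done
}
sample_dstate 2
```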

Pattern: SysRq Remote Recovery

When a system is hung but you have IPMI/iLO/iDRAC/SSH access:

# Via SSH (if SSH is still responsive but the system is otherwise hung)
# Step 1: Dump task states to understand the hang
echo t > /proc/sysrq-trigger
dmesg | tail -100  # Read the task dump

# Step 2: Dump blocked (D-state) tasks specifically
echo w > /proc/sysrq-trigger
dmesg | tail -50

# Step 3: If you need to reboot, do it safely
echo s > /proc/sysrq-trigger  # Sync
sleep 5
echo u > /proc/sysrq-trigger  # Remount read-only
sleep 2
echo b > /proc/sysrq-trigger  # Reboot

# Via IPMI (if SSH is dead)
ipmitool -I lanplus -H bmc-ip -U admin -P password chassis power cycle
# Last resort — no sync/unmount possible via IPMI power cycle
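The SysRq triggers above only work if the feature is enabled: `kernel.sysrq` is 1 (everything enabled), 0 (disabled), or a bitmask of allowed functions. A pre-flight check, leaving the bitmask meanings to the kernel docs:

```shell
# Verify SysRq is usable before relying on /proc/sysrq-trigger
sysrq=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo "unknown")
case "$sysrq" in
  1)       echo "SysRq: all functions enabled" ;;
  0)       echo "SysRq: disabled - enable with: sysctl -w kernel.sysrq=1" ;;
  unknown) echo "SysRq: /proc/sys/kernel/sysrq not readable here" ;;
  *)       echo "SysRq: bitmask $sysrq (see Documentation/admin-guide/sysrq.rst)" ;;
esac
```

Run this during onboarding, not during the outage: discovering SysRq is disabled while the box is hung defeats the purpose.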

Pattern: Kernel Module Debugging

When you suspect a kernel module is causing issues:

# List loaded modules
lsmod

# Module details
modinfo <module_name>

# Module parameters (live)
systool -vm <module_name>

# Module-specific dmesg messages
dmesg | grep -i <module_name>

# Unload a problematic module (if not in use)
modprobe -r <module_name>

# Blacklist a module (blacklist only stops loading by alias; the install
# override also blocks explicit modprobe)
echo "blacklist <module_name>" > /etc/modprobe.d/blacklist-problem.conf
echo "install <module_name> /bin/false" >> /etc/modprobe.d/blacklist-problem.conf
# Rebuild initramfs
dracut -f  # RHEL/CentOS
update-initramfs -u  # Debian/Ubuntu

# Load a module with debug parameters
modprobe <module_name> debug=1
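Live module parameters are also exposed under /sys/module/<name>/parameters, which works even where systool isn't installed. A helper sketch (`module_params` is a hypothetical name; `loop` below is just an example module):

```shell
# Print live parameters for a loaded module straight from sysfs
module_params() {
  dir="/sys/module/$1/parameters"
  if [ ! -d "$dir" ]; then
    echo "module $1: not loaded, or exposes no parameters"
    return 1
  fi
  for p in "$dir"/*; do
    [ -e "$p" ] || continue
    # some parameters are write-only or root-only, hence the fallback
    printf '%s = %s\n' "${p##*/}" "$(cat "$p" 2>/dev/null || echo '<unreadable>')"
  done
}
module_params loop || true   # example: the loop block driver, if loaded
```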

Gotcha: Kernel Panic Auto-Reboot Loop

The system panics, reboots (panic=N is set), hits the same bug, panics again. Infinite loop. You can't SSH in because it reboots every 30 seconds.

# At GRUB menu (via IPMI console):
# Edit kernel line, add:
panic=0
# This makes the system HALT on panic instead of reboot
# Now you can read the panic message on the console

# If the panic is caused by a module:
# Add to kernel line:
modprobe.blacklist=problematic_module

# If the panic is during boot:
# Boot into rescue mode from GRUB
# Or add: systemd.unit=rescue.target

# If you can't get to GRUB:
# Boot from rescue media (installation ISO in rescue mode)

Pattern: OOM Kill Investigation

# 1. Find recent OOM kills
dmesg -T | grep -B5 -A15 "Out of memory"

# Key lines to look for:
# "Out of memory: Kill process 12345 (java) score 820 or sacrifice child"
# "Killed process 12345 (java) total-vm:8388608kB, anon-rss:7340032kB"
# ^ total-vm = virtual, anon-rss = actual physical memory used

Under the hood: The OOM killer scores each process by its memory footprint (RSS plus swap plus page tables) relative to total memory, then adds oom_score_adj. A process with oom_score_adj=-1000 is immune; +1000 is always killed first. Kubernetes sets oom_score_adj by QoS class: Guaranteed pods get -997, BestEffort pods get 1000, and Burstable pods get a value scaled by the ratio of memory request to node capacity. This is why BestEffort pods die first.
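You can watch the adjustment work from an unprivileged shell: raising your own oom_score_adj needs no privileges (only lowering it does, hence the guarded write below). A sketch:

```shell
# Raise this shell's oom_score_adj and watch oom_score follow.
# /proc/self resolves per process, but children inherit the adj on fork,
# so the cat subshells below reflect the value the shell just set.
echo "adj before: $(cat /proc/self/oom_score_adj)"
echo 500 > /proc/self/oom_score_adj 2>/dev/null \
  || echo "(raising oom_score_adj not permitted in this environment)"
echo "adj after:  $(cat /proc/self/oom_score_adj)"
echo "oom_score:  $(cat /proc/self/oom_score)"  # roughly adj + memory-usage term
```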

# 2. Check what triggered OOM (memory allocation that failed)
dmesg -T | grep "invoked oom-killer"
# "nginx invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)"

# 3. Check current memory pressure
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Committed"

# 4. Find current top memory consumers
ps aux --sort=-%mem | head -10

# 5. Check OOM scores (higher = more likely to be killed)
for pid in $(ls /proc/ | grep -E '^[0-9]+$'); do
  name=$(cat /proc/$pid/comm 2>/dev/null)
  score=$(cat /proc/$pid/oom_score 2>/dev/null)
  [ -n "$score" ] && [ "$score" -gt 100 ] && echo "$pid $name $score"
done | sort -k3 -rn | head -10

# 6. Protect critical processes
echo -1000 > /proc/$(pgrep -o -f "critical-process")/oom_score_adj  # -o = oldest match only

Gotcha: Wrong crashkernel Size

kdump is configured but fails to capture a dump because the reserved memory is too small for the crash kernel to boot.

# Check current reservation
grep -o 'crashkernel=[^ ]*' /proc/cmdline
cat /proc/iomem | grep -i crash

# Recommended sizes:
# < 4GB RAM:    crashkernel=128M
# 4-64GB RAM:   crashkernel=256M
# 64-1TB RAM:   crashkernel=512M
# > 1TB RAM:    crashkernel=1G
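The sizing table above translates directly into a check against /proc/meminfo. A sketch using those rule-of-thumb sizes (your distro's kdump docs may recommend different values, and RHEL also supports crashkernel=auto):

```shell
# Suggest a crashkernel reservation from installed RAM
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_gb=$((mem_kb / 1024 / 1024))
if   [ "$mem_gb" -lt 4 ];    then rec=128M
elif [ "$mem_gb" -lt 64 ];   then rec=256M
elif [ "$mem_gb" -lt 1024 ]; then rec=512M
else                              rec=1G
fi
echo "RAM: ~${mem_gb}G -> suggested crashkernel=${rec}"
```

Compare the suggestion against what `grep -o 'crashkernel=[^ ]*' /proc/cmdline` reports; a reservation below the table is a common reason capture silently fails.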

# Test kdump capture (WILL CRASH THE SYSTEM)
# Only do this during a maintenance window!
echo c > /proc/sysrq-trigger
# System will panic, kdump should capture, then reboot
# Check /var/crash/ after reboot

Pattern: Monitoring Kernel Health Proactively

# Add to monitoring/cron:

# Check for new dmesg errors every 5 minutes (a crontab entry must be one line)
*/5 * * * * dmesg -T -l err,crit,alert,emerg | tail -5 | logger -t kernel-monitor

# Check for taint
cat /proc/sys/kernel/tainted  # Alert if non-zero unexpectedly

# Check for soft lockups
dmesg | grep -c "soft lockup"  # Alert if > 0

# Check for OOM kills
dmesg | grep -c "Killed process"  # Alert if > 0

# Check kdump readiness
kdumpctl status | grep -q "operational" || echo "KDUMP NOT READY"
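The cron one-liner above re-logs the same tail on every run. A stateful variant that only reports lines added since the previous invocation, sketched with a hypothetical state-file location (it tracks a line count, so it resets harmlessly after a reboot or ring-buffer wrap):

```shell
# Report only kernel error lines that appeared since the previous invocation
kmon() {
  state=${1:-/tmp/kernel-monitor.count}   # hypothetical state file location
  cur=$(dmesg -l err,crit,alert,emerg 2>/dev/null | wc -l)
  prev=$(cat "$state" 2>/dev/null)
  prev=${prev:-0}
  if [ "$cur" -gt "$prev" ]; then
    echo "$((cur - prev)) new kernel error line(s):"
    dmesg -T -l err,crit,alert,emerg 2>/dev/null | tail -n "$((cur - prev))"
    # when run from cron, pipe the tail into: logger -t kernel-monitor
  fi
  echo "$cur" > "$state"
}
kmon
```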