Kernel Troubleshooting - Street-Level Ops¶
Quick Diagnosis Commands¶
# ── Kernel Version and Status ──
uname -r # Running kernel version
cat /proc/version # Full kernel version string
cat /proc/sys/kernel/tainted # Taint status (0 = clean)
uptime # System uptime + load averages
# ── Recent Kernel Messages ──
dmesg -T -l err,crit,alert,emerg | tail -30 # Recent errors
dmesg -T | tail -50 # Last 50 messages (all levels)
journalctl -k --since "1 hour ago" --no-pager # Kernel journal (systemd)
# ── Memory State ──
free -h # Memory summary
cat /proc/meminfo | head -20 # Detailed memory info
cat /proc/buddyinfo # Memory fragmentation
slabtop -o -s c | head -20 # Top kernel slab consumers
# ── Process State ──
ps aux --sort=-%mem | head -10 # Top memory consumers
ps aux --sort=-%cpu | head -10 # Top CPU consumers
cat /proc/loadavg # Load averages + running tasks
# ── Blocked Processes (D-state) ──
ps aux | awk '$8 ~ /D/ {print}' # Processes in uninterruptible sleep
cat /proc/<pid>/stack # Kernel stack of a stuck process
# ── Hardware Errors ──
dmesg | grep -i -E "mce|machine check|hardware error"
mcelog --client 2>/dev/null # MCE log (if mcelog installed)
ras-mc-ctl --errors 2>/dev/null # Recorded hardware errors (if rasdaemon is running)
# ── kdump Status ──
systemctl status kdump # kdump service status
kdumpctl status # Detailed kdump status
ls /var/crash/ # Previous crash dumps
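A nonzero taint value is a bitmask; decoding it tells you *why* the kernel is tainted. A minimal sketch — flag letters follow the kernel's tainted-kernels documentation, common bits only:

```shell
#!/bin/sh
# Decode /proc/sys/kernel/tainted into flag letters.
# Bit order per the kernel's tainted-kernels doc: e.g. bit 0 = P
# (proprietary module), bit 9 = W (kernel warning), bit 12 = O
# (out-of-tree module), bit 13 = E (unsigned module).
decode_taint() {
    val=$1
    flags=""
    bit=0
    for letter in P F S R M B U D A W C I O E L K; do
        [ $(( (val >> bit) & 1 )) -eq 1 ] && flags="$flags$letter"
        bit=$((bit + 1))
    done
    echo "${flags:-clean}"
}

[ -r /proc/sys/kernel/tainted ] && decode_taint "$(cat /proc/sys/kernel/tainted)" || true
```

A value of 4097 decodes to `PO`: a proprietary, out-of-tree module is loaded — the first thing a kernel bug report will ask about.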
Pattern: dmesg Pattern Matching for Common Issues¶
# OOM kills
dmesg -T | grep -A5 "Out of memory"
# Look for: which process was killed, how much memory it was using
# Disk I/O errors (bad disk, failing controller)
dmesg -T | grep -i -E "i/o error|medium error|sector|ata.*error|scsi.*error"
# Look for: device names (sda, sdb), sector numbers (repeated = bad block)
# Filesystem errors
dmesg -T | grep -i -E "ext4.*error|xfs.*error|corrupt|read-only"
# Look for: "Remounting filesystem read-only" = filesystem detected corruption
# Network driver issues
dmesg -T | grep -i -E "eth0|ens|link is down|carrier|watchdog|reset"
# Look for: repeated "link down/up" = cable or switch issue
# USB/hardware changes
dmesg -T | grep -i -E "usb|new device|disconnect"
# Thermal issues
dmesg -T | grep -i -E "thermal|throttl|temperature"
# CPU throttling = overheating = check fans, airflow, ambient temp
Debug clue: If dmesg shows "soft lockup" messages, the CPU was stuck in kernel code for more than the watchdog threshold (default 20 seconds). In VMs this is often a hypervisor scheduling stall, not a real kernel bug. Check if the host is oversubscribed before blaming the kernel module.
# Soft lockups (CPU stuck in kernel code)
dmesg -T | grep -i "soft lockup"
# Usually indicates driver bug or heavy I/O under load
# RCU stalls (kernel synchronization delays)
dmesg -T | grep -i "rcu.*stall"
# Often caused by: high IRQ load, buggy kernel module, hypervisor issues
Gotcha: "Remounting Filesystem Read-Only"¶
You see this in dmesg and suddenly applications can't write:
# Diagnose
dmesg -T | grep -i "remount"
# EXT4-fs (sda1): Remounting filesystem read-only
# This means the filesystem detected corruption and protected itself
# Check the disk
smartctl -a /dev/sda | grep -E "Reallocated|Current_Pending|Offline_Uncorrectable"
# If SMART is OK, it may be a filesystem-level issue
# For ext4:
# DO NOT run fsck on a mounted filesystem
# Schedule a check on next reboot (sysvinit-era distros):
touch /forcefsck
# On systemd distros, prefer adding fsck.mode=force to the kernel command line
reboot
# For XFS — xfs_repair requires the filesystem to be unmounted, even for a dry run:
umount /dev/sda1
xfs_repair -n /dev/sda1 # -n = dry run (check only)
# If errors found:
xfs_repair /dev/sda1
mount -a # Remount via /etc/fstab
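To catch a silent remount before your users do, scan mount options for filesystems that have gone read-only. A sketch parsing /proc/mounts — the fstype exclusion list is an assumption, extend it for your fleet:

```shell
#!/bin/sh
# Flag rw-expected filesystems that are currently mounted read-only.
# /proc/mounts fields: device mountpoint fstype options dump pass
check_ro() {
    awk '$4 ~ /(^|,)ro(,|$)/ && $3 !~ /^(squashfs|iso9660|tmpfs|proc|sysfs)$/ {
        print $2, "is read-only (" $1 ")"
    }'
}

[ -r /proc/mounts ] && check_ro < /proc/mounts || true
```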
Pattern: kdump Capture and Analysis Workflow¶
# === After a panic, if kdump was configured ===
# 1. Find the crash dump
ls -la /var/crash/
# 127.0.0.1-2026-03-15-03:42:17/
# vmcore ← the crash dump
# vmcore-dmesg.txt ← kernel log at time of crash
# 2. Read the dmesg from the crash (no crash tool needed)
cat /var/crash/127.0.0.1-*/vmcore-dmesg.txt | tail -50
# Often the root cause is visible here
# 3. For deeper analysis, use crash
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
/var/crash/127.0.0.1-*/vmcore
# Inside crash:
crash> log # Full kernel log
crash> bt # Backtrace of panicking task
crash> bt -a # Backtrace of ALL CPUs
crash> ps -m # Process list with memory usage
crash> sys # System info
crash> mod # Loaded modules at crash time
# 4. Common patterns to check
crash> log | grep -i panic # What triggered the panic
crash> log | grep -i oom # OOM kill involved?
crash> log | grep -i bug # Kernel BUG() hit?
crash> bt | grep -i "module_name" # Was a specific module involved?
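When the crash tool or kernel debuginfo isn't installed, a quick triage of vmcore-dmesg.txt usually surfaces the headline cause. A hypothetical helper — the pattern list is a starting point, not exhaustive:

```shell
#!/bin/sh
# Pull the likely root-cause lines out of a vmcore-dmesg.txt on stdin.
summarize_panic() {
    grep -i -E "kernel panic|BUG:|Oops:|invoked oom-killer|soft lockup" | tail -5
}

# Usage: summarize_panic < /var/crash/127.0.0.1-*/vmcore-dmesg.txt
```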
Gotcha: D-State Processes (Uninterruptible Sleep)¶
Processes in D-state can't be killed (not even with SIGKILL). They're waiting for I/O that isn't completing.
# Find D-state processes
ps aux | awk '$8 ~ /D/'
# See what they're waiting on
cat /proc/<pid>/stack
# Typical output:
# [<0>] nfs4_wait_clnt_recover+0x32/0x50 [nfsv4]
# ← this tells you it's waiting on NFS recovery
# Common D-state causes:
# - NFS server unreachable (most common)
# - Disk I/O failure (bad disk, controller issue)
# - Kernel driver bug
# - iSCSI target unreachable
# For NFS-related D-state:
# Check NFS server connectivity
showmount -e nfs-server
# Check mount status
mount | grep nfs
# If NFS server is down, you may need to force-unmount
umount -f /mnt/nfs-share # Force unmount
umount -l /mnt/nfs-share # Lazy unmount (detach now, clean up later)
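The per-pid /proc/<pid>/stack reads can be summarized across all D-state tasks at once using ps's wchan column. A sketch — it assumes a "pid stat wchan comm" column layout and that wchan/comm contain no spaces:

```shell
#!/bin/sh
# List D-state tasks with the kernel function each is blocked in.
# Expects "pid stat wchan comm" columns on stdin.
list_dstate() {
    awk '$2 ~ /^D/ {printf "%s (%s) blocked in %s\n", $1, $4, $3}'
}

# Usage: ps -eo pid,stat,wchan:32,comm --no-headers | list_dstate
```

A burst of tasks all blocked in the same nfs or scsi function points straight at the stalled subsystem.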
Pattern: SysRq Remote Recovery¶
When a system is hung but you have IPMI/iLO/iDRAC/SSH access:
# Via SSH (if SSH is still responsive but the system is otherwise hung)
# Make sure the SysRq interface is enabled first
echo 1 > /proc/sys/kernel/sysrq
# Step 1: Dump task states to understand the hang
echo t > /proc/sysrq-trigger
dmesg | tail -100 # Read the task dump
# Step 2: Dump blocked (D-state) tasks specifically
echo w > /proc/sysrq-trigger
dmesg | tail -50
# Step 3: If you need to reboot, do it safely
echo s > /proc/sysrq-trigger # Sync
sleep 5
echo u > /proc/sysrq-trigger # Remount read-only
sleep 2
echo b > /proc/sysrq-trigger # Reboot
# Via IPMI (if SSH is dead)
ipmitool -I lanplus -H bmc-ip -U admin -P password chassis power cycle
# (prefer -f <password_file> over -P; -P exposes the password in ps output)
# Last resort — no sync/unmount possible via IPMI power cycle
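The sync → remount-ro → reboot sequence is easy to fat-finger under pressure; wrapping it in a function with a dry-run mode lets you rehearse it. A hypothetical sketch — DRY_RUN is an invented convention here, not a kernel feature:

```shell
#!/bin/sh
# SysRq key sequence for a best-effort clean reboot of a hung box.
# DRY_RUN=1 prints the keys instead of writing /proc/sysrq-trigger.
sysrq() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sysrq: $1"
    else
        echo "$1" > /proc/sysrq-trigger
    fi
}

safe_reboot() {
    sysrq s                               # sync dirty pages to disk
    [ "${DRY_RUN:-0}" = "1" ] || sleep 5
    sysrq u                               # remount filesystems read-only
    [ "${DRY_RUN:-0}" = "1" ] || sleep 2
    sysrq b                               # immediate reboot
}
```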
Pattern: Kernel Module Debugging¶
When you suspect a kernel module is causing issues:
# List loaded modules
lsmod
# Module details
modinfo <module_name>
# Module parameters (live)
systool -vm <module_name> # Requires sysfsutils
grep -r . /sys/module/<module_name>/parameters/ 2>/dev/null # Same info via sysfs
# Module-specific dmesg messages
dmesg | grep -i <module_name>
# Unload a problematic module (if not in use)
modprobe -r <module_name>
# Blacklist a module permanently
echo "blacklist <module_name>" > /etc/modprobe.d/blacklist-problem.conf
# Rebuild initramfs
dracut -f # RHEL/CentOS
update-initramfs -u # Debian/Ubuntu
# Load a module with debug parameters
modprobe <module_name> debug=1
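systool isn't always installed; the same live parameter values are exposed in sysfs. A sketch — the SYSFS override exists only to make the function testable and defaults to /sys:

```shell
#!/bin/sh
# Print name=value for every live parameter of a loaded module.
module_params() {
    dir="${SYSFS:-/sys}/module/$1/parameters"
    [ -d "$dir" ] || { echo "no parameters exposed for $1" >&2; return 1; }
    for f in "$dir"/*; do
        # Some parameters are write-only; cat failures are silenced
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}

# Usage: module_params nfs
```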
Gotcha: Kernel Panic Auto-Reboot Loop¶
The system panics, reboots (panic=N is set), hits the same bug, panics again. Infinite loop. You can't SSH in because it reboots every 30 seconds.
# At GRUB menu (via IPMI console):
# Edit kernel line, add:
panic=0
# This makes the system HALT on panic instead of reboot
# Now you can read the panic message on the console
# If the panic is caused by a module:
# Add to kernel line:
modprobe.blacklist=problematic_module
# If the panic is during boot:
# Boot into rescue mode from GRUB
# Or add: systemd.unit=rescue.target
# If you can't get to GRUB:
# Boot from rescue media (installation ISO in rescue mode)
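Once you're booted again, make the workaround persistent instead of retyping it at GRUB every boot. A sketch that appends an argument to GRUB_CMDLINE_LINUX in /etc/default/grub — idempotent, and assumes the GNU sed found on Linux:

```shell
#!/bin/sh
# Append a kernel argument to GRUB_CMDLINE_LINUX if not already present.
add_cmdline_arg() {
    arg=$1 file=$2
    grep -q "$arg" "$file" || \
        sed -i "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 $arg\"/" "$file"
}

# Usage: add_cmdline_arg "modprobe.blacklist=problematic_module" /etc/default/grub
# Then regenerate: grub2-mkconfig -o /boot/grub2/grub.cfg   (RHEL)
#                  update-grub                               (Debian/Ubuntu)
```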
Pattern: OOM Kill Investigation¶
# 1. Find recent OOM kills
dmesg -T | grep -B5 -A15 "Out of memory"
# Key lines to look for:
# "Out of memory: Kill process 12345 (java) score 820 or sacrifice child"
# "Killed process 12345 (java) total-vm:8388608kB, anon-rss:7340032kB"
# ^ total-vm = virtual, anon-rss = actual physical memory used
Under the hood: The OOM killer scores processes by RSS size relative to total memory, then adjusts by oom_score_adj. A process with oom_score_adj=-1000 is immune; +1000 is always killed first. Kubernetes sets oom_score_adj based on QoS class: Guaranteed=-997, BestEffort=1000, Burstable=scaled by request ratio. This is why BestEffort pods die first.
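The Burstable scaling can be sketched numerically. This mirrors the kubelet's formula (1000 − 1000·request/capacity, clamped to stay between the Guaranteed and BestEffort values); treat the exact clamp constants as an approximation of kubelet internals:

```shell
#!/bin/sh
# Approximate oom_score_adj the kubelet assigns a Burstable pod,
# given its memory request and node capacity in bytes.
burstable_oom_adj() {
    req=$1 cap=$2
    adj=$(( 1000 - (1000 * req) / cap ))
    [ "$adj" -lt 3 ] && adj=3       # stay above Guaranteed (-997)
    [ "$adj" -gt 999 ] && adj=999   # stay below BestEffort (1000)
    echo "$adj"
}

# A pod requesting 1GiB on a 4GiB node:
burstable_oom_adj $((1 << 30)) $((4 << 30))   # → 750
```

The bigger the request relative to the node, the lower the score — which is why right-sized requests are an OOM-survival mechanism, not just a scheduling hint.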
# 2. Check what triggered OOM (memory allocation that failed)
dmesg -T | grep "invoked oom-killer"
# "nginx invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)"
# 3. Check current memory pressure
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Committed"
# 4. Find current top memory consumers
ps aux --sort=-%mem | head -10
# 5. Check OOM scores (higher = more likely to be killed)
for pid in $(ls /proc/ | grep -E '^[0-9]+$'); do
name=$(cat /proc/$pid/comm 2>/dev/null)
score=$(cat /proc/$pid/oom_score 2>/dev/null)
[ -n "$score" ] && [ "$score" -gt 100 ] && echo "$pid $name $score"
done | sort -k3 -rn | head -10
# 6. Protect critical processes
echo -1000 > /proc/$(pgrep -o -f "critical-process")/oom_score_adj # -o = oldest match (pgrep may return several PIDs)
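For postmortems, the kill lines can be parsed into fields instead of eyeballed. A sed sketch matching the message format shown in step 1:

```shell
#!/bin/sh
# Extract victim pid, name, and resident memory from "Killed process" lines.
parse_oom_kill() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*kB\).*/pid=\1 name=\2 rss=\3/p'
}

# Usage: dmesg -T | parse_oom_kill
```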
Gotcha: Wrong crashkernel Size¶
kdump is configured but fails to capture a dump because the reserved memory is too small for the crash kernel to boot.
# Check current reservation
cat /proc/cmdline | grep crashkernel
cat /proc/iomem | grep -i crash
# Recommended sizes:
# < 4GB RAM: crashkernel=128M
# 4-64GB RAM: crashkernel=256M
# 64-1TB RAM: crashkernel=512M
# > 1TB RAM: crashkernel=1G
# Test kdump capture (WILL CRASH THE SYSTEM)
# Only do this during a maintenance window!
echo c > /proc/sysrq-trigger
# System will panic, kdump should capture, then reboot
# Check /var/crash/ after reboot
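The size table above can be turned into a check against what's actually reserved. A sketch of the sizing half — thresholds copied from the table, adjust per vendor guidance:

```shell
#!/bin/sh
# Recommend a crashkernel reservation given MemTotal in kB.
crashkernel_size() {
    gb=$(( $1 / 1024 / 1024 ))
    if   [ "$gb" -lt 4 ];    then echo 128M
    elif [ "$gb" -le 64 ];   then echo 256M
    elif [ "$gb" -le 1024 ]; then echo 512M
    else                          echo 1G
    fi
}

# Usage: crashkernel_size "$(awk '/MemTotal/ {print $2}' /proc/meminfo)"
```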
Pattern: Monitoring Kernel Health Proactively¶
# Add to monitoring/cron:
# Check for new dmesg errors every 5 minutes
*/5 * * * * dmesg -T -l err,crit,alert,emerg | tail -5 | \
logger -t kernel-monitor
# Check for taint
cat /proc/sys/kernel/tainted # Alert if non-zero unexpectedly
# Check for soft lockups (dmesg counts are cumulative since boot — alert on deltas)
dmesg | grep -c "soft lockup" # Alert if > 0
# Check for OOM kills
dmesg | grep -c "Killed process" # Alert if > 0
# Check kdump readiness
kdumpctl status | grep -q "operational" || echo "KDUMP NOT READY"
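The individual greps above can be rolled into one pass over the kernel log for an alerting pipeline. A sketch — the pattern list matches the checks above; feed it dmesg output:

```shell
#!/bin/sh
# Summarize kernel-health signals from log text on stdin.
kernel_health() {
    awk '
        /soft lockup/    { lockups++ }
        /Killed process/ { ooms++ }
        /rcu.*stall/     { stalls++ }
        END { printf "lockups=%d ooms=%d rcu_stalls=%d\n", lockups, ooms, stalls }
    '
}

# Usage: dmesg | kernel_health   # alert on any nonzero counter
```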