Linux Deep Triage¶
Advanced debugging tools for when basic troubleshooting has not found the root cause.
perf: CPU Profiling¶
When To Use¶
- Application is consuming high CPU but you do not know which function or code path
- You need to identify hot spots in compiled binaries or kernel code
- Latency spikes that correlate with CPU usage
- You want data, not guesses, about where CPU time is spent
Quick Start¶
# See what is burning CPU right now (live, like top for functions)
perf top
# Record a profile for 30 seconds system-wide
perf record -g -a -- sleep 30
# Record a specific process
perf record -g -p $(pgrep myapp) -- sleep 30
# Analyze the recording
perf report
# Navigate with arrow keys. Enter to expand call chains.
# Look at the "Overhead" column -- highest percentage = hottest code path.
# Generate a flame graph (requires Brendan Gregg's FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# Open in a browser. Wide bars = functions spending the most CPU time.
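The record-and-render steps above can be wrapped in a small helper. A sketch, assuming `perf` and Brendan Gregg's FlameGraph scripts (`stackcollapse-perf.pl`, `flamegraph.pl`) are on PATH; adjust paths for your install:

```shell
# flame <pid> <seconds>: record a process and write a flame graph SVG.
# Assumes perf and the FlameGraph scripts are on PATH.
flame() {
    if [ "$#" -ne 2 ]; then
        echo "usage: flame <pid> <seconds>" >&2
        return 1
    fi
    perf record -g -p "$1" -o /tmp/perf.data -- sleep "$2" &&
    perf script -i /tmp/perf.data |
        stackcollapse-perf.pl | flamegraph.pl > /tmp/flamegraph.svg &&
    echo "wrote /tmp/flamegraph.svg"
}
```

Usage: `flame $(pgrep myapp) 30`, then open `/tmp/flamegraph.svg` in a browser.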
Real Example¶
# Application is at 100% CPU. What is it doing?
perf record -g -p $(pgrep python3) -- sleep 10
perf report --stdio | head -40
# Output shows 60% of time in json.loads() -> you have a JSON deserialization bottleneck
# For Java: perf sees JIT-compiled frames as "[unknown]"
# Fix: run Java with -XX:+PreserveFramePointer and use perf-map-agent
Gotchas¶
- `perf record` writes to `perf.data` in the current directory. It can get large (hundreds of MB for long recordings). Use `-o /tmp/perf.data` to control the location.
- Kernel symbols require `/proc/kallsyms` to be readable. If you see hex addresses instead of function names: `echo 0 > /proc/sys/kernel/kptr_restrict`.
- In containers, perf needs the `SYS_ADMIN` capability, or you must run it on the host and filter by PID.
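Unprivileged perf is also gated by `kernel.perf_event_paranoid`. A small helper to interpret the value (meanings paraphrased from the kernel docs; check your kernel's documentation for the authoritative wording):

```shell
# Interpret /proc/sys/kernel/perf_event_paranoid for unprivileged users.
paranoid_meaning() {
    case "$1" in
        -1) echo "no restrictions" ;;
         0) echo "CPU events allowed, raw tracepoints disallowed" ;;
         1) echo "no system-wide profiling (own processes only)" ;;
         2) echo "user-space profiling of own processes only" ;;
         *) echo "unknown value" ;;
    esac
}

paranoid_meaning "$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null)"
```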
OOM Killer Analysis¶
When To Use¶
- Process was killed unexpectedly with no application-level error
- `dmesg` shows "Out of memory" messages
- Containers restarting with exit code 137
- System becomes unresponsive periodically then recovers (OOM killed a large process)
Quick Start¶
# Check for recent OOM events
dmesg -T | grep -i "out of memory\|oom-killer\|killed process"
# Detailed OOM report (dmesg shows full memory state at time of kill)
dmesg -T | grep -A 30 "oom-killer"
# Key fields:
# "Killed process <PID> (name) total-vm:XXkB, anon-rss:XXkB"
# anon-rss = actual physical memory used by the process
# Check current memory state
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Committed_AS"
# MemAvailable = what the kernel thinks is available (includes reclaimable caches)
# Committed_AS = total memory promised to processes (can exceed physical RAM with overcommit)
# Check overcommit settings
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default, allows some overcommit)
# 1 = always overcommit (never refuse malloc)
# 2 = strict (refuse if commit exceeds swap + ratio*physical)
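To turn the Committed_AS figure into something actionable, compare it against CommitLimit (also in /proc/meminfo). A minimal sketch; the sample numbers in the demo are made up:

```shell
# Print Committed_AS as a percentage of CommitLimit.
# Under strict overcommit (mode 2), allocations start failing as this nears 100.
commit_pct() {
    awk '/^CommitLimit:/ {limit=$2} /^Committed_AS:/ {commit=$2}
         END {printf "%d\n", commit * 100 / limit}'
}

# Live: commit_pct < /proc/meminfo
# Demo with sample values:
printf 'CommitLimit:     8000000 kB\nCommitted_AS:    6000000 kB\n' | commit_pct
# prints: 75
```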
Cgroup OOM vs Host OOM¶
These are different events and require different responses.
# Cgroup OOM (container hit its memory limit):
dmesg -T | grep "memory cgroup out of memory"
# The cgroup limit is enforced, not the host running out of memory
# Fix: increase the container memory limit or fix the memory leak
# Check cgroup limits for a running container
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
# Or with cgroup v2:
cat /sys/fs/cgroup/<slice>/memory.max
cat /sys/fs/cgroup/<slice>/memory.current
# Host OOM (entire system out of memory):
# The kernel chooses a victim based on oom_score
# Check which processes the kernel would likely kill first (highest oom_score).
# (ps has no oom_score column; read it from /proc directly.)
for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "$(cat "$p/oom_score" 2>/dev/null)" "${p#/proc/}" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -10
# Protect a critical process from OOM killer:
echo -1000 > /proc/<pid>/oom_score_adj
# Only do this for truly critical processes (like a database)
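A value echoed into /proc/<pid>/oom_score_adj is lost when the process restarts. For a systemd-managed service, set it persistently in the unit instead (the drop-in path below is illustrative):

```ini
# /etc/systemd/system/myapp.service.d/override.conf
[Service]
OOMScoreAdjust=-1000
```

Apply with `systemctl daemon-reload && systemctl restart myapp.service`.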
Gotchas¶
- OOM kills are logged in `dmesg` but NOT always in `/var/log/syslog` (depends on rsyslog config). Always check `dmesg` first.
- A process showing 10GB VSZ (virtual size) is not necessarily a problem. RSS (resident set size) is the actual physical memory used.
- `MemFree` dropping to near zero does not mean OOM is imminent. The kernel aggressively uses free memory for page cache. Look at `MemAvailable`, not `MemFree`.
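To pull the victim PID, name, and anon-rss out of the kernel's "Killed process" line format shown above, a sed one-liner works (the sample line is made up):

```shell
# Extract "PID name anon-rss" from kernel OOM kill messages on stdin.
parse_oom_kill() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3 kB/p'
}

# Live: dmesg -T | parse_oom_kill
echo 'Out of memory: Killed process 4321 (mysqld) total-vm:18000000kB, anon-rss:12000000kB, file-rss:0kB' | parse_oom_kill
# prints: 4321 mysqld 12000000 kB
```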
IO Triage¶
When To Use¶
- System feels sluggish but CPU usage is low
- `top` shows a high `wa` (IO wait) percentage
- Database queries suddenly slow
- Application log writes are blocking
Quick Start¶
# Overview: which devices are busy?
iostat -xz 2 5
# Key columns:
# %util - how busy the device is (>90% = saturated for spinning disks, less meaningful for SSDs)
# await - average IO latency in ms (>10ms on an SSD = problem; 10-20ms is typical for a spinning disk)
# r/s,w/s - IOPS
# rkB/s, wkB/s - throughput
# Who is doing the IO?
iotop -oa
# Shows cumulative IO per process. -o = only show processes doing IO.
# Requires root. If not available: use pidstat -d 2
# Check for IO errors
dmesg -T | grep -i "i/o error\|buffer i/o\|read error\|write error"
# Deep dive: block-level tracing
blktrace -d /dev/sda -o - | blkparse -i - | head -100
# Shows every IO operation at the block layer
# Useful for identifying IO patterns (sequential vs random, read vs write)
# Check scheduler and queue depth
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
# For SSDs: "none" or "mq-deadline" is usually best
# For HDDs: "bfq" or "mq-deadline"
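The scheduler file marks the active scheduler in brackets. To extract just the active one for scripting or monitoring, the bracketed token can be pulled out:

```shell
# The active scheduler appears in brackets, e.g. "[mq-deadline] kyber none".
active_sched() {
    grep -o '\[[^]]*\]' | tr -d '[]'
}

# Live: active_sched < /sys/block/sda/queue/scheduler
echo '[mq-deadline] kyber bfq none' | active_sched
# prints: mq-deadline
```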
Real Example¶
# Database is slow. Is it IO?
iostat -xz 2 3
# sda shows %util=98%, await=45ms. The disk is saturated.
# What process is causing it?
iotop -oa -t
# mysqld is doing 500 writes/sec. Check slow query log.
# Is it the journal? Swap? Something else?
pidstat -d 2 1 | sort -k5 -rn | head
# One 2-second sample sorted by kB_wr/s (column 5) -- per-process IO breakdown
Gotchas¶
- `%util` at 100% for NVMe does NOT mean the device is saturated. NVMe drives handle massive parallelism. Look at `await` instead.
- The first `iostat` report shows averages since boot. Ignore it. Use the second and subsequent reports for the current state.
- If `iotop` is not installed and you cannot install packages: `cat /proc/<pid>/io` shows per-process IO counters.
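A quick way to read those counters: /proc/<pid>/io has one `field: value` per line, and read_bytes/write_bytes count real storage IO (not cache hits). A parser sketch with sample input:

```shell
# Print "read_bytes write_bytes" from /proc/<pid>/io-format input.
io_bytes() {
    awk '/^read_bytes:/ {r=$2} /^write_bytes:/ {w=$2} END {print r, w}'
}

# Live: io_bytes < /proc/$(pgrep myapp)/io   (needs root or same user)
printf 'rchar: 999\nwchar: 888\nread_bytes: 1048576\nwrite_bytes: 4096\n' | io_bytes
# prints: 1048576 4096
```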
systemd Deep Ops¶
When To Use¶
- Services failing to start with unclear error messages
- Resource isolation between services (CPU, memory, IO)
- Journal corruption or missing logs
- Understanding why a cgroup is limiting a process
Quick Start¶
# Service debugging
systemctl status myapp.service # Basic state + recent logs
journalctl -u myapp.service -n 50 # Last 50 log lines
journalctl -u myapp.service -p err # Only errors
systemctl show myapp.service # ALL properties (hundreds of them)
# Slice hierarchy (cgroup tree for resource management)
systemd-cgls # Full cgroup tree
systemd-cgtop # Real-time resource usage per cgroup
# Resource accounting: see what each service is consuming
systemctl show myapp.service -p MemoryCurrent,CPUUsageNSec,IPIngressBytes
# Set resource limits on a service (transient, until restart)
systemctl set-property myapp.service MemoryMax=512M CPUQuota=200%
# Persistent: add to the unit file [Service] section:
# MemoryMax=512M
# CPUQuota=200%
# cgroup v2 direct inspection
cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
cat /sys/fs/cgroup/system.slice/myapp.service/memory.max
cat /sys/fs/cgroup/system.slice/myapp.service/cpu.stat
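memory.events in the same directory is worth checking too: a nonzero oom_kill counter means the kernel has killed tasks in this cgroup. A helper to extract it (sample input shown; field names are from cgroup v2):

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file.
oom_kills() {
    awk '/^oom_kill / {print $2}'
}

# Live: oom_kills < /sys/fs/cgroup/system.slice/myapp.service/memory.events
printf 'low 0\nhigh 0\nmax 12\noom 3\noom_kill 1\n' | oom_kills
# prints: 1
```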
Journal Corruption Recovery¶
# Symptoms: journalctl shows "Journal file corrupted" or returns no output
# Check journal health
journalctl --verify
# Shows corrupt journal files
# Fix: remove corrupt files and restart
rm /var/log/journal/*/system@*.journal~ # Corrupt files end with ~
rm /var/log/journal/*/user-*.journal~
systemctl restart systemd-journald
# If all journals are corrupt:
rm -rf /var/log/journal/*
systemctl restart systemd-journald
# You lose all historical logs. This is the nuclear option.
# Prevent recurrence: check disk space
df -h /var/log/journal/
# Set journal size limit:
# /etc/systemd/journald.conf:
# SystemMaxUse=2G
# SystemKeepFree=1G
systemctl restart systemd-journald
Gotchas¶
- `systemctl restart` and `systemctl stop && systemctl start` are NOT the same. Restart sends SIGTERM, waits, sends SIGKILL if needed, then starts. Stop + start gives you a window to check state in between.
- `CPUQuota=200%` means 2 full CPU cores (100% per core). This is not a typo.
- `MemoryMax` kills the process when exceeded. `MemoryHigh` throttles it instead (slows it down, does not kill). Use `MemoryHigh` for soft limits.
strace / ltrace: Last-Resort Debugging¶
When To Use¶
- Application fails with no useful log output
- You need to see exactly what system calls a process is making
- File permission errors that are not obvious
- Network connection failures at the syscall level
- "Works on my machine" problems where environment differences matter
Quick Start¶
# Trace a running process (attach)
strace -p $(pgrep myapp) -f -tt -o /tmp/strace.out
# -f = follow child processes (threads)
# -tt = microsecond timestamps
# -o = write to file (do not pollute terminal)
# Trace a command from start
strace -f -tt -o /tmp/strace.out myapp --start
# Filter by syscall type (much less noise)
strace -p <pid> -e trace=open,openat,read,write # File operations
strace -p <pid> -e trace=network # Network operations
strace -p <pid> -e trace=file # File-related syscalls
# Show a summary of time spent in each syscall
strace -c -p <pid>
# After Ctrl+C, shows a table of syscall counts, errors, and time
# Useful to see if the process is stuck in a specific syscall
# ltrace: same concept but for library calls (libc, etc.)
ltrace -p <pid> -e 'malloc+free' # Track memory allocations (filter items are +-separated)
ltrace -p <pid> -e getenv # See what env vars it reads
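Failing syscalls are usually the interesting ones. Newer strace (5.2+) can filter them directly with `-Z` or `-e status=failed`; on older versions, grep the output for `= -1` (the sample lines below are made up):

```shell
# Keep only syscalls that returned an error (-1 with errno).
failed_calls() {
    grep -F -- '= -1 '
}

printf '%s\n%s\n' \
  'openat(AT_FDCWD, "/etc/ok.conf", O_RDONLY) = 3' \
  'openat(AT_FDCWD, "/etc/missing.conf", O_RDONLY) = -1 ENOENT (No such file or directory)' \
  | failed_calls
# prints only the ENOENT line
```

Usage: `strace -f -o /tmp/strace.out myapp; failed_calls < /tmp/strace.out`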
Real Example¶
# Application exits silently with code 1. No logs.
strace -f ./myapp 2>&1 | tail -30
# Output shows:
# openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT
# write(2, "fatal error\n", 12)
# The config file is missing. The app wrote "fatal error" to stderr
# but stderr was redirected to /dev/null in the systemd unit.
# Application connects to database but gets timeout
strace -p <pid> -e trace=network
# Shows: connect(3, {sa_family=AF_INET, sin_port=htons(5432),
# sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ETIMEDOUT
# The database IP is unreachable. Not a DNS problem -- it resolved fine.
Gotchas¶
- strace adds significant overhead (10-100x slowdown). Never leave it attached to a production process longer than necessary. Seconds, not minutes.
- In containers, you may need the `SYS_PTRACE` capability to attach. Or run strace from the host inside the container's namespaces: `nsenter -t <pid> -m -u -i -n -p -- strace -p 1`.
- strace output is overwhelming. Always filter with `-e trace=` or write to a file and grep.
- ltrace does not work well with statically compiled binaries (Go, Rust). Use strace instead.
- For Go applications, strace is useful, but `GODEBUG` environment variables and the `delve` debugger are often more productive.