Linux Deep Triage

Advanced debugging tools for when basic troubleshooting has not found the root cause.

perf: CPU Profiling

When To Use

  • Application is consuming high CPU but you do not know which function or code path
  • You need to identify hot spots in compiled binaries or kernel code
  • Latency spikes that correlate with CPU usage
  • You want data, not guesses, about where CPU time is spent

Quick Start

# See what is burning CPU right now (live, like top for functions)
perf top

# Record a profile for 30 seconds system-wide
perf record -g -a -- sleep 30

# Record a specific process
perf record -g -p $(pgrep myapp) -- sleep 30

# Analyze the recording
perf report
# Navigate with arrow keys. Enter to expand call chains.
# Look at the "Overhead" column -- highest percentage = hottest code path.

# Generate a flame graph (requires Brendan Gregg's FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# Open in a browser. Wide bars = functions spending the most CPU time.
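To automate hot-spot detection (alerting, CI regression checks), the report parses cleanly with awk. A minimal sketch; the sample report text below is invented for illustration, but real perf report --stdio output has the same Overhead-first column layout:

```shell
# Hypothetical perf report --stdio output; layout is
# Overhead  Command  Shared Object  Symbol
sample='
# Overhead  Command  Shared Object      Symbol
    61.20%  python3  python3            [.] _PyEval_EvalFrameDefault
    12.04%  python3  libc-2.31.so       [.] __memmove_avx_unaligned
     4.11%  python3  python3            [.] PyDict_GetItem
'
# Rows are sorted by overhead, so the first data row (first field
# ending in %) names the hottest symbol
hottest=$(printf '%s\n' "$sample" | awk '$1 ~ /%$/ {print $NF; exit}')
echo "$hottest"
```

On a live box, replace the sample with: perf report --stdio | awk '...'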

Real Example

# Application is at 100% CPU. What is it doing?
perf record -g -p $(pgrep python3) -- sleep 10
perf report --stdio | head -40
# Output shows 60% of time in json.loads() -> you have a serialization bottleneck

# For Java: perf sees JIT-compiled frames as "[unknown]"
# Fix: run Java with -XX:+PreserveFramePointer and use perf-map-agent

Gotchas

  • perf record writes to perf.data in the current directory. It can get large (hundreds of MB for long recordings). Use -o /tmp/perf.data to control location.
  • Kernel symbols require /proc/kallsyms to be readable. If you see hex addresses instead of function names: echo 0 > /proc/sys/kernel/kptr_restrict.
  • In containers, perf needs SYS_ADMIN capability or you must run it on the host and filter by PID.

OOM Killer Analysis

When To Use

  • Process was killed unexpectedly with no application-level error
  • dmesg shows "Out of memory" messages
  • Containers restarting with exit code 137
  • System becomes unresponsive periodically then recovers (OOM killed a large process)

Quick Start

# Check for recent OOM events
dmesg -T | grep -i "out of memory\|oom-killer\|killed process"

# Detailed OOM report (dmesg shows full memory state at time of kill)
dmesg -T | grep -A 30 "oom-killer"
# Key fields:
#   "Killed process <PID> (name) total-vm:XXkB, anon-rss:XXkB"
#   anon-rss = actual physical memory used by the process

# Check current memory state
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Committed_AS"
# MemAvailable = what the kernel thinks is available (includes reclaimable caches)
# Committed_AS = total memory promised to processes (can exceed physical RAM with overcommit)

# Check overcommit settings
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default, allows some overcommit)
# 1 = always overcommit (never refuse malloc)
# 2 = strict (refuse if commit exceeds swap + ratio*physical)
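Whether overcommit pressure is building can be checked by comparing Committed_AS against CommitLimit (both in /proc/meminfo). A sketch that reads meminfo-format text on stdin so it can be tested offline; the fragment piped in below is hypothetical:

```shell
# check_commit: report Committed_AS as a percentage of CommitLimit.
# Live use: check_commit < /proc/meminfo
check_commit() {
    awk '/^CommitLimit:/ {limit=$2} /^Committed_AS:/ {as=$2}
         END {
             pct = (limit > 0) ? int(as * 100 / limit) : 0
             printf "Committed_AS is %d%% of CommitLimit\n", pct
         }'
}

# Hypothetical meminfo fragment (values in kB)
out=$(printf 'CommitLimit:     8000000 kB\nCommitted_AS:    6000000 kB\n' | check_commit)
echo "$out"
```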

Cgroup OOM vs Host OOM

These are different events and require different responses.

# Cgroup OOM (container hit its memory limit):
dmesg -T | grep -i "memory cgroup out of memory"
# The cgroup limit is enforced, not the host running out of memory
# Fix: increase the container memory limit or fix the memory leak

# Check cgroup limits for a running container
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
# Or with cgroup v2:
cat /sys/fs/cgroup/<slice>/memory.max
cat /sys/fs/cgroup/<slice>/memory.current

# Host OOM (entire system out of memory):
# The kernel chooses a victim based on oom_score
# Check which processes the kernel would kill first (highest oom_score).
# ps has no oom_score output field, so read /proc directly:
for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "${p#/proc/}" "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -k2 -rn | head -10

# Protect a critical process from OOM killer:
echo -1000 > /proc/<pid>/oom_score_adj
# Only do this for truly critical processes (like a database)
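A small guard around that echo avoids writing an out-of-range value. This is a sketch, not a standard tool; the PROC_ROOT parameter is an assumption added only so the function can be exercised against a scratch directory:

```shell
# set_oom_adj PID VALUE [PROC_ROOT] -- write oom_score_adj with range
# checking. Valid range is -1000 (never kill) to 1000 (kill first).
# PROC_ROOT defaults to /proc.
set_oom_adj() {
    pid=$1; val=$2; root=${3:-/proc}
    if [ "$val" -lt -1000 ] || [ "$val" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 1
    fi
    echo "$val" > "$root/$pid/oom_score_adj"
}

# Live use (exempt the oldest mysqld from the OOM killer):
#   set_oom_adj "$(pgrep -o mysqld)" -1000
```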

Gotchas

  • OOM kills are logged in dmesg but NOT always in /var/log/syslog (depends on rsyslog config). Always check dmesg first.
  • A process showing 10GB VSZ (virtual size) is not necessarily a problem. RSS (resident set size) is actual physical memory used.
  • MemFree dropping to near zero does not mean OOM is imminent. The kernel aggressively uses free memory for page cache, which is reclaimable under pressure. Look at MemAvailable, not MemFree.

IO Triage

When To Use

  • System feels sluggish but CPU usage is low
  • top shows high wa (IO wait) percentage
  • Database queries suddenly slow
  • Application log writes are blocking

Quick Start

# Overview: which devices are busy?
iostat -xz 2 5
# Key columns:
#   %util   - how busy the device is (>90% = saturated for spinning disks, less meaningful for SSDs)
#   await   - average IO latency in ms (>10ms on SSD = problem; 10-20ms on HDD = normal seek latency)
#   r/s,w/s - IOPS
#   rkB/s, wkB/s - throughput
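Those await thresholds can be scripted. A sketch that locates the await column by header name, since column positions vary across sysstat versions; the sample table below is simplified (real iostat -x output has more columns):

```shell
# Flag devices whose await exceeds 10ms, from iostat -xz style output.
# Hypothetical, trimmed sample:
sample='Device   r/s   w/s   rkB/s   wkB/s   await   %util
sda      10.0  500.0 120.0   9800.0  45.3    98.0
nvme0n1  800.0 200.0 64000.0 8000.0  0.4     99.0'

flagged=$(printf '%s\n' "$sample" | awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    $col > 10 { print $1, "await=" $col "ms" }')
echo "$flagged"
```

Note the nvme device at 99% util is not flagged: its await is fine, which matches the NVMe gotcha below about %util being misleading.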

# Who is doing the IO?
iotop -oa
# Shows cumulative IO per process. -o = only show processes doing IO.
# Requires root. If not available: use pidstat -d 2

# Check for IO errors
dmesg -T | grep -i "i/o error\|buffer i/o\|read error\|write error"

# Deep dive: block-level tracing
blktrace -d /dev/sda -o - | blkparse -i - | head -100
# Shows every IO operation at the block layer
# Useful for identifying IO patterns (sequential vs random, read vs write)

# Check scheduler and queue depth
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
# For SSDs: "none" or "mq-deadline" is usually best
# For HDDs: "bfq" or "mq-deadline"

Real Example

# Database is slow. Is it IO?
iostat -xz 2 3
# sda shows %util=98%, await=45ms. The disk is saturated.

# What process is causing it?
iotop -oa -t
# mysqld is doing 500 writes/sec. Check slow query log.

# Is it the journal? Swap? Something else?
pidstat -d 2 1 | sort -k5 -rn | head
# One 2-second report (the trailing 1) so the pipeline terminates.
# Sorted on the kB_wr/s column for a per-process write breakdown.

Gotchas

  • %util at 100% for NVMe does NOT mean the device is saturated. NVMe drives handle massive parallelism. Look at await instead.
  • iostat first report shows averages since boot. Ignore it. Use the second and subsequent reports for current state.
  • If iotop is not installed and you cannot install packages: cat /proc/<pid>/io shows per-process IO counters.
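Building on that last gotcha, two snapshots of /proc/&lt;pid&gt;/io can be diffed to get per-process IO rates with no extra packages. A sketch; the snapshot filenames in the usage comment are whatever you choose:

```shell
# io_delta FILE1 FILE2 -- diff two /proc/<pid>/io snapshots and report
# bytes actually read from / written to the block layer in between.
# Live use:
#   cat /proc/$PID/io > a; sleep 5; cat /proc/$PID/io > b; io_delta a b
io_delta() {
    awk '
        FNR == NR { v[$1] = $2; next }
        $1 == "read_bytes:"  { printf "read:  %d bytes\n", $2 - v[$1] }
        $1 == "write_bytes:" { printf "write: %d bytes\n", $2 - v[$1] }
    ' "$1" "$2"
}
```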

systemd Deep Ops

When To Use

  • Services failing to start with unclear error messages
  • Resource isolation between services (CPU, memory, IO)
  • Journal corruption or missing logs
  • Understanding why a cgroup is limiting a process

Quick Start

# Service debugging
systemctl status myapp.service          # Basic state + recent logs
journalctl -u myapp.service -n 50       # Last 50 log lines
journalctl -u myapp.service -p err      # Only errors
systemctl show myapp.service            # ALL properties (hundreds of them)

# Slice hierarchy (cgroup tree for resource management)
systemd-cgls                            # Full cgroup tree
systemd-cgtop                           # Real-time resource usage per cgroup

# Resource accounting: see what each service is consuming
systemctl show myapp.service -p MemoryCurrent,CPUUsageNSec,IPIngressBytes

# Set resource limits on a service (transient, until restart)
systemctl set-property myapp.service MemoryMax=512M CPUQuota=200%
# Persistent: add to the unit file [Service] section:
#   MemoryMax=512M
#   CPUQuota=200%

# cgroup v2 direct inspection
cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
cat /sys/fs/cgroup/system.slice/myapp.service/memory.max
cat /sys/fs/cgroup/system.slice/myapp.service/cpu.stat
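Those two memory files combine into a one-line utilization check. A sketch; note that memory.max contains the literal string "max" when no limit is set, which a naive division would choke on:

```shell
# cg_mem_pct CGROUP_PATH -- memory.current as a percentage of
# memory.max for a cgroup v2 directory, handling the unlimited case.
cg_mem_pct() {
    cur=$(cat "$1/memory.current")
    max=$(cat "$1/memory.max")
    if [ "$max" = "max" ]; then
        echo "unlimited (current: $cur bytes)"
    else
        echo "$((cur * 100 / max))% of $max bytes"
    fi
}

# Live use:
#   cg_mem_pct /sys/fs/cgroup/system.slice/myapp.service
```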

Journal Corruption Recovery

# Symptoms: journalctl shows "Journal file corrupted" or returns no output

# Check journal health
journalctl --verify
# Shows corrupt journal files

# Fix: remove corrupt files and restart
rm /var/log/journal/*/system@*.journal~    # Corrupt files end with ~
rm /var/log/journal/*/user-*.journal~
systemctl restart systemd-journald

# If all journals are corrupt:
rm -rf /var/log/journal/*
systemctl restart systemd-journald
# You lose all historical logs. This is the nuclear option.

# Prevent recurrence: check disk space
df -h /var/log/journal/
# Set journal size limit:
# /etc/systemd/journald.conf:
#   SystemMaxUse=2G
#   SystemKeepFree=1G
systemctl restart systemd-journald
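Before reaching for rm, it helps to list exactly which files journald has marked corrupt (it renames them with a trailing ~) so you can review what would be deleted. A small sketch, not a systemd tool:

```shell
# list_corrupt_journals [DIR] -- list journal files journald has
# renamed aside as corrupt (trailing ~). DIR defaults to
# /var/log/journal; the parameter allows testing on a scratch dir.
list_corrupt_journals() {
    find "${1:-/var/log/journal}" -name '*.journal~' -type f 2>/dev/null
}
```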

Gotchas

  • systemctl restart and systemctl stop && systemctl start are NOT the same. Restart stops the unit (SIGTERM, then SIGKILL after TimeoutStopSec if it has not exited) and starts it again as one operation. Stop followed by start gives you a window to inspect state in between.
  • CPUQuota=200% means 2 full CPU cores (100% per core). This is not a typo.
  • MemoryMax kills the process when exceeded. MemoryHigh throttles it instead (slows down, does not kill). Use MemoryHigh for soft limits.

strace / ltrace: Last-Resort Debugging

When To Use

  • Application fails with no useful log output
  • You need to see exactly what system calls a process is making
  • File permission errors that are not obvious
  • Network connection failures at the syscall level
  • "Works on my machine" problems where environment differences matter

Quick Start

# Trace a running process (attach)
strace -p $(pgrep myapp) -f -tt -o /tmp/strace.out
# -f = follow child processes (threads)
# -tt = microsecond timestamps
# -o = write to file (do not pollute terminal)

# Trace a command from start
strace -f -tt -o /tmp/strace.out myapp --start

# Filter by syscall type (much less noise)
strace -p <pid> -e trace=open,openat,read,write   # File operations
strace -p <pid> -e trace=network                    # Network operations
strace -p <pid> -e trace=file                       # File-related syscalls

# Show a summary of time spent in each syscall
strace -c -p <pid>
# After Ctrl+C, shows a table of syscall counts, errors, and time
# Useful to see if the process is stuck in a specific syscall
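The summary table also parses easily if you want the top offender in a script. The sample output below is invented, but real strace -c output has the same layout and is sorted by % time descending, so the first data row is the most expensive syscall:

```shell
# Hypothetical strace -c summary
sample='% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.10    1.203311        2406       500           poll
 20.05    0.334625          66      5000        12 read
  7.85    0.131002          26      5000           write'

# Skip the two header lines; syscall name is the last field
top=$(printf '%s\n' "$sample" | awk 'NR > 2 { print $NF; exit }')
echo "$top"
```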

# ltrace: same concept but for library calls (libc, etc.)
ltrace -p <pid> -e 'malloc+free'   # Track memory allocations (-e filters join symbols with +)
ltrace -p <pid> -e getenv          # See what env vars it reads

Real Example

# Application exits silently with code 1. No logs.
strace -f ./myapp 2>&1 | tail -30
# Output shows:
#   openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT
#   write(2, "fatal error\n", 12)
# The config file is missing. The app wrote "fatal error" to stderr
# but stderr was redirected to /dev/null in the systemd unit.

# Application connects to database but gets timeout
strace -p <pid> -e trace=network
# Shows: connect(3, {sa_family=AF_INET, sin_port=htons(5432),
#   sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ETIMEDOUT
# The database IP is unreachable. Not a DNS problem -- it resolved fine.

Gotchas

  • strace adds significant overhead (10-100x slowdown). Never leave it attached to a production process longer than necessary. Seconds, not minutes.
  • In containers, you may need SYS_PTRACE capability to attach. Or run strace on the host: nsenter -t <pid> -m -u -i -n -p -- strace -p 1.
  • strace output is overwhelming. Always filter with -e trace= or write to a file and grep.
  • ltrace does not work well with statically compiled binaries (Go, Rust). Use strace instead.
  • For Go applications, strace is useful but GODEBUG environment variables and delve debugger are often more productive.