Linux Deep Triage

Advanced debugging tools for when basic troubleshooting has not found the root cause.

perf: CPU Profiling

When To Use

  • Application is consuming high CPU but you do not know which function or code path
  • You need to identify hot spots in compiled binaries or kernel code
  • Latency spikes that correlate with CPU usage
  • You want data, not guesses, about where CPU time is spent

Quick Start

# See what is burning CPU right now (live, like top for functions)
perf top

# Record a profile for 30 seconds system-wide
perf record -g -a -- sleep 30

# Record a specific process
perf record -g -p $(pgrep myapp) -- sleep 30

# Analyze the recording
perf report
# Navigate with arrow keys. Enter to expand call chains.
# Look at the "Overhead" column -- highest percentage = hottest code path.

# Generate a flame graph (requires Brendan Gregg's FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# Open in a browser. Wide bars = functions spending the most CPU time.
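To automate hot-spot detection (alerting, CI regression checks), the report parses cleanly with awk. A minimal sketch; the sample report text below is invented for illustration, but real perf report --stdio output has the same Overhead-first column layout:

```shell
# Hypothetical perf report --stdio output; layout is
# Overhead  Command  Shared Object  Symbol
sample='
# Overhead  Command  Shared Object      Symbol
    61.20%  python3  python3            [.] _PyEval_EvalFrameDefault
    12.04%  python3  libc-2.31.so       [.] __memmove_avx_unaligned
     4.11%  python3  python3            [.] PyDict_GetItem
'
# Rows are sorted by overhead, so the first data row (first field
# ending in %) names the hottest symbol
hottest=$(printf '%s\n' "$sample" | awk '$1 ~ /%$/ {print $NF; exit}')
echo "$hottest"
```

On a live box, replace the sample with: perf report --stdio | awk '...'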

Real Example

# Application is at 100% CPU. What is it doing?
perf record -g -p $(pgrep python3) -- sleep 10
perf report --stdio | head -40
# Output shows 60% of time in json.loads() -> you have a serialization bottleneck

# For Java: perf sees JIT-compiled frames as "[unknown]"
# Fix: run Java with -XX:+PreserveFramePointer and use perf-map-agent

Gotchas

  • perf record writes to perf.data in the current directory. It can get large (hundreds of MB for long recordings). Use -o /tmp/perf.data to control location.
  • Kernel symbols require /proc/kallsyms to be readable. If you see hex addresses instead of function names: echo 0 > /proc/sys/kernel/kptr_restrict.
  • In containers, perf needs SYS_ADMIN capability or you must run it on the host and filter by PID.

OOM Killer Analysis

When To Use

  • Process was killed unexpectedly with no application-level error
  • dmesg shows "Out of memory" messages
  • Containers restarting with exit code 137
  • System becomes unresponsive periodically then recovers (OOM killed a large process)

Quick Start

# Check for recent OOM events
dmesg -T | grep -i "out of memory\|oom-killer\|killed process"

# Detailed OOM report (dmesg shows full memory state at time of kill)
dmesg -T | grep -A 30 "oom-killer"
# Key fields:
#   "Killed process <PID> (name) total-vm:XXkB, anon-rss:XXkB"
#   anon-rss = actual physical memory used by the process

# Check current memory state
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|Committed_AS"
# MemAvailable = what the kernel thinks is available (includes reclaimable caches)
# Committed_AS = total memory promised to processes (can exceed physical RAM with overcommit)

# Check overcommit settings
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default, allows some overcommit)
# 1 = always overcommit (never refuse malloc)
# 2 = strict (refuse if commit exceeds swap + ratio*physical)
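Whether overcommit pressure is building can be checked by comparing Committed_AS against CommitLimit (both in /proc/meminfo). A sketch that reads meminfo-format text on stdin so it can be tested offline; the fragment piped in below is hypothetical:

```shell
# check_commit: report Committed_AS as a percentage of CommitLimit.
# Live use: check_commit < /proc/meminfo
check_commit() {
    awk '/^CommitLimit:/ {limit=$2} /^Committed_AS:/ {as=$2}
         END {
             pct = (limit > 0) ? int(as * 100 / limit) : 0
             printf "Committed_AS is %d%% of CommitLimit\n", pct
         }'
}

# Hypothetical meminfo fragment (values in kB)
out=$(printf 'CommitLimit:     8000000 kB\nCommitted_AS:    6000000 kB\n' | check_commit)
echo "$out"
```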

Cgroup OOM vs Host OOM

These are different events and require different responses.

# Cgroup OOM (container hit its memory limit):
dmesg -T | grep -i "memory cgroup out of memory"
# The cgroup limit is enforced, not the host running out of memory
# Fix: increase the container memory limit or fix the memory leak

# Check cgroup limits for a running container
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
# Or with cgroup v2:
cat /sys/fs/cgroup/<slice>/memory.max
cat /sys/fs/cgroup/<slice>/memory.current

# Host OOM (entire system out of memory):
# The kernel chooses a victim based on oom_score
# Check which processes the kernel would kill first (highest oom_score).
# ps has no oom_score output field, so read /proc directly:
for p in /proc/[0-9]*; do
    printf '%s %s %s\n' "${p#/proc/}" "$(cat "$p/oom_score" 2>/dev/null)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -k2 -rn | head -10

# Protect a critical process from OOM killer:
echo -1000 > /proc/<pid>/oom_score_adj
# Only do this for truly critical processes (like a database)
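A small guard around that echo avoids writing an out-of-range value. This is a sketch, not a standard tool; the PROC_ROOT parameter is an assumption added only so the function can be exercised against a scratch directory:

```shell
# set_oom_adj PID VALUE [PROC_ROOT] -- write oom_score_adj with range
# checking. Valid range is -1000 (never kill) to 1000 (kill first).
# PROC_ROOT defaults to /proc.
set_oom_adj() {
    pid=$1; val=$2; root=${3:-/proc}
    if [ "$val" -lt -1000 ] || [ "$val" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 1
    fi
    echo "$val" > "$root/$pid/oom_score_adj"
}

# Live use (exempt the oldest mysqld from the OOM killer):
#   set_oom_adj "$(pgrep -o mysqld)" -1000
```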

Gotchas

  • OOM kills are logged in dmesg but NOT always in /var/log/syslog (depends on rsyslog config). Always check dmesg first.
  • A process showing 10GB VSZ (virtual size) is not necessarily a problem. RSS (resident set size) is actual physical memory used.
  • MemFree dropping to near zero does not mean OOM is imminent. The kernel aggressively uses free memory for page cache, which is reclaimable under pressure. Look at MemAvailable, not MemFree.

IO Triage

When To Use

  • System feels sluggish but CPU usage is low
  • top shows high wa (IO wait) percentage
  • Database queries suddenly slow
  • Application log writes are blocking

Quick Start

# Overview: which devices are busy?
iostat -xz 2 5
# Key columns:
#   %util   - how busy the device is (>90% = saturated for spinning disks, less meaningful for SSDs)
#   await   - average IO latency in ms (>10ms on SSD = problem; 10-20ms on HDD = normal seek latency)
#   r/s,w/s - IOPS
#   rkB/s, wkB/s - throughput
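Those await thresholds can be scripted. A sketch that locates the await column by header name, since column positions vary across sysstat versions; the sample table below is simplified (real iostat -x output has more columns):

```shell
# Flag devices whose await exceeds 10ms, from iostat -xz style output.
# Hypothetical, trimmed sample:
sample='Device   r/s   w/s   rkB/s   wkB/s   await   %util
sda      10.0  500.0 120.0   9800.0  45.3    98.0
nvme0n1  800.0 200.0 64000.0 8000.0  0.4     99.0'

flagged=$(printf '%s\n' "$sample" | awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    $col > 10 { print $1, "await=" $col "ms" }')
echo "$flagged"
```

Note the nvme device at 99% util is not flagged: its await is fine, which matches the NVMe gotcha below about %util being misleading.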

# Who is doing the IO?
iotop -oa
# Shows cumulative IO per process. -o = only show processes doing IO.
# Requires root. If not available: use pidstat -d 2

# Check for IO errors
dmesg -T | grep -i "i/o error\|buffer i/o\|read error\|write error"

# Deep dive: block-level tracing
blktrace -d /dev/sda -o - | blkparse -i - | head -100
# Shows every IO operation at the block layer
# Useful for identifying IO patterns (sequential vs random, read vs write)

# Check scheduler and queue depth
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
# For SSDs: "none" or "mq-deadline" is usually best
# For HDDs: "bfq" or "mq-deadline"

Real Example

# Database is slow. Is it IO?
iostat -xz 2 3
# sda shows %util=98%, await=45ms. The disk is saturated.

# What process is causing it?
iotop -oa -t
# mysqld is doing 500 writes/sec. Check slow query log.

# Is it the journal? Swap? Something else?
pidstat -d 2 1 | sort -k5 -rn | head
# One 2-second report (the trailing 1) so the pipeline terminates.
# Sorted on the kB_wr/s column for a per-process write breakdown.

Gotchas

  • %util at 100% for NVMe does NOT mean the device is saturated. NVMe drives handle massive parallelism. Look at await instead.
  • iostat first report shows averages since boot. Ignore it. Use the second and subsequent reports for current state.
  • If iotop is not installed and you cannot install packages: cat /proc/<pid>/io shows per-process IO counters.
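Building on that last gotcha, two snapshots of /proc/&lt;pid&gt;/io can be diffed to get per-process IO rates with no extra packages. A sketch; the snapshot filenames in the usage comment are whatever you choose:

```shell
# io_delta FILE1 FILE2 -- diff two /proc/<pid>/io snapshots and report
# bytes actually read from / written to the block layer in between.
# Live use:
#   cat /proc/$PID/io > a; sleep 5; cat /proc/$PID/io > b; io_delta a b
io_delta() {
    awk '
        FNR == NR { v[$1] = $2; next }
        $1 == "read_bytes:"  { printf "read:  %d bytes\n", $2 - v[$1] }
        $1 == "write_bytes:" { printf "write: %d bytes\n", $2 - v[$1] }
    ' "$1" "$2"
}
```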

systemd Deep Ops

When To Use

  • Services failing to start with unclear error messages
  • Resource isolation between services (CPU, memory, IO)
  • Journal corruption or missing logs
  • Understanding why a cgroup is limiting a process

Quick Start

# Service debugging
systemctl status myapp.service          # Basic state + recent logs
journalctl -u myapp.service -n 50       # Last 50 log lines
journalctl -u myapp.service -p err      # Only errors
systemctl show myapp.service            # ALL properties (hundreds of them)

# Slice hierarchy (cgroup tree for resource management)
systemd-cgls                            # Full cgroup tree
systemd-cgtop                           # Real-time resource usage per cgroup

# Resource accounting: see what each service is consuming
systemctl show myapp.service -p MemoryCurrent,CPUUsageNSec,IPIngressBytes

# Set resource limits on a service (transient, until restart)
systemctl set-property myapp.service MemoryMax=512M CPUQuota=200%
# Persistent: add to the unit file [Service] section:
#   MemoryMax=512M
#   CPUQuota=200%

# cgroup v2 direct inspection
cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
cat /sys/fs/cgroup/system.slice/myapp.service/memory.max
cat /sys/fs/cgroup/system.slice/myapp.service/cpu.stat
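Those two memory files combine into a one-line utilization check. A sketch; note that memory.max contains the literal string "max" when no limit is set, which a naive division would choke on:

```shell
# cg_mem_pct CGROUP_PATH -- memory.current as a percentage of
# memory.max for a cgroup v2 directory, handling the unlimited case.
cg_mem_pct() {
    cur=$(cat "$1/memory.current")
    max=$(cat "$1/memory.max")
    if [ "$max" = "max" ]; then
        echo "unlimited (current: $cur bytes)"
    else
        echo "$((cur * 100 / max))% of $max bytes"
    fi
}

# Live use:
#   cg_mem_pct /sys/fs/cgroup/system.slice/myapp.service
```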

Journal Corruption Recovery

# Symptoms: journalctl shows "Journal file corrupted" or returns no output

# Check journal health
journalctl --verify
# Shows corrupt journal files

# Fix: remove corrupt files and restart
rm /var/log/journal/*/system@*.journal~    # Corrupt files end with ~
rm /var/log/journal/*/user-*.journal~
systemctl restart systemd-journald

# If all journals are corrupt:
rm -rf /var/log/journal/*
systemctl restart systemd-journald
# You lose all historical logs. This is the nuclear option.

# Prevent recurrence: check disk space
df -h /var/log/journal/
# Set journal size limit:
# /etc/systemd/journald.conf:
#   SystemMaxUse=2G
#   SystemKeepFree=1G
systemctl restart systemd-journald
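Before reaching for rm, it helps to list exactly which files journald has marked corrupt (it renames them with a trailing ~) so you can review what would be deleted. A small sketch, not a systemd tool:

```shell
# list_corrupt_journals [DIR] -- list journal files journald has
# renamed aside as corrupt (trailing ~). DIR defaults to
# /var/log/journal; the parameter allows testing on a scratch dir.
list_corrupt_journals() {
    find "${1:-/var/log/journal}" -name '*.journal~' -type f 2>/dev/null
}
```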

Gotchas

  • systemctl restart and systemctl stop && systemctl start are NOT the same. Restart stops the unit (SIGTERM, then SIGKILL after TimeoutStopSec if it has not exited) and starts it again as one operation. Stop followed by start gives you a window to inspect state in between.
  • CPUQuota=200% means 2 full CPU cores (100% per core). This is not a typo.
  • MemoryMax kills the process when exceeded. MemoryHigh throttles it instead (slows down, does not kill). Use MemoryHigh for soft limits.

strace / ltrace: Last-Resort Debugging

When To Use

  • Application fails with no useful log output
  • You need to see exactly what system calls a process is making
  • File permission errors that are not obvious
  • Network connection failures at the syscall level
  • "Works on my machine" problems where environment differences matter

Quick Start

# Trace a running process (attach)
strace -p $(pgrep myapp) -f -tt -o /tmp/strace.out
# -f = follow child processes (threads)
# -tt = microsecond timestamps
# -o = write to file (do not pollute terminal)

# Trace a command from start
strace -f -tt -o /tmp/strace.out myapp --start

# Filter by syscall type (much less noise)
strace -p <pid> -e trace=open,openat,read,write   # File operations
strace -p <pid> -e trace=network                    # Network operations
strace -p <pid> -e trace=file                       # File-related syscalls

# Show a summary of time spent in each syscall
strace -c -p <pid>
# After Ctrl+C, shows a table of syscall counts, errors, and time
# Useful to see if the process is stuck in a specific syscall
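The summary table also parses easily if you want the top offender in a script. The sample output below is invented, but real strace -c output has the same layout and is sorted by % time descending, so the first data row is the most expensive syscall:

```shell
# Hypothetical strace -c summary
sample='% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.10    1.203311        2406       500           poll
 20.05    0.334625          66      5000        12 read
  7.85    0.131002          26      5000           write'

# Skip the two header lines; syscall name is the last field
top=$(printf '%s\n' "$sample" | awk 'NR > 2 { print $NF; exit }')
echo "$top"
```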

# ltrace: same concept but for library calls (libc, etc.)
ltrace -p <pid> -e 'malloc+free'   # Track memory allocations (-e filters join symbols with +)
ltrace -p <pid> -e getenv          # See what env vars it reads

Real Example

# Application exits silently with code 1. No logs.
strace -f ./myapp 2>&1 | tail -30
# Output shows:
#   openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT
#   write(2, "fatal error\n", 12)
# The config file is missing. The app wrote "fatal error" to stderr
# but stderr was redirected to /dev/null in the systemd unit.

# Application connects to database but gets timeout
strace -p <pid> -e trace=network
# Shows: connect(3, {sa_family=AF_INET, sin_port=htons(5432),
#   sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ETIMEDOUT
# The database IP is unreachable. Not a DNS problem -- it resolved fine.

Gotchas

  • strace adds significant overhead (10-100x slowdown). Never leave it attached to a production process longer than necessary. Seconds, not minutes.
  • In containers, you may need SYS_PTRACE capability to attach. Or run strace on the host: nsenter -t <pid> -m -u -i -n -p -- strace -p 1.
  • strace output is overwhelming. Always filter with -e trace= or write to a file and grep.
  • ltrace does not work well with statically compiled binaries (Go, Rust). Use strace instead.
  • For Go applications, strace is useful but GODEBUG environment variables and delve debugger are often more productive.