strace — Street-Level Ops¶
Quick Diagnosis Commands¶
# Trace a command from start (basic)
strace ls /tmp 2>&1 | tail -20
# Attach to a running process (follow forks, timestamps, durations)
sudo strace -fp $(pgrep -f myapp) -tt -T -o /tmp/trace.log
# Only file operations (most common for "file not found" issues)
strace -e trace=file ./myapp 2>&1 | grep -E 'ENOENT|EACCES'
# Only network operations (connection issues)
strace -e trace=network ./myapp 2>&1 | grep -E 'connect|ECONNREFUSED|ETIMEDOUT'
# Syscall summary (like a profiler for syscalls)
strace -c ./myapp 2>&1
# Shows: time%, seconds, usecs/call, calls, errors, syscall
# Annotate file descriptors with paths (-y flag)
strace -y -e trace=read,write -p $(pgrep -f myapp) 2>&1 | head -20
# read(3</etc/hosts>, "...", 4096) = 127
# Instead of: read(3, "...", 4096) = 127
# Trace only specific syscalls
strace -e trace=openat,connect,write ./myapp 2>&1
# Follow forks and write one file per thread
strace -ff -o /tmp/trace -p $(pgrep -f myapp)
# Creates: /tmp/trace.12345, /tmp/trace.12346, ...
# Show string arguments with more characters (default truncates at 32)
strace -s 256 -e trace=write -p $(pgrep -f myapp) 2>&1 | head -20
# Quick check: what config files does a program read?
strace -e trace=openat ./myapp 2>&1 | grep -v ENOENT | grep -E '\.(conf|cfg|yaml|yml|ini|json)'
Gotcha: Process Fails Silently — No Useful Logs¶
Symptom: A service starts and immediately exits with code 1. The log file is empty or says only "startup failed". No stack trace, no error message.
Rule: strace shows what the process actually tried to do at the kernel level. Missing files, permission denials, and failed connections all leave clear syscall evidence.
One-liner: When an app fails silently, strace is your black box recorder. The error codes tell the whole story: ENOENT = file missing, EACCES = permission denied, ECONNREFUSED = backend down.
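The three error codes above cover most silent failures, so they are worth tallying in one pass. A small helper (hypothetical, not an strace feature) that counts the classic errnos in a saved trace log:

```shell
# trace_errs: count the classic failure errnos in a saved strace log.
# Usage: trace_errs /tmp/trace.log
trace_errs() {
    for err in ENOENT EACCES ECONNREFUSED ETIMEDOUT; do
        # grep -c prints 0 (and exits non-zero) when there are no matches
        printf '%-14s %s\n' "$err" "$(grep -c "$err" "$1")"
    done
}
```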
# Step 1: Run the service under strace, capture everything
strace -f -tt -T -o /tmp/trace.log ./myapp start
# Step 2: Grep for the most common failure patterns
# Missing file:
grep ENOENT /tmp/trace.log
# openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT
# Permission denied:
grep EACCES /tmp/trace.log
# openat(AT_FDCWD, "/var/run/myapp.sock", O_RDWR) = -1 EACCES
# Connection refused (backend service down):
grep ECONNREFUSED /tmp/trace.log
# connect(4, {sa_family=AF_INET, sin_port=htons(5432),
# sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ECONNREFUSED
# DNS resolution failure (getaddrinfo is a libc call, so grep for its
# syscall footprint instead — the resolver config files and port 53):
grep -E 'resolv|hosts|htons\(53\)' /tmp/trace.log
# Step 3: Look at the last 20 syscalls before the process exited
tail -20 /tmp/trace.log
# The exit_group() call is the last line — work backwards from there
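Step 3 can be scripted. A sketch (assuming the log was captured as in Step 1) that prints the syscalls immediately preceding the first exit_group():

```shell
# last_before_exit: show the syscalls leading up to exit_group() in a trace log.
# Usage: last_before_exit /tmp/trace.log [lines-before]
last_before_exit() {
    n="${2:-10}"
    # line number of the first exit_group() call
    ln=$(grep -n 'exit_group' "$1" | head -1 | cut -d: -f1)
    [ -n "$ln" ] || { echo "no exit_group() found" >&2; return 1; }
    start=$(( ln > n ? ln - n : 1 ))
    sed -n "${start},${ln}p" "$1"
}
```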
Gotcha: Service Hangs — Not Crashing, Not Responding¶
Symptom: Process is running (shows in ps) but not responding to requests. CPU is near zero. It is not crashed — it is stuck.
# Step 1: Attach strace and see what it is blocked on
sudo strace -p $(pgrep -f myapp) 2>&1 | head -5
# If you see a single syscall and nothing else → that is what it is stuck on
# Common "stuck on" patterns:
# futex(0x..., FUTEX_WAIT, ...) → lock contention / deadlock
# read(5, ^C → waiting for network data (slow upstream)
# connect(4, {AF_INET, 10.0.1.50:5432}, ...) → TCP handshake hanging (firewall?)
# epoll_wait(3, [], 1024, -1) → event loop idle (normal if no traffic)
# select(0, NULL, NULL, NULL, {tv_sec=300}) → sleeping in a timer
# Step 2: If stuck on a network read, identify the remote end
sudo strace -yyp $(pgrep -f myapp) 2>&1 | head -5
# (-yy decodes socket endpoints; plain -y shows only socket:[inode])
# read(5<TCP:[10.0.1.1:42318->10.0.1.50:5432]>, ^C
# → Stuck reading from PostgreSQL at 10.0.1.50
# Step 3: If stuck on futex (possible deadlock)
# Get thread info:
sudo strace -fp $(pgrep -f myapp) 2>&1 | head -20
# Multiple threads all in futex(FUTEX_WAIT) = possible deadlock
# Use gdb for proper deadlock analysis at this point
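The "stuck on" patterns above can be folded into a rough classifier. This is a sketch, not an exhaustive mapping; it reads a single strace line on stdin and names the likely cause:

```shell
# classify_block: read one strace line on stdin, name the likely hang cause.
# Example: sudo timeout 2 strace -p PID 2>&1 | tail -1 | classify_block
classify_block() {
    case "$(cat)" in
        futex\(*)                        echo "lock contention / possible deadlock" ;;
        read\(*|recv*)                   echo "waiting for data (slow peer?)" ;;
        connect\(*)                      echo "TCP handshake hanging (firewall/routing?)" ;;
        epoll_wait\(*|select\(*|poll\(*) echo "event loop or timer wait (often normal)" ;;
        *)                               echo "unrecognized: inspect manually" ;;
    esac
}
```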
Debug clue: If strace shows epoll_wait(3, [], 1024, -1) repeating with no other activity, the process is idle and waiting for incoming connections. This is normal for an idle web server. If you expected it to be busy, check that traffic is actually reaching the process (wrong port? load balancer not routing?).
Pattern: Tracing What Config Files a Program Reads¶
When you do not know where a program finds its config, strace tells you directly.
# Option 1: Show all file opens
strace -e trace=openat ./myapp 2>&1 | grep -v ENOENT
# Shows every file successfully opened
# Option 2: Show the search path (including files it tried and failed)
strace -e trace=openat ./myapp 2>&1 | grep -E '\.(conf|yaml|json|ini|cfg|env)'
# openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT
# openat(AT_FDCWD, "/home/app/.myapp/config.yaml", O_RDONLY) = -1 ENOENT
# openat(AT_FDCWD, "./config.yaml", O_RDONLY) = 3
# ← Found it: using ./config.yaml
# Option 3: See what a command reads from a specific file
strace -e trace=openat,read -s 1024 ./myapp 2>&1 | grep -A1 'config'
# Shows both the open() and the read() contents
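Option 2 can be made self-annotating. A sketch that marks each config-looking open as a hit or a miss, in the order the program tried them:

```shell
# config_search: show config-looking opens in order, marking hits and misses.
# Usage: strace -e trace=openat ./myapp 2>&1 | config_search
config_search() {
    grep -E 'openat.*\.(conf|cfg|yaml|yml|ini|json)' | while IFS= read -r line; do
        # pull the quoted path out of the openat() call
        path=$(printf '%s\n' "$line" | sed -n 's/.*openat([^,]*, "\([^"]*\)".*/\1/p')
        case "$line" in
            *ENOENT*) echo "MISS $path" ;;
            *)        echo "HIT  $path" ;;
        esac
    done
}
```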
Pattern: Debugging DNS Resolution Issues¶
# Trace DNS resolution for a command
strace -e trace=openat,connect,sendto,recvfrom -f \
curl -s http://api.example.com/ 2>&1 | grep -E 'resolv|53|hosts'
# What you will see:
# openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY) = 3
# openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
# connect(4, {AF_INET, sin_port=htons(53), sin_addr=10.0.0.2}, ...) = 0
# sendto(4, "\x12\x34\x01\x00...", 33, ...) = 33 ← DNS query
# recvfrom(4, "\x12\x34\x81\x80...", 512, ...) = 45 ← DNS response
# If DNS is slow, add timing:
strace -tt -T -e trace=connect,sendto,recvfrom -f \
curl -s http://api.example.com/ 2>&1 | grep -E 'htons\(53\)|sendto|recvfrom'
# recvfrom(4, ...) = 45 <5.001234>
# ← 5-second delay on DNS response = DNS server is slow or unreachable
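Since -T appends the duration in angle brackets at the end of each line, the slow-DNS filter generalizes to any syscall. A sketch (assumes no -y/-yy decorations, which would add extra `<` characters to the line):

```shell
# slow_calls_on: print lines matching a pattern whose -T duration exceeds
# a threshold in seconds.
# Usage: slow_calls_on recvfrom 1.0 < /tmp/trace.log
slow_calls_on() {
    awk -v pat="$1" -v thr="${2:-1}" -F'<' \
        '$0 ~ pat && NF > 1 { d = $NF; sub(/>$/, "", d); if (d + 0 > thr) print d "s  " $1 }'
}
```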
Pattern: Tracing Container Processes from the Host¶
# Step 1: Find the host PID
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# Or for Kubernetes:
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
# Step 2: Trace from the host
sudo strace -fp $PID -e trace=network -tt -T -o /tmp/container_trace.log
# Step 3: If you need to trace inside the container (no host access)
# The container needs SYS_PTRACE capability:
# securityContext:
# capabilities:
# add: ["SYS_PTRACE"]
# Or use an ephemeral debug container (K8s 1.23+):
kubectl debug -it pod/myapp --image=nicolaka/netshoot --target=myapp
# Inside: strace -p 1 -e trace=network
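The PID lookup in Step 1 can be wrapped in one helper. This assumes docker, crictl, and jq are on the host PATH:

```shell
# container_pid: resolve the host PID of a container's main process.
# Usage: container_pid docker mycontainer   or   container_pid crictl myapp
container_pid() {
    case "$1" in
        docker) docker inspect --format '{{.State.Pid}}' "$2" ;;
        crictl) crictl inspect "$(crictl ps --name "$2" -q)" | jq '.info.pid' ;;
        *)      echo "usage: container_pid docker|crictl NAME" >&2; return 1 ;;
    esac
}
# Then: sudo strace -fp "$(container_pid docker mycontainer)" -e trace=network
```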
Pattern: Finding Why a Service Takes Long to Start¶
# Trace startup with timing
strace -f -tt -T -o /tmp/startup.log ./myapp start
# Find the slowest individual syscalls
sort -t'<' -k2 -n -r /tmp/startup.log | head -10
# Lines ending with <5.001234> = 5 seconds spent in that syscall
# Find slow DNS resolution during startup
grep -E 'connect.*htons\(53\)|recvfrom' /tmp/startup.log | grep -v '<0\.'
# Anything > 1 second is suspicious
# Find slow file operations
grep -E 'openat|read|stat' /tmp/startup.log | \
awk -F'<' '{if ($2+0 > 0.1) print}' | head -10
# Common startup delays:
# 1. DNS resolution timeout (5s per failed lookup)
# 2. Connecting to a database that is not ready
# 3. Reading from a slow NFS mount
# 4. Waiting on a lock file from a previous instance
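To see where the total time went rather than individual slow calls, the -T durations can be summed per syscall. A sketch (assumes a -f -tt -T log without -y decorations; unfinished/resumed lines are counted imprecisely):

```shell
# time_per_syscall: total -T time spent per syscall name in a trace log.
# Usage: time_per_syscall /tmp/startup.log
time_per_syscall() {
    awk -F'<' 'NF > 1 {
        # isolate the syscall name: drop everything from "(" on,
        # then drop any PID/timestamp prefix before the last space
        name = $1; sub(/\(.*/, "", name); sub(/.* /, "", name)
        d = $NF; sub(/>$/, "", d)
        t[name] += d + 0
    }
    END { for (n in t) printf "%10.6f  %s\n", t[n], n }' "$1" | sort -rn
}
```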
Pattern: Tracing Only Errors¶
# Show only failed syscalls (return value = -1)
strace -Z -e trace=file ./myapp 2>&1
# -Z flag (strace 5.2+): only show syscalls that returned an error
# (uppercase Z = show only failed calls, lowercase z = show only successful calls)
# Older strace versions: post-filter
strace -e trace=file ./myapp 2>&1 | grep ' = -1 '
# Count errors by type
strace -c -Z ./myapp 2>&1
# Shows only syscalls that had errors, with counts
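On older strace versions, the post-filter can also tally failures by errno. A sketch over a saved trace log:

```shell
# err_counts: tally failed syscalls by errno in a saved trace log.
# Usage: err_counts /tmp/trace.log
err_counts() {
    # pull out the " = -1 ERRNO" tail of each failed call, then count per errno
    grep -o ' = -1 E[A-Z]*' "$1" | awk '{print $3}' | sort | uniq -c | sort -rn
}
```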
Useful One-Liners¶
# What shared libraries does a program load?
strace -e trace=openat ./myapp 2>&1 | grep '\.so'
# Is a process writing to stderr? (fd 2)
strace -e trace=write -p $(pgrep -f myapp) 2>&1 | grep 'write(2,'
# What environment variables does a program check? (getenv() reads process
# memory, not the kernel, so strace cannot see it — use ltrace instead)
ltrace -e getenv ./myapp 2>&1 | head -20
# How many syscalls per second is a process making?
timeout 5 strace -cp $(pgrep -f myapp) 2>&1 | tail -1
# Divide "calls" by 5 for per-second rate
# Trace file descriptor lifecycle for a specific connection
strace -yy -e trace=socket,connect,read,write,close \
    -p $(pgrep -f myapp) 2>&1 | grep 'TCP'
# (-yy is needed for the TCP:[...] endpoint annotation; -y alone shows socket:[inode])
# Find what is writing to a specific file
strace -e trace=openat,write -fp $(pgrep -f myapp) 2>&1 \
| grep '/var/log/myapp'
# Compare syscall profiles of two commands
strace -c ./old_version 2>&1 >/dev/null
strace -c ./new_version 2>&1 >/dev/null
# (the redirection order matters: 2>&1 before >/dev/null keeps strace's
# summary, which goes to stderr, on the terminal while discarding the
# program's own stdout)
# Compare the summary tables for regressions
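Eyeballing two summary tables gets tedious. A sketch that diffs the "calls" column of two saved `strace -c` summaries (assumes the standard column layout: %time, seconds, usecs/call, calls, optional errors, syscall):

```shell
# compare_calls: show per-syscall call-count changes between two saved
# `strace -c` summaries.
# Usage: strace -c -o old.txt ./old_version; strace -c -o new.txt ./new_version
#        compare_calls old.txt new.txt
compare_calls() {
    extract() {
        # keep only data rows: numeric "calls" in $4, syscall name in last field
        awk '$4 ~ /^[0-9]+$/ && $NF ~ /^[a-z_0-9]+$/ { print $NF, $4 }' "$1" | sort
    }
    extract "$1" > /tmp/gr_old.$$ ; extract "$2" > /tmp/gr_new.$$
    join /tmp/gr_old.$$ /tmp/gr_new.$$ | awk '$2 != $3 { print $1 ": " $2 " -> " $3 }'
    rm -f /tmp/gr_old.$$ /tmp/gr_new.$$
}
```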
Gotcha: strace adds significant overhead (10-100x slowdown) because it uses ptrace to intercept every syscall. Never leave strace attached to a production process longer than needed. For continuous monitoring, use eBPF-based tools like bpftrace or perf, which have near-zero overhead.
Remember: The most useful strace flags, in order of importance:
-f (follow forks), -p PID (attach), -e trace=file or network (filter), -y (show fd paths), -T (show duration). Mnemonic: FPE-YT — Follow Process Events, Yes with Timing.