strace - Primer
Why This Matters
When a process fails and the logs say nothing useful, you need a way to see what the program is actually doing. strace shows you every system call a process makes — every file it opens, every network connection it attempts, every permission check that fails. It works on any language, any binary, without source code or recompilation.
For DevOps and SRE work, strace is one of the fastest ways to answer "why is this failing?" in production. A missing config file, a connection refused, a permission denied — these all leave clear syscall-level evidence that strace makes visible in seconds. Once you learn to read strace output, you stop guessing and start seeing.
The skill compounds: the same patterns recur across services written in Python, Go, Java, or C. The kernel interface is universal, so learning strace once applies everywhere.
Core Concepts
1. System Calls: The Kernel Boundary
Programs cannot touch files, networks, or hardware directly. They ask the kernel via
system calls (syscalls). strace intercepts these calls using the ptrace mechanism
and prints each one with its arguments and return value.
open("/etc/app/config.yaml", O_RDONLY) = 3
read(3, "database:\n host: db.internal\n", 4096) = 31
close(3) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(5432),
sin_addr=inet_addr("10.0.1.50")}, 16) = 0
Each line tells you: what was requested, on what target, and whether it succeeded.
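The same three-part reading (what was requested, on what target, whether it succeeded) can be applied mechanically. The following is a minimal sketch, assuming the default output format with one complete call per line; the helper name `summarize_calls` is hypothetical, and the sample input reuses lines from this primer.

```shell
# Hypothetical helper: print "syscall<TAB>ok|FAIL<TAB>result" for each
# complete strace line on stdin (default output format, one call per line).
summarize_calls() {
  awk '
    /^[a-z_0-9]+\(.*\) += / {
      name = $0;   sub(/\(.*/, "", name)          # text before the first "("
      result = $0; sub(/^.*\) += +/, "", result)  # text after the last ") = "
      status = (result ~ /^-1 /) ? "FAIL" : "ok"  # failures show -1 plus an errno name
      printf "%s\t%s\t%s\n", name, status, result
    }'
}

# Sample input taken from the examples in this primer.
summarize_calls <<'EOF'
open("/etc/app/config.yaml", O_RDONLY) = 3
close(3) = 0
openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
EOF
```

The sketch ignores unfinished or resumed lines, so it is best suited to single-threaded traces.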
Fun fact: strace uses the ptrace() system call (process trace), which has been part of Unix since the 1970s. The same mechanism is used by debuggers like GDB. ptrace pauses the traced process at every syscall boundary, which is why strace can impose significant overhead (a 10-100x slowdown on syscall-heavy workloads). Never run strace on a production process under heavy load without understanding this cost.
Remember: The three most diagnostic strace error codes are ENOENT (No such file or directory: a missing file), EACCES (Permission denied: wrong user or permissions), and ECONNREFUSED (Connection refused: service not listening). These three cover roughly 80% of what you are looking for when stracing a failing process.
Who made it: strace was originally written by Paul Kranenburg for SunOS in 1991. It was ported to Linux by Branko Lankester, who also wrote the kernel support. Rick Sladkey merged the SunOS and Linux versions in 1993, adding features from SVR4's truss(1). Since 2009, strace has been maintained by Dmitry Levin. The tool has been continuously developed for over 30 years, a testament to how fundamental syscall tracing is for debugging.
2. Essential Syscalls for DevOps
| Syscall | What It Does | What to Look For |
|---|---|---|
| open / openat | Open a file | Missing files (ENOENT), permissions (EACCES) |
| read | Read bytes from a file descriptor | Slow reads, unexpected EOF |
| write | Write bytes to a file descriptor | Errors on fd 2 = stderr |
| connect | Initiate a TCP/UDP connection | ECONNREFUSED, ETIMEDOUT |
| execve | Replace process with a new program | Missing binaries, wrong PATH |
| stat / fstat | Check file metadata | File existence checks |
| mmap | Map files/memory into address space | Shared library loading |
| clone | Create child process/thread | Fork storms, thread creation |
3. Basic Usage
Run a command under strace:
# Trace a command, output to terminal
strace ls /tmp
# Save output to a file (recommended — output can be large)
strace -o trace.log ls /tmp
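Attach to a running process:
# Attach by PID (may require sudo)
strace -p <pid>
# Follow threads and children too, saving output to a file
strace -f -p <pid> -o trace.log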
4. Filtering with -e trace=
Raw strace output is noisy. Filter to the category you care about:
# Only file operations (open, read, write, close, stat, etc.)
strace -e trace=file ./myapp
# Only network operations (connect, accept, sendto, recvfrom, etc.)
strace -e trace=network ./myapp
# Only process operations (execve, clone, fork, wait, etc.)
strace -e trace=process ./myapp
# Only specific syscalls (modern libc opens files via openat, so list it rather than open)
strace -e trace=openat,connect,write ./myapp
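When a full trace has already been saved and the failure cannot easily be reproduced, similar filtering can be applied after the fact. This is a rough sketch, assuming one call per line with the optional PID and timestamp columns that -f and -tt add; `filter_trace` is a hypothetical helper name and the sample lines are illustrative.

```shell
# Hypothetical helper: keep only lines for the named syscalls (comma-separated),
# allowing for the optional PID and timestamp columns added by -f and -tt.
filter_trace() {
  names=$(printf '%s' "$1" | tr ',' '|')
  grep -E "^([0-9]+ +)?([0-9:.]+ +)?(${names})\\("
}

# Illustrative sample lines in "PID timestamp syscall" layout.
filter_trace openat,connect <<'EOF'
12345 14:23:01.443218 openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
12345 14:23:01.443901 read(3, "", 4096) = 0
12346 14:23:01.450112 connect(3, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ECONNREFUSED (Connection refused)
EOF
```

Names are matched exactly, so asking for `open` does not return `openat` lines. This mirrors how a list of individual syscall names behaves in -e trace=.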
5. Timing Flags
# Wall-clock timestamp on each line (-tt = microsecond precision)
strace -tt ./myapp
# 14:23:01.443218 connect(3, ...) = -1 ETIMEDOUT
# Time spent inside each syscall (-T suffix on each line)
strace -T ./myapp
# connect(3, ...) = 0 <0.003412>
# Both together — the most useful combo for latency debugging
strace -tt -T ./myapp
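Once a trace carries -T durations, the slow calls can be pulled out directly. A small sketch, assuming each line ends with the `<seconds>` suffix that -T adds; `slow_calls` is a hypothetical helper and the durations in the sample are made up for illustration.

```shell
# Hypothetical helper: print trace lines whose -T duration exceeds a
# threshold in seconds (default 0.1).
slow_calls() {
  awk -v limit="${1:-0.1}" '
    match($0, /<[0-9]+\.[0-9]+>$/) {
      secs = substr($0, RSTART + 1, RLENGTH - 2) + 0   # strip the angle brackets
      if (secs > limit) print
    }'
}

# Illustrative sample: only the second call is slower than 0.5s.
slow_calls 0.5 <<'EOF'
14:23:01.443218 connect(3, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.50")}, 16) = 0 <0.003412>
14:23:01.446700 read(3, "database:\n host: db.internal\n", 4096) = 31 <2.104377>
EOF
```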
One-liner: strace -c ./myapp gives a summary table of syscall counts, errors, and cumulative time, like a profiler for system calls. By default the time column counts system CPU time; add -w to summarize wall-clock time so that blocked calls show up. If you see 90% of time in futex(), the app is likely contention-bound. If 90% is in read() or write(), it is I/O-bound. This one flag turns strace from a debugging tool into a quick performance profiler.
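A similar summary can be approximated from a trace that was already saved with -T. This sketch assumes one complete call per line and skips resumed lines; `syscall_totals` is a hypothetical helper, and it is not a substitute for the real -c output.

```shell
# Hypothetical helper: per-syscall total time and call count from saved
# `strace -T` output, sorted by total time (largest first).
syscall_totals() {
  awk '
    $0 !~ /resumed>/ && match($0, /<[0-9]+\.[0-9]+>$/) {
      secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
      name = $0; sub(/\(.*/, "", name)   # keep text before the first "("
      sub(/^.* /, "", name)              # drop PID/timestamp columns if present
      count[name]++; total[name] += secs
    }
    END { for (n in count) printf "%.6f %d %s\n", total[n], count[n], n }
  ' | sort -rn
}

# Illustrative sample data.
syscall_totals <<'EOF'
futex(0x7f00, FUTEX_WAIT, 2, NULL) = 0 <0.900000>
read(3, "", 4096) = 0 <0.050000>
futex(0x7f00, FUTEX_WAIT, 2, NULL) = 0 <0.100000>
EOF
```

Each output line is `total_seconds count syscall`, measured in wall-clock time because -T records elapsed time per call.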
6. Error Pattern Recognition
These patterns appear constantly in production debugging:
# File not found — missing config, missing binary, wrong path
openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT
# Permission denied — wrong owner, missing read bit, SELinux
openat(AT_FDCWD, "/var/run/app.sock", O_RDWR) = -1 EACCES
# Connection refused — target service is down or wrong port
connect(3, {sa_family=AF_INET, sin_port=htons(5432),
sin_addr=inet_addr("10.0.1.50")}, 16) = -1 ECONNREFUSED
# Connection timed out — firewall, network partition, host unreachable
connect(3, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("203.0.113.10")}, 16) = -1 ETIMEDOUT
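A quick way to see which of these patterns dominates a trace is to tally the errno names. A minimal sketch, assuming failures appear as `= -1 ENAME` as in the lines above; `errno_tally` is a hypothetical helper name.

```shell
# Hypothetical helper: count failed syscalls by errno name, most frequent first.
errno_tally() {
  grep -oE '= -1 E[A-Z0-9]+' | awk '{ print $3 }' | sort | uniq -c | sort -rn
}

# Sample input built from the patterns above.
errno_tally <<'EOF'
openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT
openat(AT_FDCWD, "/var/run/app.sock", O_RDWR) = -1 EACCES
openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT
close(3) = 0
EOF
```

Many ENOENT results are harmless (library and locale search paths), so the tally is a starting point for reading, not a verdict.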
7. Practical Debugging Workflow
1. Reproduce or attach
- strace -o trace.log -f -tt -T <command>
- strace -o trace.log -f -tt -T -p <pid>
2. Filter to the problem domain
- File issues: grep -E 'ENOENT|EACCES' trace.log
- Network issues: grep -E 'connect|ECONNREFUSED|ETIMEDOUT' trace.log
- Exec failures: grep 'execve' trace.log
3. Read around the error
- Look at the 5-10 lines before the error for context
- Note the file descriptor numbers to trace a connection's lifecycle
4. Correlate with timestamps
- Long gaps between syscalls = process blocked on something
- Cluster of fast failures = retry loop
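Step 4 can be partly automated. The sketch below flags long gaps between consecutive lines of -tt output; it assumes the log is read as a single stream (per-thread files are more accurate for threaded programs) and does not handle traces that cross midnight. `find_gaps` is a hypothetical helper name.

```shell
# Hypothetical helper: report gaps longer than N seconds (default 1)
# between consecutive lines of `strace -tt` output.
find_gaps() {
  awk -v limit="${1:-1}" '
    {
      ts = ($1 ~ /^[0-9]+$/) ? $2 : $1     # skip a leading PID column (-f with -o)
      if (ts !~ /^[0-9]+:[0-9]+:[0-9]+\.[0-9]+$/) next
      split(ts, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3] # seconds since midnight
      if (seen && now - prev > limit)
        printf "gap of %.3fs before: %s\n", now - prev, $0
      prev = now; seen = 1
    }'
}

# Illustrative sample: a 5 second pause before the read.
find_gaps 2 <<'EOF'
14:23:01.000000 connect(3, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.50")}, 16) = 0
14:23:01.100000 write(3, "ping\n", 5) = 5
14:23:06.100000 read(3, "pong\n", 4096) = 5
EOF
```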
8. Multi-Threaded Programs and -ff
By default, strace with -f (follow forks) interleaves output from all
threads into one stream, making it hard to follow any single thread's
logic. The -ff flag, combined with -o, solves this:
# One output file per thread/child: prefix.PID for each
strace -ff -o /tmp/trace ./myapp
# Produces: /tmp/trace.12345, /tmp/trace.12346, /tmp/trace.12347, ...
Each file contains only that thread's syscalls in order. This is essential for debugging multi-threaded services — without it, you are reading a shuffled deck of cards from multiple conversations.
Use -ff when: the target has more than 2-3 threads, or when you need
to trace one specific thread's behavior through a sequence of calls.
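With one file per thread, a first pass is to find which thread is failing the most. A small sketch, assuming files named `PREFIX.PID` as produced by -ff -o; `failures_per_thread` is a hypothetical helper, and the demo files stand in for real output.

```shell
# Hypothetical helper: for each PREFIX.PID file, print "failed_calls file",
# with the noisiest thread first.
failures_per_thread() {
  for f in "$1".*; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(grep -c -- '= -1 E' "$f")" "$f"
  done | sort -rn
}

# Demo with two small sample files standing in for real strace -ff output.
demo=$(mktemp -d)
printf '%s\n' 'openat(AT_FDCWD, "/etc/app/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)' > "$demo/trace.12345"
printf '%s\n' 'close(3) = 0' > "$demo/trace.12346"
failures_per_thread "$demo/trace"
rm -rf "$demo"
```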
9. Containers and Production
strace works the same way in containers — it traces syscalls via ptrace. The complication is finding the right PID and having the right permissions.
The fundamental pattern: strace runs on the host, targeting the
host-namespace PID of the container process. This is the same
strace -p PID you would use anywhere — the container is irrelevant
at the kernel level.
# Step 1: Find the host PID of the container's main process
# Docker:
PID=$(docker inspect --format '{{.State.Pid}}' <container>)
# containerd/crictl:
PID=$(crictl inspect <container-id> | jq '.info.pid')
# Step 2: Trace from the host (same as any process)
sudo strace -fp $PID -e trace=network -o /tmp/trace.log
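To confirm that a host PID really maps to the process you expect inside the container, the `NSpid:` line in `/proc/<pid>/status` lists the PID as seen in each PID namespace, outermost first. A minimal sketch, assuming a kernel recent enough to expose that field; `ns_pids` is a hypothetical helper and the demo input is sample text, not real output.

```shell
# Hypothetical helper: read /proc/<pid>/status text on stdin and print
# "host_pid innermost_namespace_pid".
ns_pids() {
  awk '$1 == "NSpid:" { print $2, $NF }'
}

# Demo on sample text; on a real host: ns_pids < "/proc/$PID/status"
printf 'Name:\tapp\nNSpid:\t12345\t1\n' | ns_pids
```

A second value of 1 indicates the container's main process, which is the one `strace -p 1` would target from inside the container.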
When you cannot trace from the host (shared hosting, managed k8s without node access), you need strace inside the container:
# Option A: Container has strace and SYS_PTRACE capability
# (add SYS_PTRACE to securityContext.capabilities.add in the pod spec)
kubectl exec -it <pod> -- strace -p 1 -e trace=file
# Option B: Ephemeral debug container (K8s 1.23+)
kubectl debug -it <pod> --image=ubuntu --target=<container>
# Inside: apt-get update && apt-get install -y strace
# Then: strace -p 1 -e trace=network
Production overhead considerations:
- ptrace stops the target process on every traced syscall, so the cost grows with the syscall rate; on a service doing 50K syscalls/sec this is measurable (5-20% overhead, and considerably more on syscall-heavy workloads)
- Always filter with -e trace= to cut the volume of calls that strace decodes and logs
- Always write to a file with -o, because terminal output adds extra latency
- Keep trace duration short: 10-30 seconds is usually enough
- For lower overhead, use perf trace instead (it uses kernel tracepoints rather than ptrace, ~2-5x less overhead, but less argument detail)
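The filtering, file output, and short-duration advice can be combined in one wrapper. This is a sketch under stated assumptions: `bounded_trace` and the `STRACE_BIN` override are hypothetical names (the override exists only so the command can be dry-run), `timeout` is assumed to be the coreutils utility, and attaching for real usually needs root.

```shell
# Hypothetical wrapper: attach for a bounded time with a narrow filter,
# writing to a file. STRACE_BIN is an assumed override for dry runs.
bounded_trace() {
  pid=$1; seconds=${2:-20}; out=${3:-/tmp/trace.log}
  # timeout sends SIGTERM when the limit expires; strace then detaches
  # and leaves the target process running.
  timeout "$seconds" "${STRACE_BIN:-strace}" -f -p "$pid" -e trace=network -o "$out"
}

# Dry run: print the strace arguments instead of attaching to anything.
STRACE_BIN=echo bounded_trace 12345 20 /tmp/trace.log
```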
What Experienced People Know
- grep for ENOENT first: missing files cause a huge proportion of failures
- The -c flag gives a syscall summary with counts and time, useful for quick profiling before diving into full traces
- File descriptor numbers are stable while a descriptor stays open (they are reused after close()), so you can follow fd 3 from connect() through write() and read() to trace one connection's lifecycle
- strace -e trace=file catches openat, stat, access, and related calls, not just open; the filter categories are broader than individual syscall names
- Repeated futex(FUTEX_WAIT, ...) or epoll_wait(...) in output usually means the process is idle or blocked, not broken; that is normal event-loop behavior
- When a process hangs, attach with strace -p and the last syscall shown is usually exactly what it is stuck on (DNS lookup, lock acquisition, network read)
- strace works on statically linked Go binaries, containers, and JVM processes equally well, because it operates at the kernel boundary, below all language runtimes
- Use -y to annotate file descriptors with paths (e.g., read(3</etc/hosts>)), which saves manual fd-to-file mapping
- Use -ff -o prefix when tracing multi-threaded programs: it writes one file per thread, making output far easier to read
Interview tip: "How would you debug a process that hangs with no log output?" is a classic interview question. The strongest answer: "Attach with strace -p <pid>; the last syscall shown is exactly what the process is stuck on. If it is connect(), it is waiting on a network connection. If it is read() on a socket fd, it is waiting for a response. If it is futex(FUTEX_WAIT), it is waiting on a lock." This shows you understand the kernel boundary and can diagnose without application-level tooling.
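The reasoning in that answer can be sketched as a small classifier over the last line of a trace. It assumes the last line names the blocked call, possibly unfinished; `classify_hang` is a hypothetical helper, and the substring matching is deliberately crude.

```shell
# Hypothetical helper: classify the last line of a trace taken from a
# hung process, following the rules of thumb described above.
classify_hang() {
  last=$(tail -n 1)
  case "$last" in
    *"connect("*)            echo "waiting on a network connection" ;;
    *"futex("*FUTEX_WAIT*)   echo "waiting on a lock" ;;
    *"read("*|*"recvfrom("*) echo "waiting for a response or more input" ;;
    *"epoll_wait("*)         echo "idle in an event loop" ;;
    *)                       echo "unclassified: $last" ;;
  esac
}

# Illustrative sample: the process wrote a request and is blocked in read().
printf '%s\n' 'write(3, "GET / HTTP/1.1\r\n", 16) = 16' 'read(3, ' | classify_hang
```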