strace: Reading the Matrix
Topics: system calls, strace, debugging, file I/O, network, process lifecycle
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic Linux command line
The Mission
Your application is stuck. It's not crashing — the process is running, CPU is low, no errors in the log. It's just... doing nothing. Or it's slow and you can't figure out why. The application logs tell you nothing because the problem is below the application — in the system calls between your code and the kernel.
strace lets you see every system call a process makes: every file it opens, every byte
it reads and writes, every network connection it creates, every signal it receives. It's
like reading the conversation between your application and the operating system.
What System Calls Are
Your application runs in userspace. It can't directly access hardware, files, or the network. For that, it asks the kernel via system calls (syscalls).
Your code: open("config.yaml", O_RDONLY)
↓ trap to kernel
Kernel: Find file on disk, check permissions, allocate file descriptor
↓ return to userspace
Your code: fd = 3 (success)
Every I/O operation, every file access, every network connection, every process creation
goes through a syscall. strace intercepts and prints them all.
# See what syscalls a command makes
strace ls /tmp
# → execve("/usr/bin/ls", ["ls", "/tmp"], ...) = 0
# → openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
# → read(3, "\177ELF...", 832) = 832
# → openat(AT_FDCWD, "/tmp", O_RDONLY|O_DIRECTORY) = 3
# → getdents64(3, ..., 32768) = 520
# → write(1, "file1.txt\nfile2.txt\n", 20) = 20
Even a simple ls makes dozens of syscalls: load shared libraries, open the directory,
read entries, write output.
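To see how those dozens of calls break down, you can tally a saved trace by syscall name. The sketch below is one way to do it with awk, assuming plain `strace` output on stdin (no `-t` or `-f` prefixes). The function name `count_syscalls` is made up for this lesson, and the sample input is the abbreviated `ls` trace from above.

```shell
# Count syscalls by name from saved strace output (read on stdin).
# The text before the first "(" on each line is taken as the syscall name.
count_syscalls() {
    awk -F'[(]' '/^[a-z_0-9]+\(/ { n[$1]++ }
                 END { for (s in n) print n[s], s }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Sample lines copied from the ls trace above (arguments abbreviated).
count_syscalls <<'EOF'
execve("/usr/bin/ls", ["ls", "/tmp"], ...) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
read(3, "\177ELF...", 832) = 832
openat(AT_FDCWD, "/tmp", O_RDONLY|O_DIRECTORY) = 3
getdents64(3, ..., 32768) = 520
write(1, "file1.txt\nfile2.txt\n", 20) = 20
EOF
```

For a live command, `strace -c` (used in Exercise 2) produces a similar summary directly.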
The Essential strace Commands
Trace a running process
# Attach to a running process by PID
strace -p 12345
# With timestamps and timing
strace -p 12345 -t -T
# -t = timestamp each line
# -T = time spent in each syscall (in angle brackets at end)
# → 14:23:01 read(5, "...", 4096) = 4096 <0.000023>
# Fields: timestamp, syscall with arguments, return value, time spent in the call (23 microseconds)
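Once a trace carries `-T` durations, most lines are noise and only the slow ones matter. The sketch below filters a saved trace down to calls slower than a threshold. The helper name `slower_than` is hypothetical, and the second sample line is invented for illustration; only the first comes from the output above.

```shell
# Print only the syscalls whose -T duration exceeds a threshold (in seconds).
# Reads strace -T output on stdin; the duration is the <...> at the end of a line.
slower_than() {
    awk -v min="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
            # Print the duration first, then the line without its <...> suffix.
            if (secs > min) printf "%.6f %s\n", secs, substr($0, 1, RSTART - 2)
        }'
}

# Show calls that took longer than 1 millisecond.
slower_than 0.001 <<'EOF'
14:23:01 read(5, "...", 4096) = 4096 <0.000023>
14:23:01 write(6, "...", 512) = 512 <0.004100>
EOF
```

Putting the duration first makes the result easy to pipe into `sort -rn` afterwards.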
Trace a command from start
# Trace everything
strace ./myapp
# Trace specific syscalls (modern glibc opens files with openat, not open)
strace -e trace=openat,read,write ./myapp
# Trace by category
strace -e trace=network ./myapp # socket, connect, sendto, recvfrom, etc.
strace -e trace=file ./myapp # calls that take a path: open, openat, stat, access, unlink, etc.
strace -e trace=process ./myapp # fork, clone, execve, wait4, exit, etc.
Follow child processes
# -f follows forks (critical for scripts that spawn children)
strace -f bash -c './deploy.sh'
# -ff = separate output file per child PID
strace -ff -o /tmp/trace ./myapp
# Creates /tmp/trace.12345, /tmp/trace.12346, etc.
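With one file per PID, a natural first question is "which program did each child actually run?" The sketch below walks the per-PID files and prints the first successful `execve` in each. The function name `execs_by_pid` and the sample file contents are made up for illustration. It assumes the `PREFIX.PID` naming that `-ff -o` produces.

```shell
# For each per-PID file written by `strace -ff -o PREFIX`, print the PID
# (taken from the file name) and the first program it successfully exec'd.
# Prints "-" for a PID that never called execve (for example, a plain fork).
execs_by_pid() {
    prefix=$1
    for f in "$prefix".*; do
        [ -f "$f" ] || continue
        pid=${f##*.}
        prog=$(awk -F'"' '/^execve\(/ && / = 0$/ { print $2; exit }' "$f")
        echo "$pid ${prog:--}"
    done
}

# Demo with two hypothetical trace files in a temporary directory.
dir=$(mktemp -d)
cat > "$dir/trace.12345" <<'EOF'
execve("./myapp", ["./myapp"], 0x7ffd3a2b1c40 /* 20 vars */) = 0
clone(child_stack=NULL, flags=SIGCHLD) = 12346
EOF
cat > "$dir/trace.12346" <<'EOF'
execve("/usr/bin/env", ["env"], 0x7ffd3a2b1c40 /* 20 vars */) = 0
EOF
execs_by_pid "$dir/trace"
rm -rf "$dir"
```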
Pattern 1: The Stuck Process
Your process is alive but doing nothing. strace shows what it's waiting for:
strace -p 12345
# → read(5, [hangs here — no output]
# The process is blocked on read() from file descriptor 5
# What is fd 5?
ls -la /proc/12345/fd/5
# → lrwx------ 1 app app 64 Mar 22 14:23 5 -> socket:[89012]
# It's a socket. Which one?
ss -tnp | grep 'pid=12345,'
# → tcp ESTABLISHED 10.0.1.50:8080 → 10.0.2.100:5432
# → It's waiting on a response from the database
The process is blocked on a database query. The database is slow. The application log doesn't mention this because the query hasn't returned yet.
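The fd lookup step is easy to script. The link target under `/proc/PID/fd/` has a small number of shapes: `socket:[inode]`, `pipe:[inode]`, or a plain path. The sketch below classifies one. The function name `describe_fd_target` is hypothetical, and the log path in the demo is invented.

```shell
# Classify the target of a /proc/PID/fd/N symlink.
# Prints "socket <inode>", "pipe <inode>", or "file <path>".
describe_fd_target() {
    case "$1" in
        socket:\[*\]) inode=${1#socket:\[}; echo "socket ${inode%\]}" ;;
        pipe:\[*\])   inode=${1#pipe:\[};   echo "pipe ${inode%\]}" ;;
        *)            echo "file $1" ;;
    esac
}

describe_fd_target 'socket:[89012]'    # prints: socket 89012
describe_fd_target '/var/log/app.log'  # prints: file /var/log/app.log
```

On a live system you would pass the real link, for example `describe_fd_target "$(readlink /proc/12345/fd/5)"`. The socket inode also appears as `ino:89012` in `ss -tnpe` output, which ties the fd to one specific connection.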
Pattern 2: The Slow Startup
Application takes 30 seconds to start. strace shows where the time goes:
strace -T -e trace=openat,connect,stat ./myapp 2>&1 | sort -t'<' -k2 -rn | head -10
# → connect(3, {sa_family=AF_INET, sin_port=htons(5432), ...}) = 0 <5.012345>
# ↑ 5 seconds connecting to PostgreSQL!
# → openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt"...) = 3 <0.523456>
# ↑ 500ms loading certificate bundle
# → stat("/opt/myapp/locales/en_US.UTF-8"...) = -1 ENOENT <0.000012>
# → stat("/opt/myapp/locales/en_US"...) = -1 ENOENT <0.000011>
# → stat("/opt/myapp/locales/en"...) = -1 ENOENT <0.000010>
# ↑ Hundreds of failed locale lookups (fast individually, slow in aggregate)
The -T flag reveals timing. Sort by time in angle brackets to find the slowest calls.
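Sorting finds the single slowest call, but it hides the "fast individually, slow in aggregate" case. Summing the durations per syscall covers that. The sketch below does so with awk, assuming plain `strace -T` output without `-t` or `-f` prefixes. The function name `total_time_by_syscall` is made up for this lesson, and the input is the sample output shown above.

```shell
# Sum -T durations per syscall name from saved strace output (stdin).
# Output columns: total seconds, number of calls, syscall name.
total_time_by_syscall() {
    awk '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
            name = $0
            sub(/\(.*/, "", name)      # keep only the text before the first "("
            total[name] += secs
            calls[name]++
        }
        END { for (n in total) printf "%.6f %d %s\n", total[n], calls[n], n }
    ' | LC_ALL=C sort -k1,1nr
}

total_time_by_syscall <<'EOF'
connect(3, {sa_family=AF_INET, sin_port=htons(5432), ...}) = 0 <5.012345>
openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt"...) = 3 <0.523456>
stat("/opt/myapp/locales/en_US.UTF-8"...) = -1 ENOENT <0.000012>
stat("/opt/myapp/locales/en_US"...) = -1 ENOENT <0.000011>
stat("/opt/myapp/locales/en"...) = -1 ENOENT <0.000010>
EOF
```

With only three `stat` calls the total stays tiny, but with the hundreds of lookups described above the call count and the summed time make the pattern visible.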
Pattern 3: The Permission Problem
Application fails with a vague error. strace shows exactly which file or operation failed:
strace -e trace=openat,stat,access ./myapp 2>&1 | grep EACCES
# → openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR) = -1 EACCES (Permission denied)
# Now you know EXACTLY which file, which operation, which error
# Common error codes
# EACCES = permission denied
# ENOENT = file not found
# ECONNREFUSED = connection refused
# ETIMEDOUT = connection timed out
# EAGAIN = resource temporarily unavailable (non-blocking I/O)
# EPIPE = broken pipe (writing to closed connection)
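When you don't yet know which error to grep for, a count of failures per error code is a quick first pass. The sketch below is a minimal version that assumes strace's usual ` = -1 ECODE` suffix. The function name `errors_by_code` is hypothetical, and the sample lines are taken from the outputs in this lesson.

```shell
# Count failed syscalls per errno name from saved strace output (stdin).
errors_by_code() {
    awk '
        match($0, / = -1 E[A-Z0-9]+/) {
            # Skip the 6 characters of " = -1 " to land on the error name.
            code = substr($0, RSTART + 6, RLENGTH - 6)
            n[code]++
        }
        END { for (e in n) print n[e], e }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

errors_by_code <<'EOF'
openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR) = -1 EACCES (Permission denied)
stat("/opt/myapp/locales/en_US.UTF-8"...) = -1 ENOENT <0.000012>
stat("/opt/myapp/locales/en_US"...) = -1 ENOENT <0.000011>
stat("/opt/myapp/locales/en"...) = -1 ENOENT <0.000010>
connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.30")}) = -1 ECONNREFUSED
EOF
```

A high ENOENT count is often harmless (search paths, optional config files). A single EACCES or ECONNREFUSED is usually the line worth reading in full.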
Pattern 4: The Network Debug
What is this app connecting to? strace shows every connection:
strace -e trace=network -f ./myapp 2>&1 | grep connect
# → connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("10.0.2.50")}) = 0
# → connect(4, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.20")}) = 0
# → connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.30")}) = -1 ECONNREFUSED
# The app connects to:
# - HTTPS (443) on 10.0.2.50
# - PostgreSQL (5432) on 10.0.1.20
# - Redis (6379) on 10.0.1.30 ← CONNECTION REFUSED
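Reading `htons(...)` and `inet_addr(...)` by eye gets tedious on a chatty app. The sketch below reduces IPv4 `connect` lines to `address:port result`. The function name `connect_targets` is hypothetical. The pattern covers only `AF_INET` and tolerates the trailing address-length argument that real strace output includes.

```shell
# Reduce IPv4 connect() lines from strace output (stdin) to "ip:port result".
connect_targets() {
    sed -n 's/.*connect([0-9]*, {sa_family=AF_INET, sin_port=htons(\([0-9]*\)), sin_addr=inet_addr("\([0-9.]*\)")}.*) = \(.*\)$/\2:\1 \3/p'
}

connect_targets <<'EOF'
connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("10.0.2.50")}) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.20")}) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.30")}) = -1 ECONNREFUSED
EOF
```

IPv6 (`AF_INET6`) and Unix sockets (`AF_UNIX`) print differently and would need their own patterns.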
Pattern 5: What Files Does This App Touch?
# See every file access
strace -e trace=openat -f ./myapp 2>&1 | grep -v ENOENT | head -30
# → openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = 3
# → openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR|O_CREAT) = 4
# → openat(AT_FDCWD, "/tmp/myapp-cache-abc123", O_RDWR|O_CREAT|O_TRUNC) = 5
# Now you know: config from /etc/myapp/, data in /var/lib/myapp/, cache in /tmp/
This is invaluable for debugging containers — "what files does this app need?" tells you exactly what volumes to mount.
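To turn that file list into candidate mount points, collapse the successfully opened paths into unique directories. The sketch below assumes absolute paths in `openat` lines. The function name `opened_dirs` is made up, and the last sample line (a failed lookup) is invented to show that failures are skipped.

```shell
# List the unique parent directories of files that were opened successfully.
# Reads strace output on stdin and ignores openat calls that returned -1.
opened_dirs() {
    awk -F'"' '/openat\(/ && !/ = -1 / { print $2 }' |
        sed 's|/[^/]*$||; s|^$|/|' |
        LC_ALL=C sort -u
}

opened_dirs <<'EOF'
openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = 3
openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR|O_CREAT) = 4
openat(AT_FDCWD, "/tmp/myapp-cache-abc123", O_RDWR|O_CREAT|O_TRUNC) = 5
openat(AT_FDCWD, "/etc/myapp/local.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
EOF
```

The result still needs a human pass: shared libraries and `/etc/ld.so.cache` show up too, and those normally come from the image, not from a volume.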
The Performance Warning
Gotcha: strace uses ptrace to intercept every syscall. This adds massive overhead: a high-throughput process (100,000 syscalls/sec) can slow down 10-100x under strace. Never strace a production process that handles live traffic unless you're desperate and accept the performance hit.
Alternatives for production:
| Tool | Overhead | How it works |
|---|---|---|
| strace | 10-100x slowdown | ptrace (stops process per syscall) |
| perf trace | <2% | Kernel tracepoints (no stopping) |
| bpftrace | <5% | eBPF programs in kernel |
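# Production-safe syscall tracing
timeout -s INT 10 perf trace -p 12345
# 10 seconds of syscall tracing with minimal overhead, then perf is stopped with SIGINT.
# Note: perf trace's --duration option is a filter (show only calls slower than N ms), not a time limit.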
Name Origin: strace stands for "system call trace." It was originally written for SunOS by Paul Kranenburg in 1991 and later ported to Linux by Branko Lankester. The ptrace syscall it uses was designed for debuggers such as GDB, not for performance analysis, which is why the overhead is so high.
Flashcard Check
Q1: strace -p 12345 shows read(5, and hangs. What's happening?
The process is blocked waiting for data on file descriptor 5. Check what fd 5 is:
ls -la /proc/12345/fd/5. It's probably a socket (waiting on a slow backend).
Q2: How do you find which syscalls are slowest?
strace -T adds timing in angle brackets. Sort by the timing field: strace -T ./app 2>&1 | sort -t'<' -k2 -rn | head
Q3: Why shouldn't you strace a production process?
ptrace overhead can slow a process 10-100x. Use perf trace or bpftrace for production. They trace inside the kernel (tracepoints and eBPF) with under 5% overhead.
Q4: openat(...) = -1 EACCES — what does this mean?
The process tried to open a file but was denied permission.
EACCES = permission denied. The strace output shows exactly which file and what operation was attempted.
Q5: strace -f — what does -f do?
Follows child processes (fork/exec). Without it, strace only traces the initial process. Critical for scripts that spawn children.
Exercises
Exercise 1: Trace a command (hands-on)
# See what syscalls curl makes
strace -e trace=network,write curl -s http://example.com -o /dev/null
# Count the syscalls: how many connect(), send(), recv()?
# See what files Python loads on startup
strace -e trace=openat python3 -c "import json" 2>&1 | wc -l
# How many files does Python open just to import json?
Exercise 2: Find the slow syscall (hands-on)
# Time all syscalls for a command
strace -c ls -R /usr 2>&1 | tail -15
# -c gives a summary: total time, calls, errors per syscall
# Which syscall takes the most time?
Exercise 3: Debug a permission error (hands-on)
# Create a file you can't read
echo "secret" > /tmp/strace-test && chmod 000 /tmp/strace-test
strace cat /tmp/strace-test 2>&1 | grep EACCES
# What file? What operation? What error?
chmod 644 /tmp/strace-test && rm /tmp/strace-test
Cheat Sheet
| Task | Command |
|---|---|
| Trace running process | strace -p PID |
| With timing | strace -p PID -T |
| With timestamps | strace -p PID -t |
| Follow children | strace -f ./app |
| Network calls only | strace -e trace=network ./app |
| File access only | strace -e trace=file ./app |
| Process creation only | strace -e trace=process ./app |
| Summary (count + time) | strace -c ./app |
| Write to file | strace -o /tmp/trace.out ./app |
| Production-safe (10 seconds) | timeout -s INT 10 perf trace -p PID |
Common Error Codes
| Code | Meaning |
|---|---|
| EACCES | Permission denied |
| ENOENT | File not found |
| ECONNREFUSED | Connection refused |
| ETIMEDOUT | Connection timed out |
| EAGAIN | Resource temporarily unavailable |
| EPIPE | Broken pipe (writing to closed connection) |
| ENOMEM | Out of memory |
Takeaways
- strace shows the conversation between your app and the kernel. When application logs tell you nothing, strace shows what the process is actually doing.
- -T reveals timing. The slowest syscall is your bottleneck. Sort by timing to find it immediately.
- File descriptor numbers map to real resources. ls /proc/PID/fd/N tells you if fd 5 is a file, socket, pipe, or device.
- Don't strace production. The overhead is 10-100x. Use perf trace or bpftrace for live systems.
- strace is the fastest way to understand unknown software. It answers questions like "What does this app connect to? What files does it read? What signals does it handle?" in seconds.
Related Lessons
- The Mysterious Latency Spike — strace as one tool in the latency debugging toolkit
- Permission Denied — when strace shows EACCES on unexpected files
- The Hanging Deploy — when processes get stuck