
strace: Reading the Matrix


Topics: system calls, strace, debugging, file I/O, network, process lifecycle
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic Linux command line


The Mission

Your application is stuck. It's not crashing — the process is running, CPU is low, no errors in the log. It's just... doing nothing. Or it's slow and you can't figure out why. The application logs tell you nothing because the problem is below the application — in the system calls between your code and the kernel.

strace lets you see every system call a process makes: every file it opens, every byte it reads and writes, every network connection it creates, every signal it receives. It's like reading the conversation between your application and the operating system.


What System Calls Are

Your application runs in userspace. It can't directly access hardware, files, or the network. For that, it asks the kernel via system calls (syscalls).

Your code:          open("config.yaml", O_RDONLY)
                         ↓ trap to kernel
Kernel:             Find file on disk, check permissions, allocate file descriptor
                         ↓ return to userspace
Your code:          fd = 3 (success)

Every I/O operation, every file access, every network connection, every process creation goes through a syscall. strace intercepts and prints them all.

# See what syscalls a command makes
strace ls /tmp
# → execve("/usr/bin/ls", ["ls", "/tmp"], ...) = 0
# → openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
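# → openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3   (library path varies by distro)
# → read(3, "\177ELF...", 832) = 832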
# → openat(AT_FDCWD, "/tmp", O_RDONLY|O_DIRECTORY) = 3
# → getdents64(3, ..., 32768) = 520
# → write(1, "file1.txt\nfile2.txt\n", 20) = 20

Even a simple ls makes dozens of syscalls: load shared libraries, open the directory, read entries, write output.
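
To see which calls dominate a trace, it can help to tally them by name. The sketch below is one way to do that with standard text tools. The function name count_syscalls is made up for this example, and the sample input is modeled on the illustrative lines above rather than taken from a real capture.

```shell
# Tally syscall names from strace output read on stdin.
# Assumes one complete syscall per line, in the "name(args) = result" form.
# count_syscalls is a hypothetical helper name, not part of strace.
count_syscalls() {
    # Keep lines that start with a syscall name, strip everything from "(" on,
    # then count duplicates and list the most frequent first.
    grep -E '^[a-z_0-9]+\(' | sed 's/(.*//' | sort | uniq -c | sort -rn
}

# Sample input modeled on the ls trace above (illustrative, not a real capture).
count_syscalls <<'EOF'
execve("/usr/bin/ls", ["ls", "/tmp"], ...) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY) = 3
read(3, "\177ELF...", 832) = 832
openat(AT_FDCWD, "/tmp", O_RDONLY|O_DIRECTORY) = 3
getdents64(3, ..., 32768) = 520
write(1, "file1.txt\nfile2.txt\n", 20) = 20
EOF
```

On a real system you would save a trace with strace -o /tmp/ls.trace ls /tmp and then run count_syscalls < /tmp/ls.trace. strace -c produces a similar summary natively, including time per syscall.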


The Essential strace Commands

Trace a running process

# Attach to a running process by PID
strace -p 12345

# With timestamps and timing
strace -p 12345 -t -T
# -t = timestamp each line
# -T = time spent in each syscall (in angle brackets at end)

# → 14:23:01 read(5, "...", 4096) = 4096 <0.000023>
#   ↑ time   ↑ syscall              ↑ returned  ↑ 23 microseconds

Trace a command from start

# Trace everything
strace ./myapp

# Trace specific syscalls (modern glibc opens files with openat, not open)
strace -e trace=openat,read,write ./myapp

# Trace by category
strace -e trace=network ./myapp    # socket, connect, send, recv, etc.
strace -e trace=file ./myapp       # open, openat, stat, unlink, etc. (calls that take a path)
                                   # read and write take a descriptor, so they fall under trace=desc
strace -e trace=process ./myapp    # fork, exec, wait, etc.

Follow child processes

# -f follows forks (critical for scripts that spawn children)
strace -f bash -c './deploy.sh'

# -ff = separate output file per child PID
strace -ff -o /tmp/trace ./myapp
# Creates /tmp/trace.12345, /tmp/trace.12346, etc.
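
When -f is combined with a single -o file instead of -ff, strace prefixes each line with the PID that made the call. A minimal sketch for seeing which process in the tree is busiest, assuming that line format; calls_per_pid is a hypothetical helper name and the sample log is invented for illustration.

```shell
# Count traced lines per PID in a combined "strace -f -o FILE" log.
# Assumption: every line starts with the PID, as strace writes when -f and -o are combined.
# calls_per_pid is a hypothetical helper name, not part of strace.
calls_per_pid() {
    awk '$1 ~ /^[0-9]+$/ { n[$1]++ } END { for (pid in n) print pid, n[pid] }' |
        sort -k2,2nr
}

# Invented sample log: a parent shell (12345) and the script it spawns (12346).
calls_per_pid <<'EOF'
12345 execve("/usr/bin/bash", ["bash", "-c", "./deploy.sh"], ...) = 0
12345 clone(...) = 12346
12346 execve("./deploy.sh", ["./deploy.sh"], ...) = 0
12346 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
12346 read(3, "...", 4096) = 120
12346 close(3) = 0
12345 wait4(-1, ...) = 12346
EOF
```

The count is per line, so calls that strace splits into "unfinished" and "resumed" halves are counted twice. For a rough picture of where activity sits in a process tree that is usually good enough.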

Pattern 1: The Stuck Process

Your process is alive but doing nothing. strace shows what it's waiting for:

strace -p 12345
# → read(5, [hangs here — no output]

# The process is blocked on read() from file descriptor 5
# What is fd 5?
ls -la /proc/12345/fd/5
# → lrwx------ 1 app app 64 Mar 22 14:23 5 -> socket:[89012]

# It's a socket. Which one?
ss -tnp | grep 'pid=12345,'
# → tcp ESTABLISHED 10.0.1.50:8080 → 10.0.2.100:5432
# → It's waiting on a response from the database

Port 5432 is PostgreSQL's default, so the process is most likely blocked waiting for a database response: either the database is slow or the network path to it is. The application log doesn't mention this because the query hasn't returned yet.
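
The fd lookup step can be wrapped in a small helper. This is a sketch under the assumption of a Linux /proc filesystem; describe_fd is a hypothetical name, and the demo inspects the current shell rather than a real stuck process.

```shell
# Report what a file descriptor of a process points to, via /proc (Linux only).
# describe_fd is a hypothetical helper name, not part of strace.
describe_fd() {
    pid=$1
    fd=$2
    target=$(readlink "/proc/$pid/fd/$fd") || {
        echo "cannot read fd $fd of pid $pid" >&2
        return 1
    }
    # /proc encodes the resource type in the link target.
    case $target in
        socket:*) echo "socket $target" ;;
        pipe:*)   echo "pipe $target" ;;
        /*)       echo "file $target" ;;
        *)        echo "other $target" ;;
    esac
}

# Demo on the current shell: open fd 9 on /dev/null, inspect it, close it.
exec 9</dev/null
describe_fd "$$" 9
exec 9<&-
```

For the stuck process above, describe_fd 12345 5 would report a socket, and ss is the next step to find its peer. Inspecting another user's process needs root.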


Pattern 2: The Slow Startup

Application takes 30 seconds to start. strace shows where the time goes:

strace -T -e trace=openat,connect,stat ./myapp 2>&1 | sort -t'<' -k2 -rn | head -10
# → connect(3, {sa_family=AF_INET, sin_port=htons(5432), ...}) = 0 <5.012345>
#   ↑ 5 seconds connecting to PostgreSQL!
# → openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt"...) = 3 <0.523456>
#   ↑ 500ms loading certificate bundle
# → stat("/opt/myapp/locales/en_US.UTF-8"...) = -1 ENOENT <0.000012>
# → stat("/opt/myapp/locales/en_US"...) = -1 ENOENT <0.000011>
# → stat("/opt/myapp/locales/en"...) = -1 ENOENT <0.000010>
#   ↑ Hundreds of failed locale lookups (fast individually, slow in aggregate)

The -T flag reveals timing. Sort by time in angle brackets to find the slowest calls.
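
Sorting finds the slowest single call, but it hides calls that are cheap individually and expensive in aggregate, like the locale lookups. A minimal sketch that sums the -T field per syscall name; time_by_syscall is a hypothetical helper name, and it assumes lines without a timestamp or PID prefix.

```shell
# Sum the <seconds> field that -T appends, grouped by syscall name.
# Assumes each line starts with the syscall name (no -t timestamp, no PID prefix)
# and ends with <time>; other lines are skipped.
# time_by_syscall is a hypothetical helper name, not part of strace.
time_by_syscall() {
    awk '
        /^[a-z_0-9]+\(/ && /<[0-9]+\.[0-9]+>$/ {
            name = $0; sub(/\(.*/, "", name)             # text before the first "("
            t = $0; sub(/.*</, "", t); sub(/>$/, "", t)   # number inside the final <...>
            total[name] += t
            calls[name]++
        }
        END {
            for (s in total) printf "%.6f %d %s\n", total[s], calls[s], s
        }
    ' | sort -rn
}

# Sample input: the illustrative startup trace from this section.
time_by_syscall <<'EOF'
connect(3, {sa_family=AF_INET, sin_port=htons(5432), ...}) = 0 <5.012345>
openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt"...) = 3 <0.523456>
stat("/opt/myapp/locales/en_US.UTF-8"...) = -1 ENOENT <0.000012>
stat("/opt/myapp/locales/en_US"...) = -1 ENOENT <0.000011>
stat("/opt/myapp/locales/en"...) = -1 ENOENT <0.000010>
EOF
```

Each output line is total seconds, call count, and syscall name. With hundreds of stat lines instead of three, the stat total would climb the list even though no single call is slow.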


Pattern 3: The Permission Problem

Application fails with a vague error. strace shows exactly which file or operation failed:

strace -e trace=openat,stat,access ./myapp 2>&1 | grep EACCES
# → openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR) = -1 EACCES (Permission denied)
#   Now you know EXACTLY which file, which operation, which error

# Common error codes
# EACCES  = permission denied
# ENOENT  = file not found
# ECONNREFUSED = connection refused
# ETIMEDOUT = connection timed out
# EAGAIN  = resource temporarily unavailable (non-blocking I/O)
# EPIPE   = broken pipe (writing to closed connection)
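
When a trace contains many failures, grouping them by error code shows which ones matter. The sketch below assumes strace's usual "= -1 ERRNAME" result format; failures_by_errno and failed_paths are hypothetical helper names, and the sample input is invented.

```shell
# Count failed syscalls per errno, most frequent first.
# A failed call ends in "= -1 ERRNAME (description)"; capture ERRNAME.
# Both helper names are hypothetical, not part of strace.
failures_by_errno() {
    sed -n 's/.* = -1 \(E[A-Z0-9]*\).*/\1/p' | sort | uniq -c | sort -rn
}

# Print the first quoted argument (usually the path) of calls that failed with errno $1.
failed_paths() {
    grep -E -- " = -1 $1( |\$)" | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}

# Invented sample trace.
sample='openat(AT_FDCWD, "/etc/myapp/local.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR) = -1 EACCES (Permission denied)
stat("/opt/myapp/locales/en", ...) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = 3'

printf '%s\n' "$sample" | failures_by_errno
printf '%s\n' "$sample" | failed_paths EACCES
```

ENOENT results are often harmless probing (optional config files, locale fallbacks), so a low EACCES count usually deserves attention before a high ENOENT count.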

Pattern 4: The Network Debug

What is this app connecting to? strace shows every connection:

strace -e trace=network -f ./myapp 2>&1 | grep connect
# → connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("10.0.2.50")}) = 0
# → connect(4, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.20")}) = 0
# → connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.30")}) = -1 ECONNREFUSED

# The app connects to:
# - HTTPS (443) on 10.0.2.50
# - PostgreSQL (5432) on 10.0.1.20
# - Redis (6379) on 10.0.1.30 ← CONNECTION REFUSED
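
Reading address structures by eye gets tedious on a chatty service. A sketch that reduces IPv4 connect lines to "address:port result"; list_connections is a hypothetical helper name, and it only handles the AF_INET form shown above (IPv6 and Unix sockets are skipped).

```shell
# Reduce IPv4 connect() lines from strace output to "address:port result".
# list_connections is a hypothetical helper name, not part of strace.
list_connections() {
    awk '
        /connect\(/ && /sa_family=AF_INET,/ {
            port = $0; sub(/.*htons\(/, "", port); sub(/\).*/, "", port)
            addr = $0; sub(/.*inet_addr\("/, "", addr); sub(/".*/, "", addr)
            res = $0
            sub(/.*\) = /, "", res)        # keep only the return value part
            sub(/ <[0-9.]+>$/, "", res)    # drop a trailing -T timing, if any
            sub(/ \(.*/, "", res)          # drop the "(description)" text
            sub(/^-1 /, "", res)           # leave just the errno name on failure
            print addr ":" port, (res == "0" ? "ok" : res)
        }
    '
}

# Sample input: the connect lines from this section.
list_connections <<'EOF'
connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("10.0.2.50")}) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.20")}) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.30")}) = -1 ECONNREFUSED
EOF
```

A non-blocking connect normally reports EINPROGRESS here, which is not a failure; the outcome shows up later in the trace.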

Pattern 5: What Files Does This App Touch?

# See every file access
strace -e trace=openat -f ./myapp 2>&1 | grep -v ENOENT | head -30
# → openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = 3
# → openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR|O_CREAT) = 4
# → openat(AT_FDCWD, "/tmp/myapp-cache-abc123", O_RDWR|O_CREAT|O_TRUNC) = 5

# Now you know: config from /etc/myapp/, data in /var/lib/myapp/, cache in /tmp/

This is invaluable for debugging containers — "what files does this app need?" tells you exactly what volumes to mount.
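
To turn an openat trace into a list of candidate mount points, the paths can be reduced to their parent directories. A minimal sketch; touched_dirs is a hypothetical helper name, and it only considers absolute paths of opens that succeeded.

```shell
# List the unique parent directories of files that were opened successfully.
# touched_dirs is a hypothetical helper name, not part of strace.
touched_dirs() {
    # Skip failed calls, take the first quoted absolute path, drop the file name.
    grep -v ' = -1 ' |
        sed -n 's|^[^"]*"\(/[^"]*\)/[^"/]*".*|\1|p' |
        sort -u
}

# Sample input: the openat lines from this section.
touched_dirs <<'EOF'
openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = 3
openat(AT_FDCWD, "/var/lib/myapp/data.db", O_RDWR|O_CREAT) = 4
openat(AT_FDCWD, "/tmp/myapp-cache-abc123", O_RDWR|O_CREAT|O_TRUNC) = 5
EOF
```

The list only covers code paths exercised during the traced run, so treat it as a lower bound: a feature you did not trigger may need directories that never appear.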


The Performance Warning

Gotcha: strace uses ptrace to intercept every syscall. This adds massive overhead — a high-throughput process (100,000 syscalls/sec) can slow down 10-100x under strace. Never strace a production process that handles live traffic unless you're desperate and accept the performance hit.

Alternatives for production:

Tool        Overhead   How it works
strace      10-100x    ptrace (stops the process on every syscall)
perf trace  <2%        Kernel tracepoints (no stopping)
bpftrace    <5%        eBPF programs in kernel
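
# Lower-overhead syscall tracing for production
timeout -s INT 10 perf trace -p 12345
# 10 seconds of syscall tracing with minimal overhead
# (perf trace's own --duration N option is a filter: it shows only calls slower than N ms)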

Name Origin: strace stands for "system call trace." It was originally written for SunOS by Paul Kranenburg in 1991, then ported to Linux by Branko Lankester. The ptrace syscall it uses was designed for debuggers such as GDB, not for performance analysis, which is why the overhead is so high.


Flashcard Check

Q1: strace -p 12345 shows read(5, and hangs. What's happening?

The process is blocked waiting for data on file descriptor 5. Check what fd 5 is: ls -la /proc/12345/fd/5. It's probably a socket (waiting on a slow backend).

Q2: How do you find which syscalls are slowest?

strace -T adds timing in angle brackets. Sort by the timing field: strace -T ./app 2>&1 | sort -t'<' -k2 -rn | head

Q3: Why shouldn't you strace a production process?

ptrace overhead can slow a process 10-100x. Use perf trace or bpftrace for production: they rely on kernel tracepoints and eBPF instead of stopping the process, typically with under 5% overhead.

Q4: openat(...) = -1 EACCES — what does this mean?

The process tried to open a file but was denied permission. EACCES = permission denied. The strace output shows exactly which file and what operation was attempted.

Q5: strace -f — what does -f do?

Follows child processes created by fork, vfork, or clone. Without it, strace traces only the initial process. Critical for scripts that spawn children.


Exercises

Exercise 1: Trace a command (hands-on)

# See what syscalls curl makes
strace -e trace=network,write curl -s http://example.com -o /dev/null
# Count the syscalls: how many connect(), send(), recv()?

# See what files Python loads on startup
strace -e trace=openat python3 -c "import json" 2>&1 | wc -l
# How many files does Python open just to import json?

Exercise 2: Find the slow syscall (hands-on)

# Time all syscalls for a command
strace -c ls -R /usr > /dev/null
# -c prints a summary to stderr: total time, calls, errors per syscall
# (stdout is discarded so the directory listing doesn't bury the table)
# Which syscall takes the most time?

Exercise 3: Debug a permission error (hands-on)

# Create a file you can't read (run as a non-root user; root bypasses file permissions)
echo "secret" > /tmp/strace-test && chmod 000 /tmp/strace-test
strace cat /tmp/strace-test 2>&1 | grep EACCES
# What file? What operation? What error?
chmod 644 /tmp/strace-test && rm /tmp/strace-test

Cheat Sheet

Task                      Command
Trace running process     strace -p PID
With timing               strace -p PID -T
With timestamps           strace -p PID -t
Follow children           strace -f ./app
Network calls only        strace -e trace=network ./app
File access only          strace -e trace=file ./app
Process creation only     strace -e trace=process ./app
Summary (count + time)    strace -c ./app
Write to file             strace -o /tmp/trace.out ./app
Production-safe           timeout -s INT 10 perf trace -p PID

Common Error Codes

Code           Meaning
EACCES         Permission denied
ENOENT         File not found
ECONNREFUSED   Connection refused
ETIMEDOUT      Connection timed out
EAGAIN         Resource temporarily unavailable
EPIPE          Broken pipe (writing to closed connection)
ENOMEM         Out of memory

Takeaways

  1. strace shows the conversation between your app and the kernel. When application logs tell you nothing, strace shows what the process is actually doing.

  2. -T reveals timing. The slowest syscall is your bottleneck. Sort by timing to find it immediately.

  3. File descriptor numbers map to real resources. ls -l /proc/PID/fd/N shows whether fd N is a file, socket, pipe, or device.

  4. Don't strace production. The overhead is 10-100x. Use perf trace or bpftrace for live systems.

  5. strace is the fastest way to understand unknown software. "What does this app connect to? What files does it read? What signals does it handle?" — strace answers all of these in seconds.


Related Lessons

  • The Mysterious Latency Spike: strace as one tool in the latency debugging toolkit
  • Permission Denied: when strace shows EACCES on unexpected files
  • The Hanging Deploy: when processes get stuck