strace Footguns

Mistakes that add unacceptable production overhead, produce unusable output, or lead to wrong conclusions.


1. Leaving strace Attached to a Production Process for Too Long

You attach strace -fp $PID to a production service to debug intermittent errors. You get distracted and leave it running for an hour. strace uses ptrace, which stops the target process twice per syscall (on entry and on exit). On a service making 50,000 syscalls/second, each stop adds microseconds of latency, and the aggregate overhead commonly reaches 10-30%. Response times double. Monitoring alerts fire. You are now causing the performance problem you were investigating.

Fix: Always limit trace duration. Use timeout:

timeout 10 strace -fp $PID -e trace=network -o /tmp/trace.log

10-30 seconds is almost always enough. Filter with -e trace= to shrink the set of intercepted syscalls. For lower-overhead tracing, use perf trace instead of strace: it uses kernel tracepoints instead of ptrace, giving roughly 2-5x less overhead.


2. Not Filtering with -e trace= on a Busy Process

You run strace -p $PID without any filter. The process makes 50,000 syscalls per second across dozens of syscall types. The trace output scrolls too fast to read. The output file grows at 10 MB/second. You grep through megabytes of noise looking for the one ECONNREFUSED error. You waste 30 minutes.

Fix: Always filter to the category you care about:

# File issues:     strace -e trace=file -p $PID
# Network issues:  strace -e trace=network -p $PID
# Process issues:  strace -e trace=process -p $PID
# Specific calls:  strace -e trace=openat,connect,write -p $PID

The -e trace=file filter catches openat, stat, access, unlink, and all related file syscalls — it is broader than listing individual syscalls. Start narrow, widen only if needed.


3. Interpreting epoll_wait or futex as a Problem

You attach strace to a service and see hundreds of lines of epoll_wait(3, [], 1024, -1) or futex(0x..., FUTEX_WAIT_PRIVATE, ...). You conclude the process is stuck or broken. It is not — this is normal idle behavior. Event-driven programs (nginx, Node.js, Go services, Python asyncio) spend most of their time in epoll_wait waiting for the next event. Threaded programs waiting on condition variables or mutexes sit in futex.

Fix: epoll_wait and futex(FUTEX_WAIT) are healthy when the process is idle. They become interesting only when they dominate time during active request processing. Use -c to get a summary and compare active periods vs idle:

# Compare idle vs busy: capture during a burst of traffic
timeout 5 strace -cp $PID 2>&1
# If epoll_wait is 99% of time → process is idle (normal)
# If epoll_wait is 50% during high load → interesting, something is slow
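The -c summary is also easy to post-process. A sketch, using a hand-written sample of -c output (the numbers are illustrative, not captured from a live trace):

```shell
# Sample strace -c summary, hand-written for illustration.
cat > /tmp/summary.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.12    4.532100        4532      1000           epoll_wait
  0.51    0.023300          23      1000           recvfrom
EOF

# Pull out the %time share of epoll_wait from the summary table.
awk '$NF == "epoll_wait" {print $1}' /tmp/summary.txt
```

Comparing that single number between an idle capture and a busy capture answers the "is this normal?" question in seconds.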

4. Not Using -f to Follow Child Processes

You trace a Python web server: strace -p $PID. The main process is the master that spawns workers. The master does almost nothing — it just manages workers. Your trace shows only epoll_wait because you are tracing the wrong process. The actual request handling happens in child workers.

Fix: Use -f to follow forks and child processes:

strace -fp $PID -e trace=network -o /tmp/trace.log

For multi-threaded applications, use -ff -o /tmp/trace to write one file per thread. This separates interleaved syscalls from different threads into readable per-thread files: /tmp/trace.12345, /tmp/trace.12346, etc.


5. Truncated String Output Making Traces Unreadable

You trace what a process reads from a config file. The output shows: read(3, "database:\n host: db.int"..., 4096) = 128. The string is truncated at 32 characters (the default). You cannot see the full config being read, the full error message being written, or the full URL being requested.

Fix: Use -s to increase the string length limit:

# Show up to 256 characters of string arguments
strace -s 256 -e trace=read,write -p $PID

For very large reads/writes (HTTP responses, file contents), use -s 4096 or even -s 65535. Be aware that very large string limits produce enormous trace files on busy processes.


6. Forgetting -y and Manually Tracking File Descriptors

Your trace shows read(7, "...", 4096) = 128. What is file descriptor 7? You scroll back through thousands of lines looking for the openat() or socket() call that created fd 7. Five minutes later, you find it was a TCP connection to the database.

Fix: Use the -y flag to annotate file descriptors with their paths:

strace -y -e trace=read,write -p $PID 2>&1 | head -10
# read(7</var/run/postgresql/.s.PGSQL.5432>, "...", 4096) = 128
# write(9<TCP:[10.0.1.1:42318->10.0.1.50:443]>, "GET /api...", 256) = 256

The -y flag is available in strace 4.7+ and saves significant debugging time on any non-trivial trace.


7. Tracing a Process You Do Not Own Without sudo

You run strace -p 12345 on a process owned by another user. You get: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted. You try running your application under strace, but it is launched by systemd and you cannot easily modify the service file.

Fix: Use sudo to attach to any process. For systemd-managed services:

# Attach to the running service
sudo strace -fp $(systemctl show myapp --property MainPID --value) \
  -e trace=network -o /tmp/trace.log

# Or add strace to the service temporarily:
# systemctl edit myapp → add:
# [Service]
# ExecStart=/usr/bin/strace -f -o /tmp/trace.log /usr/bin/myapp
# Then: systemctl restart myapp
# Remember to remove the override when done!
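The same override, written out as a systemd drop-in (the unit name and paths are illustrative):

```ini
# /etc/systemd/system/myapp.service.d/strace.conf
[Service]
ExecStart=
ExecStart=/usr/bin/strace -f -o /tmp/trace.log /usr/bin/myapp
```

The empty ExecStart= line clears the unit's original command; without it, systemd refuses to start a service with two ExecStart entries (except for Type=oneshot). Run systemctl daemon-reload before restarting.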

If you cannot get root, you can trace your own processes or check kernel.yama.ptrace_scope:

cat /proc/sys/kernel/yama/ptrace_scope
# 0 = any process can trace any other process of the same user
# 1 = only a direct ancestor (or a tracer allowed via prctl) can attach
#     (the default on Ubuntu and most modern distros)
# 2 = only processes with CAP_SYS_PTRACE can attach
# 3 = no process can attach at all (locked until reboot)

8. Confusing strace Overhead with Application Problems

You attach strace to a service to investigate slow responses. The service becomes even slower. You conclude the problem is getting worse. In reality, the additional slowdown is entirely from strace's ptrace overhead. You waste time investigating a phantom regression that disappears when you detach strace.

Fix: Understand that strace adds 5-20% overhead for typical workloads, and up to 50% for syscall-heavy processes. Establish a baseline: measure response time without strace, then measure with strace. The difference is strace overhead. For production tracing with minimal overhead, use:

- perf trace (kernel tracepoints, 2-5x less overhead)
- bpftrace (eBPF, even lower overhead)
- strace -c (summary mode, much less per-syscall overhead than a full trace)
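A rough sketch of such a baseline measurement, timing the same syscall-heavy command with and without strace (ls -R /etc stands in for a real workload; assumes GNU date and strace are installed):

```shell
# Time a syscall-heavy command plain, then under strace.
CMD="ls -R /etc"

s0=$(date +%s%N); sh -c "$CMD" >/dev/null 2>&1; e0=$(date +%s%N)
s1=$(date +%s%N); strace -cf -o /dev/null sh -c "$CMD" >/dev/null 2>&1; e1=$(date +%s%N)

echo "plain:  $(( (e0 - s0) / 1000000 )) ms"
echo "traced: $(( (e1 - s1) / 1000000 )) ms"
```

The traced run is typically noticeably slower; that gap, not the application, is the "regression" you see while strace is attached.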


9. Not Checking the Return Value and Only Reading the Syscall Name

You see connect(4, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.1.50")}, 16) = 0 and conclude the database connection works. Later, you see connect(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.1.51")}, 16) = -1 ECONNREFUSED (Connection refused) and miss it because you were scanning for connect, not ECONNREFUSED. The return value is the most important part of every strace line.

Fix: Always read return values. The pattern is: syscall(args) = return_value. A return of -1 means failure, and the errno name tells you why:

- ENOENT — file not found
- EACCES — permission denied
- ECONNREFUSED — target not listening
- ETIMEDOUT — connection timed out (firewall or network issue)

Grep for errors specifically: grep ' = -1 ' trace.log or use strace's -Z flag (5.2+) to only show failed syscalls.
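A self-contained sketch of the grep approach, using hand-written sample lines in place of a real trace file:

```shell
# Fake trace lines standing in for real strace output.
cat > /tmp/demo_trace.log <<'EOF'
connect(4, {sa_family=AF_INET, sin_port=htons(5432)}, 16) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(6379)}, 16) = -1 ECONNREFUSED (Connection refused)
openat(AT_FDCWD, "/etc/app.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
EOF

# Keep only the failed syscalls.
grep ' = -1 ' /tmp/demo_trace.log
```

Only the ECONNREFUSED and ENOENT lines survive the filter, which is exactly the view -Z gives you directly on newer strace versions.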


10. Tracing the Wrong PID in a Container

You trace PID 1 inside a Kubernetes pod using kubectl exec. PID 1 is the container's init process (tini, dumb-init, or the shell). The actual application is PID 7 or PID 15. Your trace shows only the init process sitting in wait4(), doing nothing interesting.

Fix: Identify the correct PID inside the container:

kubectl exec myapp -- ps aux
# Find the actual application process PID

# Or trace from the host (more reliable):
PID=$(crictl inspect $(crictl ps --name myapp -q) | jq '.info.pid')
sudo strace -fp $PID -e trace=network

When tracing from inside the container, the container needs SYS_PTRACE capability. Use -f to follow child processes from the init process.


11. Writing Trace Output to the Terminal Instead of a File

You run strace -fp $PID and the output floods your terminal at thousands of lines per second. You try to scroll back but the terminal buffer is full. You press Ctrl+C to stop, losing everything. The terminal display also adds latency — writing to a TTY is slower than writing to a file, which increases strace's overhead on the target process.

Fix: Always write to a file with -o:

sudo strace -fp $PID -e trace=network -o /tmp/trace.log
# Then analyze the file separately:
grep ECONNREFUSED /tmp/trace.log
tail -20 /tmp/trace.log

Writing to a file also preserves the complete trace for later analysis. If you need real-time viewing, write to a file and tail it: tail -f /tmp/trace.log | grep ENOENT.


12. Assuming strace Shows Everything a Process Does

You trace a process and see no file operations. You conclude the process does not read any files. In reality, the process uses memory-mapped files (mmap) that were opened before you attached strace. Or the process uses io_uring for async I/O, which strace does not fully trace. Or the process communicates via shared memory, which involves no syscalls after the initial mmap.

Fix: strace only shows syscalls from the moment you attach. It does not show files already open, memory already mapped, or sockets already established. To see the full picture: check /proc/$PID/fd for open file descriptors, /proc/$PID/maps for memory mappings, and ss -p for network connections. For complete visibility from process start, trace from launch: strace -f ./myapp rather than attaching mid-flight.
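These checks need no tracing at all. A sketch, using the current shell's own PID as a stand-in for the target process:

```shell
PID=$$   # stand-in target; substitute the real PID here

ls -l /proc/$PID/fd        # file descriptors that are already open
head -5 /proc/$PID/maps    # memory mappings, including mmap'd files
# ss -tnp | grep "pid=$PID"  # established sockets (may need root)
```

Reading /proc gives you the state that existed before you attached, which strace can never show you.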