Linux Debugging

12 cards — 🟢 4 easy | 🟡 6 medium | 🔴 2 hard

🟢 Easy (4)

1. Why is dstat a good first tool when a machine is misbehaving?

dstat shows CPU, disk, network, and memory stats updating in real time in a single view. It quickly answers "is the machine CPU-bound, disk-bound, memory-bound, or network-bound?" without switching between tools.

2. What three questions can lsof answer?

1. What files is this process holding open? (lsof -p <PID>)
2. Who is listening on this port? (lsof -i :<PORT>)
3. Why won't this filesystem unmount? (lsof /mount/point)

3. How do you list all TCP connections with process info using ss?

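ss -tp shows established TCP connections with the owning process (run it as root to see processes owned by other users). Add -a to include sockets in every state, -l for listening sockets only (ss -tlnp), or swap -t for -u to list UDP sockets (ss -unp). The -n flag skips name resolution and prints numeric addresses and ports.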

4. When should you check dmesg during debugging and what would you look for?

Check dmesg when processes are killed unexpectedly (OOM killer messages), hardware errors are suspected (disk I/O errors, NIC failures), or containers crash without application logs. Look for: "Out of memory: Killed process", "I/O error", "segfault at", or "hardware error".
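As a rough sketch of that search, the helper below filters kernel log text for the four signatures listed above. The name scan_kernel_log is hypothetical, and the pattern list is only as complete as the card; real kernels word some of these messages differently between versions.

```shell
# Filter kernel log text (read from stdin) for the signatures named in the card.
# scan_kernel_log is a hypothetical helper name, not a standard tool.
scan_kernel_log() {
    # -E: extended regex for alternation; -i: message casing varies by kernel
    grep -E -i 'Out of memory: Killed process|I/O error|segfault at|hardware error'
}

# Typical use (may need root when kernel.dmesg_restrict=1):
#   dmesg -T | scan_kernel_log
```

The -T flag asks dmesg for human-readable timestamps, which makes it easier to line kernel events up with the time a process died.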

🟡 Medium (6)

1. What kinds of problems is strace best at revealing?

File access errors (ENOENT, EACCES), network connection failures, missing libraries, permission problems, signal delivery, and slow syscalls. It shows every syscall a process makes with arguments and return values.

2. What does perf help you understand that top does not?

perf shows WHERE CPU time is going at the function level (hot functions, call stacks) and can trace kernel and user-space events. top only shows per-process CPU percentage.

3. A process is leaking file descriptors. How would you confirm this and find what it is opening?

Check /proc/<PID>/fd: each entry is a symlink to an open file, socket, or pipe. Count the entries over time (ls /proc/<PID>/fd | wc -l) to confirm growth, then read the symlink targets to see what is being leaked (ls -la /proc/<PID>/fd). lsof -p <PID> gives the same data with more detail.
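The two checks above can be sketched as small shell helpers. The names count_fds and fd_targets are hypothetical; both default to the current shell's PID so they can be tried safely before pointing them at a real service.

```shell
# Print the number of open file descriptors for a PID (default: this shell).
count_fds() {
    ls "/proc/${1:-$$}/fd" | wc -l
}

# Summarise what the descriptors point at, most common target first.
# Inode suffixes such as socket:[12345] are stripped so sockets and pipes group together.
fd_targets() {
    for fd in "/proc/${1:-$$}/fd"/*; do
        readlink "$fd" 2>/dev/null   # an fd may close between the glob and the read
    done | sed -E 's/:\[[0-9]+\]$//' | sort | uniq -c | sort -rn
}
```

To watch for growth, sample the count on an interval, for example `while sleep 5; do echo "$(date +%T) $(count_fds 1234)"; done`, where 1234 stands in for the PID of the suspect process. A count that climbs steadily under constant load points to a leak, and fd_targets then shows which kind of descriptor is piling up.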

4. A service is slow but top shows low CPU usage. What does that tell you and what do you check next?

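Low CPU with high latency means the process is waiting, not computing. It is likely I/O-bound or blocked on a lock or network call. Check: iostat (disk I/O saturation), ss -tp (connection state: many CLOSE-WAIT or SYN-SENT sockets?), strace -p <PID> (which syscall is it blocked in?). Run strace unfiltered for this check, because a filter such as -e trace=network hides futex and poll. A process parked in futex is waiting on a lock or another thread; one parked in poll or epoll_wait is waiting for a file descriptor, usually a socket, to become ready.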
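Before attaching strace, /proc can give a cheaper first hint about whether the process is waiting. The sketch below is an assumption-laden complement to the tools above, not a replacement: proc_state and thread_states are hypothetical helper names, and they read only the State field from /proc/<PID>/status.

```shell
# Print the scheduler state letter of a PID (default: this shell).
# R = running, S = interruptible sleep, D = uninterruptible sleep (usually disk I/O).
proc_state() {
    awk '/^State:/ {print $2}' "/proc/${1:-$$}/status"
}

# Count threads of a PID by state, useful for multi-threaded services.
thread_states() {
    for t in "/proc/${1:-$$}/task"/*; do
        awk '/^State:/ {print $2}' "$t/status" 2>/dev/null
    done | sort | uniq -c
}
```

Many threads in D suggests storage trouble and makes iostat the next stop; threads mostly in S with low CPU fits the lock or network wait described above.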

5. What do vmstat and iostat show that top does not?

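vmstat shows memory, swap, block I/O, and CPU stats per interval, so active swapping (the si/so columns), blocked processes (b), and I/O wait (wa) stand out as they happen. iostat shows per-device disk I/O statistics (throughput, queue depth, utilization per disk), which top does not break down at all.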

6. How do you find which process is preventing a port from being reused?

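lsof -i :<PORT> shows every process with a socket on that port, which reveals stale listeners left over from a previous run or unexpected services that grabbed the port first. Sockets in TIME_WAIT belong to the kernel rather than to any process, so lsof does not list them; use ss -tan to see them.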

🔴 Hard (2)

1. What are common debugging mistakes that Linux tools can prevent?

Reaching for application logs only (instead of OS-level data), assuming "slow" means CPU (could be disk or network), assuming "network issue" without packet capture, ignoring /proc, and restarting services before gathering evidence.

2. You need to strace a production service handling 10K requests/sec. What precautions do you take?

strace uses ptrace, which stops the process at every syscall entry and exit, so at 10K req/s the overhead is severe. Precautions: (1) Filter aggressively with -e trace=network or -e trace=file to cut decoding and output; when attaching with -p the tracee still stops at every syscall, so filtering reduces the cost but does not remove it. (2) Write to a file with -o, never to the terminal. (3) Limit duration to 10-30 seconds at most. (4) Consider perf trace instead (kernel tracepoints, ~5x less overhead). (5) If possible, trace a single worker thread with -p <TID> rather than every thread with -f.
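Precautions (1) to (3) can be combined into one bounded command. The helper below only prints the command so it can be reviewed before anything touches production; bounded_strace_cmd, its argument order, the 20-second default, and the output path are all hypothetical choices for illustration.

```shell
# Print (do not run) a time-limited, filtered strace invocation.
# Arguments: PID or TID, seconds to trace (default 20), output file.
bounded_strace_cmd() {
    target=$1
    secs=${2:-20}
    outfile=${3:-/tmp/strace.$target.out}
    # timeout sends SIGTERM when the limit expires; strace is expected to
    # detach from the attached process on that signal and leave it running.
    echo "timeout $secs strace -p $target -e trace=network -o $outfile"
}
```

For example, `bounded_strace_cmd 4242 15 /tmp/t.out` prints `timeout 15 strace -p 4242 -e trace=network -o /tmp/t.out`. Wrapping strace in timeout means the trace ends even if the operator is distracted, which matters more than any single flag when the target is serving live traffic.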