Process Management Footguns

  1. Using SIGKILL as your first resort. You see a stuck process and immediately reach for kill -9. The process dies without flushing write buffers, closing database connections, releasing file locks, or removing PID files. Downstream services see broken connections. Data gets corrupted. Lock files prevent the service from restarting cleanly.

Fix: Always send SIGTERM first and wait at least 5-10 seconds. Only escalate to SIGKILL if the process does not respond. Script it: kill $PID; sleep 10; kill -0 $PID 2>/dev/null && kill -9 $PID. This is the same sequence Docker and Kubernetes use for a reason.

Remember the signal escalation ladder: SIGTERM (15) asks politely, SIGINT (2) is what Ctrl-C sends, SIGHUP (1) tells daemons to reload config, SIGQUIT (3) asks for a core dump, SIGKILL (9) kills unconditionally. SIGKILL (along with SIGSTOP) cannot be caught, blocked, or ignored — the kernel handles it directly. For Java, SIGQUIT triggers a thread dump (useful for debugging) without killing the process. For Go, SIGQUIT writes a goroutine dump to stderr and then terminates the process.
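The TERM-then-KILL sequence from the fix can be wrapped in a small POSIX-sh helper (the function name and the 10-second default grace period are illustrative):

```shell
# Sketch of a graceful-stop helper: SIGTERM first, poll for exit,
# escalate to SIGKILL only if the grace period runs out.
# Usage: graceful_stop PID [GRACE_SECONDS]
graceful_stop() {
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited cleanly
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null               # last resort
}
```

Calling `graceful_stop 1234 10` matches what `docker stop` does with its default timeout.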

  2. Ignoring zombie processes because they "use no resources." Technically true — zombies consume no CPU or memory. But each zombie holds a PID table entry. In a long-running container or system that spawns many short-lived processes, zombies accumulate until PID exhaustion. New processes cannot be created. The system grinds to a halt.

Fix: Fix the parent process so it properly reaps children with wait() or handles SIGCHLD. In containers, always use an init process (tini, dumb-init) as PID 1. Monitor zombie count: ps aux | awk '$8 ~ /^Z/' | wc -l (match on a prefix, since the STAT column can read "Z+" or "Zs", not just "Z") and alert if it exceeds a threshold.
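A cron- or agent-friendly sketch of that monitoring check (the threshold of 10 is an arbitrary example value):

```shell
# Count processes in zombie (Z) state and warn past a threshold.
# ps -eo stat= prints only the STAT column, one process per line.
ZOMBIE_THRESHOLD=10
zcount=$(ps -eo stat= | grep -c '^Z' || true)
if [ "$zcount" -gt "$ZOMBIE_THRESHOLD" ]; then
    echo "WARNING: $zcount zombies exceed threshold $ZOMBIE_THRESHOLD" >&2
fi
echo "zombies: $zcount"
```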

  3. Running nohup without redirecting both stdout and stderr. You run nohup ./script.sh & and log out. Output goes to nohup.out in whatever directory you happened to be in. If that was /tmp, the file gets cleaned up. If it was /, the root filesystem fills up. Stderr may not be captured at all, so errors vanish silently.

Fix: Always redirect explicitly: nohup ./script.sh > /var/log/task.log 2>&1 &. Better yet, use systemd-run or tmux for any task that needs to survive logout. Save the PID: echo $! > /var/run/task.pid.
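Putting the pieces together (the log path, PID file, and looping command below are placeholders; in production prefer a persistent location like /var/log over /tmp):

```shell
# Launch a task that survives logout, with explicit output redirection
# and a recorded PID for later management.
LOG=/tmp/demo-task.log       # illustrative path
PIDFILE=/tmp/demo-task.pid   # illustrative path
nohup sh -c 'while :; do date; sleep 60; done' > "$LOG" 2>&1 &
echo $! > "$PIDFILE"
```

Later, `kill "$(cat "$PIDFILE")"` stops exactly the process you started, with no ps parsing.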

  4. Not understanding D-state processes. You see a process stuck in D-state and try to kill it. kill -9 does nothing. You try again. Still nothing. You escalate, waste time, maybe even reboot prematurely. D-state means uninterruptible sleep — the process is waiting for a kernel I/O operation and cannot receive any signal, including SIGKILL.

Fix: Identify what I/O the process is waiting on: cat /proc/PID/stack and cat /proc/PID/wchan. Common culprits: dead NFS servers, failing disks, hung FUSE mounts. Fix the underlying I/O issue. The process will resume or die on its own once the I/O completes or times out.

Under the hood: D-state (TASK_UNINTERRUPTIBLE) exists because the kernel cannot safely interrupt certain operations — the process may hold a lock on kernel data structures, or an I/O operation may be mid-flight. Killing it would corrupt kernel state. Linux 2.6.25 introduced TASK_KILLABLE (a D-state variant that responds to SIGKILL only) for some I/O paths like NFS, but many disk I/O paths remain fully uninterruptible. cat /proc/PID/stack shows the exact kernel function where the process is stuck.
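A quick triage loop that lists D-state processes together with their kernel wait channel (reading /proc/PID/stack generally requires root; wchan is readable without it):

```shell
# List D-state processes with the kernel wait channel they are stuck in.
# Produces no output when nothing is in uninterruptible sleep.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ {print $1, $3}' |
while read -r pid comm; do
    printf '%s\t%s\twchan=%s\n' "$pid" "$comm" \
        "$(cat /proc/$pid/wchan 2>/dev/null)"
done
```

If every stuck process shows an NFS- or block-layer wait channel, the storage side is your suspect, not the processes.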

  5. Ignoring file descriptor limits until you hit them. Your service runs fine in dev where it handles 10 connections. In production it handles 10,000. At some point, every new connection fails with "Too many open files." The default ulimit for open files is often 1024 — far too low for any production service.

Fix: Check limits proactively: cat /proc/PID/limits | grep "open files" and compare against ls /proc/PID/fd | wc -l. Set appropriate limits in systemd unit files (LimitNOFILE=65536) or /etc/security/limits.conf. Monitor fd counts as a metric.
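A sketch of that proactive check as a reusable function (the 80% warning threshold is an arbitrary choice):

```shell
# Compare a process's open-descriptor count against its soft limit.
# Prints "used/limit" and warns above 80% utilization.
check_fds() {
    pid=$1
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    if [ "$used" -gt $(( limit * 80 / 100 )) ]; then
        echo "WARNING: PID $pid using $used of $limit fds" >&2
    fi
    echo "$used/$limit"
}
```

Run it as `check_fds $(pgrep -x myservice)` from a monitoring agent and export the ratio as a metric.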

  6. Killing a parent without considering its children. You kill a misbehaving supervisor process. Its 50 child workers are now orphaned, still running, still consuming resources, still holding connections. They get reparented to PID 1 but nobody is managing them. If they crash, nobody restarts them.

Fix: Kill the entire process group: kill -TERM -- -$PGID (the -- ends option parsing so the negative number is read as a group ID rather than an option). Or use pkill -TERM -P $PPID to kill all children first, then the parent. Verify with pstree -p $PID before and after.

  7. Sending SIGTERM to a bash script and expecting children to die. You SIGTERM a wrapper script. The shell process dies. But the curl, sleep, or python commands it was running continue as orphaned processes. The script's trap handler never fires because it was not set up, or it did not explicitly kill child processes.

Fix: In wrapper scripts, always set a trap that kills child processes: trap 'kill $(jobs -p) 2>/dev/null; exit 1' SIGTERM SIGINT. Run long-running children in the background with &, then use wait to block. This allows the trap to fire between operations.
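A runnable sketch of that pattern (the sleep commands stand in for real long-running children; exit code 143 mirrors death by SIGTERM):

```shell
# Wrapper pattern: background children + trap + wait, so SIGTERM/SIGINT
# propagate to the children instead of orphaning them.
run_with_trap() {
    trap 'kill $(jobs -p) 2>/dev/null; wait; exit 143' TERM INT
    sleep 300 &    # stand-in for a real long-running child
    sleep 300 &    # stand-in for a second child
    wait           # signals interrupt this wait, letting the trap fire
}
```

The key detail is `wait`: a shell does not process traps while a foreground command runs, so children must be backgrounded and waited on.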

  8. Assuming deleted files free disk space immediately. You delete a 50GB log file, but df shows no change. A process still has the file open. The kernel will not free the disk space until every file descriptor pointing to that file is closed. This can cause "disk full" conditions that persist even after you think you have cleaned up.

Fix: Check for deleted-but-open files: lsof +L1. Either restart the process holding the file descriptor, or truncate the file through its proc entry: : > /proc/PID/fd/N. For log files, use logrotate with copytruncate or signal-based rotation so the process reopens the file.

Debug clue: lsof +L1 lists all files with a link count of 0 (deleted but still open). The SIZE/OFF column shows how much disk space each deleted file is consuming. Sort by size to find the biggest offenders: lsof +L1 | awk 'NR>1 {print $7, $1, $10}' | sort -rn | head (field 7 is SIZE/OFF, field 10 the file name). On a system with "mysterious" disk usage where du and df disagree, deleted-but-open files are almost always the cause.
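When lsof is not installed, the same information is available directly from /proc (a sketch; without root you only get a complete view of processes you own):

```shell
# List open file descriptors whose target file has been deleted.
# The kernel marks such symlinks in /proc with a " (deleted)" suffix.
list_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```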

  9. Using ps aux output to parse PIDs programmatically. You write a script that does ps aux | grep myapp | awk '{print $2}' | xargs kill. This is fragile: it matches itself (the grep process), it matches unrelated processes with "myapp" in their args, and field positions can shift with long usernames. One day it kills the wrong process.

Fix: Use pgrep and pkill with specific flags: pgrep -f "myapp --config" or pkill -TERM -f "myapp --config". For exact matches: pgrep -x myapp. For children of a specific parent: pkill -P $PPID. These tools are purpose-built and avoid grep pipeline fragility.
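A quick demonstration of the exact-match behavior, using sleep as a stand-in for the real service:

```shell
# pgrep -x matches the process name exactly; contrast with the fragile
# "ps | grep" pipeline, which also matches the grep process itself.
sleep 300 &
target=$!
pgrep -x sleep | grep -qx "$target" && echo "found by exact name"
kill "$target"
```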

  10. Not using a proper init in containers. Your Dockerfile has CMD ["python", "app.py"]. Python becomes PID 1. It does not handle SIGCHLD, so zombie children accumulate. It may not even forward SIGTERM properly to its own threads. Graceful shutdown does not work. Kubernetes sends SIGTERM, nothing happens, then SIGKILL after 30 seconds — every deployment.

Fix: Use a lightweight init: ENTRYPOINT ["tini", "--"] or run with Docker's --init flag. This adds a proper PID 1 that reaps zombies and forwards signals. For Kubernetes, ensure terminationGracePeriodSeconds gives the app time to shut down after receiving the forwarded SIGTERM.
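A minimal Dockerfile sketch of the tini approach (the base image, package source, and app path are illustrative, not prescribed by the text above):

```dockerfile
FROM python:3.12-slim
# tini becomes PID 1: it reaps zombies and forwards signals to the app.
RUN apt-get update \
    && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
COPY app.py /app/app.py
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["python", "/app/app.py"]
```

With this in place, a SIGTERM from `docker stop` or Kubernetes reaches the Python process instead of dying at PID 1.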