Linux Signals & Process Control - Footguns
- Using `kill -9` as your first resort. You see a process misbehaving and immediately reach for `kill -9`. The process dies without flushing write buffers, closing database connections, releasing file locks, removing PID files, or deregistering from service discovery. Downstream services see broken connections. Data may be corrupted. Lock files block the service from restarting. Shared memory segments and temporary files are orphaned.
Remember: SIGTERM (15) = "please shut down gracefully." SIGKILL (9) = "kernel removes you immediately." SIGTERM can be caught, handled, or ignored. SIGKILL cannot be caught, handled, or ignored: it is processed by the kernel, never delivered to the process. That is why cleanup code never runs after `kill -9`.
Fix: Always send SIGTERM first. Wait 10-30 seconds (depending on the service's expected
shutdown time). Only escalate to SIGKILL if the process does not exit. This is what Docker
(`docker stop`), Kubernetes (pod termination), and systemd (`systemctl stop`) all do
internally. Script the pattern:
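A minimal sketch of that pattern (the `terminate_gracefully` helper name and the 15-second default grace period are illustrative, not a standard tool):

```shell
#!/bin/sh
# terminate_gracefully PID [GRACE_SECONDS]
# SIGTERM first, poll with kill -0, SIGKILL only as a last resort.
terminate_gracefully() {
    pid=$1
    grace=${2:-15}
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited on its own
        sleep 1
        i=$((i + 1))
    done
    echo "PID $pid ignored SIGTERM; sending SIGKILL" >&2
    kill -KILL "$pid"
}
```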
- Killing PID 1. On a bare-metal or VM system, killing PID 1 (init/systemd) triggers a kernel panic or immediate reboot. Inside a container, killing PID 1 terminates the entire container. Either way, everything running in that context dies immediately with no graceful shutdown.
Fix: Never target PID 1 directly. If you need to restart systemd services, use
`systemctl restart`. If you need to restart a container's main process, restart the
container. Double-check your target PID before sending any signal:
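As a sketch, a guard like this (the `safe_kill` wrapper is hypothetical) refuses PID 1 and shows you the target before signalling:

```shell
#!/bin/sh
# safe_kill PID [SIGNAL] - hypothetical wrapper: refuse PID 1,
# print the target, then send the signal.
safe_kill() {
    pid=$1
    sig=${2:-TERM}
    if [ "$pid" -eq 1 ]; then
        echo "refusing to signal PID 1" >&2
        return 1
    fi
    ps -p "$pid" -o pid,ppid,comm   # eyeball what you are about to signal
    kill -s "$sig" "$pid"
}
```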
- Assuming SIGTERM always works. You send SIGTERM to a process and walk away, assuming it will exit. But the process has registered a SIGTERM handler that does nothing, or it is caught in an infinite loop in the handler, or it is a poorly written application that never implemented signal handling. The process continues running.
Fix: After sending SIGTERM, always verify the process actually exited. Use kill -0
to check:
kill -TERM "$PID"
sleep 10
if kill -0 "$PID" 2>/dev/null; then
    echo "WARNING: Process ignored SIGTERM" >&2
    kill -KILL "$PID"   # escalate only after the grace period
fi
- Trying to kill zombie processes. You see zombie processes (state Z) and keep sending `kill -9` to them. Nothing happens. You try `kill -KILL`. Nothing. You try every signal. Nothing. Zombies are already dead: they have already exited. The entry in the process table exists only because the parent has not called `wait()` to collect the exit status.
Fix: Kill or fix the parent process. When the parent dies, the zombies are adopted by PID 1, which reaps them immediately. Find the parent:
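A sketch for locating zombies and the parents failing to reap them (the `awk` filter matches any state string starting with Z):

```shell
#!/bin/sh
# List each zombie together with the parent that has not called wait().
ps -eo pid,ppid,state,comm |
    awk '$3 ~ /^Z/ { print "zombie " $1 " is waiting on parent " $2 }'
```

Restarting (or fixing) the printed parent PID causes PID 1 to adopt and reap the zombie.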
- Sending signals to a process group when you meant a single process. `kill -TERM -- -1234` (negative PID) sends SIGTERM to the entire process group with PGID 1234, not just PID 1234. If you accidentally include the minus sign, you may kill dozens of unrelated processes that share the same group. The worst case: a PID argument of -1 (`kill -TERM -- -1`) sends the signal to every process you have permission to signal.
Fix: Always verify the sign and value of the PID argument. Use ps -o pid,pgid,comm
to check group membership before killing a group. If you intend to kill a single process,
never use a negative PID.
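A sketch of the check, using the current shell's own PID as the stand-in target so the example is safe to run:

```shell
#!/bin/sh
# Inspect group membership before any group-wide kill.
TARGET=$$                      # stand-in for the PID you intend to signal
ps -o pid,pgid,comm -p "$TARGET"
pgid=$(ps -o pgid= -p "$TARGET" | tr -d ' ')
echo "a group kill would be: kill -TERM -- -$pgid"
# The '--' matters: without it, the negative PID can be parsed as an option.
```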
- Expecting nohup to handle everything automatically. You run `nohup ./script.sh &` and log out. Later you discover: the output went to `nohup.out` in whatever directory you happened to be in (maybe `/tmp`, maybe `/`), stderr was not redirected, the disk filled up from unbounded output, or the process died anyway because it depended on the terminal for input.
Fix: Always redirect both stdout and stderr explicitly:
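A sketch (the log and PID file paths are assumptions; the payload is an inline command here so the example is self-contained):

```shell
#!/bin/sh
# Send both streams to one known log file; 2>&1 folds stderr into stdout.
LOG=/tmp/myservice.log           # assumed path; pick a real log location
nohup sh -c 'echo started; echo oops >&2' >"$LOG" 2>&1 &
echo $! > /tmp/myservice.pid     # record the PID so you can manage it later
wait $!
cat "$LOG"                       # both lines landed in the log
```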
For anything important, use tmux, screen, or `systemd-run` instead. nohup is a duct-tape
solution: it only makes the process ignore SIGHUP and redirects stdout. It does not manage
the process, restart it on failure, or capture structured logs.
- nice values being counterintuitive.
Lower nice values mean higher priority. `nice -n -20` is the highest priority; `nice -n 19` is the lowest. New users often run `nice -n 20 important-task` thinking they are giving it high priority, when they are actually giving it the lowest possible priority (values above 19 are silently clamped to 19).
Fix: Remember: "nice" means "how nice are you to other processes?" A high nice value means very nice (yielding), low nice means aggressive (takes priority). Only root can set negative nice values. Verify with:
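For example (the nice value of 10 is arbitrary):

```shell
#!/bin/sh
# Launch at reduced priority, then read back the NI column to confirm.
nice -n 10 sleep 5 &
p=$!
ps -o pid,ni,comm -p "$p"
```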
- Ctrl+C in a script killing background processes unexpectedly. You have a script that starts several background processes. You press Ctrl+C to cancel the script. SIGINT is sent to the entire foreground process group, which may include the background children if they share the same group. Your "background" workers all die.
Fix: Background processes in scripts should be started with explicit signal handling:
trap 'echo "Interrupted"; kill $(jobs -p) 2>/dev/null; exit 1' INT TERM
./worker1.sh &
./worker2.sh &
wait
- SIGPIPE breaking pipes silently. You have a long-running producer piped to a consumer. The consumer crashes or exits. The producer receives SIGPIPE on the next write and dies — silently, with no error message, exit code 141 (128 + 13). Your monitoring sees the process as "exited" not "crashed." Data stops flowing and nobody notices.
Fix: Programs that produce output through pipes should handle SIGPIPE. In bash scripts:
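A minimal POSIX-sh sketch that reproduces the 141 exit status the text describes (the FIFO path is arbitrary):

```shell
#!/bin/sh
# The consumer (head) exits after one line; the producer (yes) then takes
# SIGPIPE on its next write and dies with status 141 = 128 + 13.
fifo=/tmp/sigpipe-demo.$$
mkfifo "$fifo"
yes > "$fifo" &                  # long-running producer
producer=$!
head -n 1 "$fifo" >/dev/null     # consumer reads one line and goes away
wait "$producer"
status=$?
rm -f "$fifo"
echo "producer exit status: $status"
```

In bash pipelines, `set -o pipefail` plus checking `PIPESTATUS` surfaces the 141 instead of letting it pass silently.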
In Python: `signal.signal(signal.SIGPIPE, signal.SIG_DFL)`. Monitor for unexpected exit
codes (141) in your pipelines.
- D-state processes cannot be killed. Period. A process is stuck in uninterruptible sleep (D-state). You try SIGTERM, SIGKILL, SIGSTOP; nothing works. You try multiple times. You open a ticket saying "kill is broken." It is not broken. D-state means the process is waiting for a kernel I/O operation, and the kernel will not deliver any signal until the I/O completes (or fails).
Fix: Do not waste time sending signals. Identify the I/O subsystem:
cat /proc/<PID>/stack   # Shows the kernel function it is stuck in
cat /proc/<PID>/wchan   # Single function name
Common culprits: a dead NFS server (fix the network or `umount -f -l`), a failing disk (check `dmesg`), a hung FUSE mount (`fusermount -uz`). If nothing works, reboot is the only option.
Debug clue: If `/proc/<PID>/wchan` shows `nfs_wait_bit_killable` or `rpc_wait_bit_killable`, the process is blocked on an NFS call. If it shows `io_schedule`, it is waiting on local disk I/O. If it shows `fuse_dev_do_read`, a FUSE filesystem's userspace daemon is not responding. The `wchan` value tells you exactly which kernel subsystem is the bottleneck.
- Forgetting that killing a parent orphans all its children. You kill a misbehaving supervisor. Its 30 worker processes are now orphaned, reparented to PID 1, still running, still consuming resources, still holding connections and file descriptors. Nobody is managing or monitoring them. If the supervisor restarts, it spawns 30 new workers, and now you have 60.
Fix: Kill the process group, not just the parent:
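A sketch of the group kill, using a throwaway "supervisor" started in its own session via `setsid` so the example does not touch the caller's own group:

```shell
#!/bin/sh
# Demo supervisor with two workers, placed in a fresh process group.
setsid sh -c 'sleep 60 & sleep 60 & wait' &
sup=$!
sleep 1
pgid=$(ps -o pgid= -p "$sup" | tr -d ' ')
kill -TERM -- "-$pgid"          # negative PID: signal every member of the group
```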
Or kill the children first, then the parent. Verify with `pstree -p $PID` before and after.