Linux Signals & Process Control - Footguns
- Using `kill -9` as your first resort. You see a process misbehaving and immediately reach for `kill -9`. The process dies without flushing write buffers, closing database connections, releasing file locks, removing PID files, or deregistering from service discovery. Downstream services see broken connections. Data may be corrupted. Lock files block the service from restarting. Shared memory segments and temporary files are orphaned.
Remember: SIGTERM (15) = "please shut down gracefully." SIGKILL (9) = "kernel removes you immediately." SIGTERM can be caught, handled, or ignored. SIGKILL cannot be caught, handled, or ignored: it is processed by the kernel, never delivered to the process. That is why cleanup code never runs after `kill -9`.
Fix: Always send SIGTERM first. Wait 10-30 seconds (depending on the service's expected
shutdown time). Only escalate to SIGKILL if the process does not exit. This is what Docker
(`docker stop`), Kubernetes (pod termination), and systemd (`systemctl stop`) all do
internally. Script the pattern:
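A minimal sketch of that pattern (the `terminate_gracefully` helper name and the 15-second default grace period are illustrative, not a standard tool):

```shell
#!/bin/sh
# terminate_gracefully PID [GRACE_SECONDS]
# SIGTERM first, poll with kill -0, SIGKILL only as a last resort.
terminate_gracefully() {
    pid=$1
    grace=${2:-15}
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited on its own
        sleep 1
        i=$((i + 1))
    done
    echo "PID $pid ignored SIGTERM; sending SIGKILL" >&2
    kill -KILL "$pid"
}
```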
- Killing PID 1. On a bare-metal or VM system, killing PID 1 (init/systemd) triggers a kernel panic or immediate reboot. Inside a container, killing PID 1 terminates the entire container. Either way, everything running in that context dies immediately with no graceful shutdown.
Fix: Never target PID 1 directly. If you need to restart systemd services, use
`systemctl restart`. If you need to restart a container's main process, restart the
container. Double-check your target PID before sending any signal:
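As a sketch, a guard like this (the `safe_kill` wrapper is hypothetical) refuses PID 1 and shows you the target before signalling:

```shell
#!/bin/sh
# safe_kill PID [SIGNAL] - hypothetical wrapper: refuse PID 1,
# print the target, then send the signal.
safe_kill() {
    pid=$1
    sig=${2:-TERM}
    if [ "$pid" -eq 1 ]; then
        echo "refusing to signal PID 1" >&2
        return 1
    fi
    ps -p "$pid" -o pid,ppid,comm   # eyeball what you are about to signal
    kill -s "$sig" "$pid"
}
```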
- Assuming SIGTERM always works. You send SIGTERM to a process and walk away, assuming it will exit. But the process has registered a SIGTERM handler that does nothing, or it is caught in an infinite loop in the handler, or it is a poorly written application that never implemented signal handling. The process continues running.
Fix: After sending SIGTERM, always verify the process actually exited. Use kill -0
to check:
kill -TERM "$PID"
sleep 10
if kill -0 "$PID" 2>/dev/null; then
    echo "WARNING: Process ignored SIGTERM" >&2
    kill -KILL "$PID"   # escalate only after the grace period
fi
- Trying to kill zombie processes. You see zombie processes (state Z) and keep sending `kill -9` to them. Nothing happens. You try `kill -KILL`. Nothing. You try every signal. Nothing. Zombies are already dead: they have already exited. The entry in the process table exists only because the parent has not called `wait()` to collect the exit status.
Fix: Kill or fix the parent process. When the parent dies, the zombies are adopted by PID 1, which reaps them immediately. Find the parent:
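A sketch for locating zombies and the parents failing to reap them (the `awk` filter matches any state string starting with Z):

```shell
#!/bin/sh
# List each zombie together with the parent that has not called wait().
ps -eo pid,ppid,state,comm |
    awk '$3 ~ /^Z/ { print "zombie " $1 " is waiting on parent " $2 }'
```

Restarting (or fixing) the printed parent PID causes PID 1 to adopt and reap the zombie.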
- Sending signals to a process group when you meant a single process. `kill -TERM -- -1234` (negative PID) sends SIGTERM to the entire process group with PGID 1234, not just PID 1234. If you accidentally include the minus sign, you may kill dozens of unrelated processes that share the same group. The worst case: a PID argument of -1 (`kill -TERM -- -1`) sends the signal to every process you have permission to signal.
Fix: Always verify the sign and value of the PID argument. Use ps -o pid,pgid,comm
to check group membership before killing a group. If you intend to kill a single process,
never use a negative PID.
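A sketch of the check, using the current shell's own PID as the stand-in target so the example is safe to run:

```shell
#!/bin/sh
# Inspect group membership before any group-wide kill.
TARGET=$$                      # stand-in for the PID you intend to signal
ps -o pid,pgid,comm -p "$TARGET"
pgid=$(ps -o pgid= -p "$TARGET" | tr -d ' ')
echo "a group kill would be: kill -TERM -- -$pgid"
# The '--' matters: without it, the negative PID can be parsed as an option.
```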
- Expecting nohup to handle everything automatically. You run `nohup ./script.sh &` and log out. Later you discover: the output went to `nohup.out` in whatever directory you happened to be in (maybe `/tmp`, maybe `/`), stderr was not redirected, the disk filled up from unbounded output, or the process died anyway because it depended on the terminal for input.
Fix: Always redirect both stdout and stderr explicitly:
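A sketch (the log and PID file paths are assumptions; the payload is an inline command here so the example is self-contained):

```shell
#!/bin/sh
# Send both streams to one known log file; 2>&1 folds stderr into stdout.
LOG=/tmp/myservice.log           # assumed path; pick a real log location
nohup sh -c 'echo started; echo oops >&2' >"$LOG" 2>&1 &
echo $! > /tmp/myservice.pid     # record the PID so you can manage it later
wait $!
cat "$LOG"                       # both lines landed in the log
```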
For anything important, use tmux, screen, or `systemd-run` instead. nohup is a duct-tape
solution: it only makes the process ignore SIGHUP and redirects stdout. It does not manage
the process, restart it on failure, or capture structured logs.
- nice values being counterintuitive.
Lower nice values mean higher priority. `nice -n -20` is the highest priority; `nice -n 19` is the lowest. New users often run `nice -n 20 important-task` thinking they are giving it high priority, when they are actually giving it the lowest possible priority (values above 19 are silently clamped to 19).
Fix: Remember: "nice" means "how nice are you to other processes?" A high nice value means very nice (yielding), low nice means aggressive (takes priority). Only root can set negative nice values. Verify with:
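For example (the nice value of 10 is arbitrary):

```shell
#!/bin/sh
# Launch at reduced priority, then read back the NI column to confirm.
nice -n 10 sleep 5 &
p=$!
ps -o pid,ni,comm -p "$p"
```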
- Ctrl+C in a script killing background processes unexpectedly. You have a script that starts several background processes. You press Ctrl+C to cancel the script. SIGINT is sent to the entire foreground process group, which may include the background children if they share the same group. Your "background" workers all die.
Fix: Background processes in scripts should be started with explicit signal handling:
trap 'echo "Interrupted"; kill $(jobs -p) 2>/dev/null; exit 1' INT TERM
./worker1.sh &
./worker2.sh &
wait
- SIGPIPE breaking pipes silently. You have a long-running producer piped to a consumer. The consumer crashes or exits. The producer receives SIGPIPE on the next write and dies — silently, with no error message, exit code 141 (128 + 13). Your monitoring sees the process as "exited" not "crashed." Data stops flowing and nobody notices.
Fix: Programs that produce output through pipes should handle SIGPIPE. In bash scripts:
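A minimal POSIX-sh sketch that reproduces the 141 exit status the text describes (the FIFO path is arbitrary):

```shell
#!/bin/sh
# The consumer (head) exits after one line; the producer (yes) then takes
# SIGPIPE on its next write and dies with status 141 = 128 + 13.
fifo=/tmp/sigpipe-demo.$$
mkfifo "$fifo"
yes > "$fifo" &                  # long-running producer
producer=$!
head -n 1 "$fifo" >/dev/null     # consumer reads one line and goes away
wait "$producer"
status=$?
rm -f "$fifo"
echo "producer exit status: $status"
```

In bash pipelines, `set -o pipefail` plus checking `PIPESTATUS` surfaces the 141 instead of letting it pass silently.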
In Python: `signal.signal(signal.SIGPIPE, signal.SIG_DFL)`. Monitor for unexpected exit
codes (141) in your pipelines.
- D-state processes cannot be killed. Period. A process is stuck in uninterruptible sleep (D-state). You try SIGTERM, SIGKILL, SIGSTOP; nothing works. You try multiple times. You open a ticket saying "kill is broken." It is not broken. D-state means the process is waiting for a kernel I/O operation, and the kernel will not deliver any signal until the I/O completes (or fails).
Fix: Do not waste time sending signals. Identify the I/O subsystem:
cat /proc/<PID>/stack   # Shows the kernel function it is stuck in
cat /proc/<PID>/wchan   # Single function name
Common culprits: a dead NFS server (fix the network or `umount -f -l`), a failing disk (check `dmesg`), a hung FUSE mount (`fusermount -uz`). If nothing works, reboot is the only option.
Debug clue: If `/proc/<PID>/wchan` shows `nfs_wait_bit_killable` or `rpc_wait_bit_killable`, the process is blocked on an NFS call. If it shows `io_schedule`, it is waiting on local disk I/O. If it shows `fuse_dev_do_read`, a FUSE filesystem's userspace daemon is not responding. The `wchan` value tells you exactly which kernel subsystem is the bottleneck.
- Forgetting that killing a parent orphans all its children. You kill a misbehaving supervisor. Its 30 worker processes are now orphaned, reparented to PID 1, still running, still consuming resources, still holding connections and file descriptors. Nobody is managing or monitoring them. If the supervisor restarts, it spawns 30 new workers, and now you have 60.
Fix: Kill the process group, not just the parent:
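A sketch of the group kill, using a throwaway "supervisor" started in its own session via `setsid` so the example does not touch the caller's own group:

```shell
#!/bin/sh
# Demo supervisor with two workers, placed in a fresh process group.
setsid sh -c 'sleep 60 & sleep 60 & wait' &
sup=$!
sleep 1
pgid=$(ps -o pgid= -p "$sup" | tr -d ' ')
kill -TERM -- "-$pgid"          # negative PID: signal every member of the group
```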
Or kill the children first, then the parent. Verify with `pstree -p $PID` before and after.