Linux Signals & Process Control - Street-Level Ops
Killing Stuck Processes: The Correct Sequence
Never reach for kill -9 first. Follow this escalation:
PID=1234
# Step 1: Ask politely
kill -TERM "$PID"
echo "Sent SIGTERM, waiting..."
# Step 2: Wait for graceful shutdown (10-30 seconds depending on the service)
sleep 10
# Step 3: Check if it is still alive
if kill -0 "$PID" 2>/dev/null; then
    echo "Still alive after SIGTERM. Sending SIGKILL..."
    kill -9 "$PID"
else
    echo "Process exited cleanly."
fi
Script it for production use:
#!/bin/bash
# graceful-kill.sh: SIGTERM, then SIGKILL after a configurable timeout
# Usage: graceful-kill.sh PID [timeout_seconds]
PID=${1:?"usage: graceful-kill.sh PID [timeout_seconds]"}
TIMEOUT=${2:-15}
# kill -0 sends no signal; it only tests whether the PID exists and can be signaled
if ! kill -0 "$PID" 2>/dev/null; then
    echo "PID $PID is not running (or you lack permission to signal it)"
    exit 0
fi
echo "Sending SIGTERM to $PID..."
kill -TERM "$PID"
# Poll once per second until the process exits or the timeout expires
for i in $(seq 1 "$TIMEOUT"); do
    if ! kill -0 "$PID" 2>/dev/null; then
        echo "Process exited after ${i}s"
        exit 0
    fi
    sleep 1
done
echo "Process still alive after ${TIMEOUT}s. Sending SIGKILL..."
kill -9 "$PID"
sleep 1
if kill -0 "$PID" 2>/dev/null; then
    echo "WARNING: Process survived SIGKILL. Likely D-state (kernel I/O wait)."
    exit 1
fi
echo "Process killed."
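To see why the escalation order matters, here is a minimal sketch that compares the two signals against a worker with a SIGTERM handler. The `worker` function and the marker file are hypothetical and exist only for this demo; a real service would flush buffers or release locks in its handler.

```shell
# Hypothetical demo: SIGTERM lets a handler run; SIGKILL never does.
marker=$(mktemp)

worker() {
    # On SIGTERM, record that cleanup ran, then exit normally.
    trap 'echo cleaned > "$marker"; exit 0' TERM
    while :; do sleep 0.1; done
}

worker & pid=$!
sleep 0.5                      # give the worker time to install its trap
kill -TERM "$pid"
wait "$pid"; term_status=$?    # 0: the handler ran and exited cleanly
term_marker=$(cat "$marker")

: > "$marker"                  # reset the marker for the second run
worker & pid=$!
sleep 0.5
kill -KILL "$pid"
wait "$pid" 2>/dev/null; kill_status=$?   # 137 = 128 + 9 (SIGKILL)
kill_marker=$(cat "$marker")

echo "SIGTERM: status=$term_status marker='$term_marker'"
echo "SIGKILL: status=$kill_status marker='$kill_marker'"
rm -f "$marker"
```

After SIGKILL the marker stays empty because the handler never gets a chance to run, which is the cleanup you lose by skipping SIGTERM.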
Finding and Killing Zombie Processes
Zombies cannot be killed — they are already dead. You fix the parent.
# Step 1: Find zombies
ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /^Z/'
# PID PPID STAT USER COMMAND
# 15432 12001 Z+ app [worker] <defunct>
# 15438 12001 Z+ app [worker] <defunct>
# Step 2: Identify the parent
ps -p 12001 -o pid,ppid,stat,comm,args
# PID PPID STAT COMMAND ARGS
# 12001 8900 S python python /app/main.py
# Step 3: Try sending SIGCHLD to the parent (a hint to reap children; it only helps if the parent has a handler that calls wait())
kill -CHLD 12001
# Step 4: If zombies persist, the parent is buggy. Kill the parent.
kill -TERM 12001
# PID 1 (init/systemd) or the nearest subreaper adopts and reaps the zombies automatically.
# Step 5: Verify zombies are gone (prints 0 when none remain)
ps -eo stat | grep -c '^Z'
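Because the fix is always applied to the parent, it can help to group zombies by PPID first. The sketch below assumes input in the `PID PPID STAT COMM` column order used above; `zombie_parents` is a hypothetical helper name, and the sample data is canned so the example runs anywhere.

```shell
# Sketch: count zombies per parent so the worst offender is obvious.
# Reads "PID PPID STAT COMM" lines on stdin, prints "PPID COUNT".
zombie_parents() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr -k1,1n
}

# Live usage (the trailing '=' suppresses the header line):
#   ps -eo pid=,ppid=,stat=,comm= | zombie_parents

# Self-contained example with canned ps output:
sample='15432 12001 Z+ worker
15438 12001 Z+ worker
15500 13000 Z helper
12001 8900 S python'
report=$(printf '%s\n' "$sample" | zombie_parents)
echo "$report"
```

With the canned input, parent 12001 is listed first with two zombies, followed by 13000 with one.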
Process Tree Analysis
When investigating a service, start with the tree view to understand the process hierarchy:
# Full system tree with PIDs
pstree -p
# Tree for a specific service
pstree -p $(systemctl show -p MainPID nginx | cut -d= -f2)
# nginx(1200)─┬─nginx(1201)
# ├─nginx(1202)
# ├─nginx(1203)
# └─nginx(1204)
# ps forest view with resource usage
ps auxf | grep -A 5 nginx
# Show thread count per process
ps -eo pid,nlwp,comm --sort=-nlwp | head -20
# nlwp = number of light-weight processes (threads)
# Find direct children of a PID (pgrep -P does not recurse into grandchildren)
pgrep -P 1200 --list-full
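Note that `pgrep -P` matches on parent PID only, so it returns direct children. To collect every descendant, one option is to walk the PID/PPID table breadth-first. This is a sketch: `descendants` is a hypothetical helper that reads `PID PPID` pairs on stdin, and the sample table is made up.

```shell
# Sketch: print all descendants of a root PID, breadth-first.
# Reads "PID PPID" pairs on stdin.
descendants() {
    awk -v root="$1" '
        { kids[$2] = kids[$2] " " $1 }          # map each parent to its children
        END {
            queue[1] = root; head = 1; tail = 1
            while (head <= tail) {
                n = split(kids[queue[head++]], c, " ")
                for (i = 1; i <= n; i++) { print c[i]; queue[++tail] = c[i] }
            }
        }'
}

# Live usage:
#   ps -eo pid=,ppid= | descendants 1200

# Self-contained example with a canned process table:
sample='1200 1
1201 1200
1202 1200
1300 1201
1400 1'
tree=$(printf '%s\n' "$sample" | descendants 1200 | paste -sd' ' -)
echo "$tree"
```

Here 1300 is a grandchild of 1200, so it shows up in the result even though `pgrep -P 1200` would miss it.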
Sending SIGHUP for Config Reload
Many daemons reload configuration on SIGHUP without restarting. This avoids dropping active connections.
# nginx — reload config, gracefully shut down old workers
kill -HUP $(cat /var/run/nginx.pid)
# Equivalent:
systemctl reload nginx
# nginx starts new workers with new config, old workers finish existing requests
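# HAProxy (master-worker mode): reload config. The reload signal is SIGUSR2; SIGHUP only dumps proxy/server status to the logs
kill -USR2 $(cat /var/run/haproxy.pid)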
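# sshd: reread sshd_config. Signal only the listener via its PID file; pidof sshd also matches per-session processes, and SIGHUP would terminate those sessions
kill -HUP $(cat /var/run/sshd.pid)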
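# rsyslog: close and reopen log files (used after log rotation). Current rsyslog versions do not reread the config on SIGHUP; restart the service for that
kill -HUP $(pidof rsyslogd)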
# PostgreSQL — reload pg_hba.conf and postgresql.conf
kill -HUP $(head -1 /var/lib/postgresql/14/main/postmaster.pid)
# Or:
pg_ctl reload -D /var/lib/postgresql/14/main
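# Verify the reload worked: the master PID should be unchanged (worker PIDs will change)
echo "PID before: $(cat /var/run/nginx.pid)"
kill -HUP $(cat /var/run/nginx.pid)
echo "PID after: $(cat /var/run/nginx.pid)"
# Should be the same PID. pidof nginx is not suitable here because it also lists the workers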
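The reload-in-place pattern can be reproduced with a toy daemon. The sketch below is hypothetical (the `daemon` function, the `level` setting, and the temp files are invented for illustration) and only mirrors the general idea of a SIGHUP handler that rereads configuration while the process keeps its PID.

```shell
# Hypothetical toy daemon: rereads its config on SIGHUP without restarting.
conf=$(mktemp); log=$(mktemp)
echo 'level=info' > "$conf"

daemon() {
    load_config() { . "$conf"; echo "loaded level=$level" >> "$log"; }
    trap load_config HUP       # reload in place instead of dying
    load_config
    while :; do sleep 0.1; done
}

daemon & pid=$!
sleep 0.5                      # let it start and load the initial config
echo 'level=debug' > "$conf"
kill -HUP "$pid"
sleep 0.5                      # let the handler run
if kill -0 "$pid" 2>/dev/null; then survived=yes; else survived=no; fi
kill -TERM "$pid"; wait "$pid" 2>/dev/null
reloads=$(cat "$log")
echo "survived=$survived"
echo "$reloads"
rm -f "$conf" "$log"
```

The log shows two loads from the same process: the initial one and the one triggered by SIGHUP.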
Debugging D-State (Uninterruptible Sleep) Processes
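A process in D-state is blocked inside the kernel, usually waiting on I/O. Signals, including SIGKILL, stay pending until that wait ends, so the process cannot be killed in the meantime. The exception is a killable wait (for example nfs_wait_bit_killable), which also shows as D but does respond to SIGKILL.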
# Find D-state processes
ps aux | awk '$8 ~ /^D/ {print $2, $11}'
# What is it waiting on?
cat /proc/<PID>/wchan
# Example output: nfs_wait_bit_killable, blkdev_issue_flush
# Full kernel stack trace (root required)
cat /proc/<PID>/stack
# Example output for NFS hang:
# [<0>] nfs4_run_open_task+0x5c/0xa0
# [<0>] __rpc_execute+0x7e/0x3b0
# [<0>] rpc_wait_bit_killable+0x20/0xf0
# Common causes and fixes:
# 1. NFS server unreachable
mount | grep nfs # Check NFS mounts
showmount -e nfs-server # Test NFS server
# Fix: restore NFS server connectivity, or umount -f -l /mnt/nfs
# 2. Failing disk
dmesg | grep -i "error\|fail\|reset\|timeout" | tail -20
smartctl -a /dev/sda | grep -E "Reallocated|Current_Pending|Offline_Uncorrectable"
# Fix: replace the disk
# 3. Hung FUSE filesystem (sshfs, s3fs, rclone mount)
mount | grep fuse
fusermount -u /mnt/sshfs # Try clean unmount
fusermount -uz /mnt/sshfs # Lazy unmount as last resort
# 4. iSCSI target gone
iscsiadm -m session # List active sessions
dmesg | grep iscsi # Check for timeouts
# Fix: restore iSCSI connectivity or log out stale sessions
There is no way to kill a D-state process. You must fix the underlying I/O issue or reboot.
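When `ps` is unavailable or too slow on a struggling box, the same state letter can be read directly from `/proc`. This is a sketch for Linux only; `proc_state` and `dstate_report` are hypothetical helper names, and reading `wchan` may return `0` or nothing depending on kernel configuration and privileges.

```shell
# Sketch: read a process state straight from /proc (Linux only).
proc_state() {
    local stat rest
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    rest=${stat##*") "}        # drop "PID (comm) "; comm may itself contain spaces or ')'
    echo "${rest%% *}"         # first remaining field: R, S, D, Z, T, ...
}

# List every process currently in D state together with its wait channel.
dstate_report() {
    local dir pid
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        [ "$(proc_state "$pid")" = D ] || continue
        echo "$pid $(cat "$dir/wchan" 2>/dev/null)"
    done
}

# Self-contained check against a process in ordinary interruptible sleep:
sleep 5 & pid=$!
sleep 0.3                      # let it block in its sleep syscall
state=$(proc_state "$pid")
echo "sleep is in state $state"
kill "$pid"; wait "$pid" 2>/dev/null
```

Running `dstate_report` in a loop (for example under `watch`) shows whether the same PIDs stay stuck or whether D-state is only transient.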
Finding a Process by Port
# Which process is listening on port 8080?
ss -tlnp | grep :8080
# LISTEN 0 128 *:8080 *:* users:(("nginx",pid=1200,fd=6))
# Using lsof
lsof -i :8080
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# nginx 1200 root 6u IPv4 24680 0t0 TCP *:8080 (LISTEN)
# Find all listening ports for a process
ss -tlnp | grep "pid=1200"
# Find which process is connected to a remote host
ss -tnp | grep "10.0.1.50"
# Find processes with established connections to a database
ss -tnp dst :5432
# or
lsof -i :5432 -sTCP:ESTABLISHED
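To feed the result into other commands, the PIDs can be pulled out of the `users:(...)` column. A minimal sketch, assuming the `pid=NNN` format shown in the `ss -p` output above; `listener_pids` is a hypothetical helper and the sample line is canned.

```shell
# Sketch: extract unique PIDs from `ss -p` output read on stdin.
listener_pids() {
    grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}

# Live usage:
#   ss -tlnp | grep :8080 | listener_pids

# Self-contained example with a canned ss line (two processes share the socket):
sample='LISTEN 0 128 *:8080 *:* users:(("nginx",pid=1200,fd=6),("nginx",pid=1201,fd=6))'
pids=$(printf '%s\n' "$sample" | listener_pids | paste -sd' ' -)
echo "$pids"
```

Keep in mind that `ss -p` only shows process details for sockets you own unless it runs as root.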
Surviving SSH Disconnect
When your SSH session drops, the login shell receives SIGHUP and forwards it to its jobs, which die unless they handle or ignore the signal. Several ways to prevent this:
# Option 1: nohup (simple, one-off)
nohup ./long-migration.sh > /var/log/migration.log 2>&1 &
echo "PID: $!"
# Option 2: disown (detach an already-running job)
./long-migration.sh > /var/log/migration.log 2>&1 &
disown %1
# Option 3: tmux (best for interactive work)
tmux new-session -d -s migration './long-migration.sh'
# Detach: Ctrl+B then D
# Reattach after reconnect:
tmux attach -t migration
# Option 4: screen (older, still widely available)
screen -dmS migration ./long-migration.sh
# Reattach:
screen -r migration
# Option 5: systemd-run (best for production tasks)
systemd-run --unit=db-migration --remain-after-exit \
/opt/scripts/run-migration.sh
# Monitor:
journalctl -u db-migration -f
systemctl status db-migration
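The effect of `nohup` can be checked locally without dropping a real session. This sketch sends SIGHUP by hand to two otherwise identical `sleep` processes; it assumes the calling shell itself does not already ignore SIGHUP.

```shell
# Sketch: the same SIGHUP kills a plain job but not one started under nohup.
sleep 5 & plain=$!
nohup sleep 5 >/dev/null 2>&1 & protected=$!
sleep 0.5                      # let nohup set SIGHUP to "ignore" before it execs sleep

kill -HUP "$plain" "$protected"
wait "$plain" 2>/dev/null; plain_status=$?    # 129 = 128 + 1 (SIGHUP)

sleep 0.2
if kill -0 "$protected" 2>/dev/null; then protected_alive=yes; else protected_alive=no; fi
echo "plain exit status: $plain_status, nohup job alive: $protected_alive"

kill -TERM "$protected"; wait "$protected" 2>/dev/null   # clean up
```

`nohup` works by setting SIGHUP to ignored before running the command, which is why it has to be decided at launch time; `disown`, tmux, and systemd-run cover the cases where it was not.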
CPU-Hogging Process Investigation
When a process is consuming excessive CPU:
# Identify the process
ps aux --sort=-%cpu | head -10
# Check if it is a single thread or many
ps -o nlwp= -p <PID> # Thread count (ps -T | wc -l would also count the header line)
ps -T -p <PID> -o spid,%cpu | sort -k2 -rn | head # CPU per thread
# What is it doing? (Quick peek with strace; keep it to a few seconds, since strace slows the target)
timeout 5 strace -p <PID> -c 2>&1
# Shows syscall summary — is it in a tight loop? Doing excessive I/O?
# Check for a spin loop (high nonvoluntary context switches)
cat /proc/<PID>/status | grep ctxt
# nonvoluntary_ctxt_switches: 450000 ← preempted constantly = CPU-bound loop
# Reduce its priority while you investigate
renice 19 -p <PID>
# Now it gets CPU only when nothing else wants it
# Freeze it entirely while you analyze
kill -STOP <PID>
# Take your time, analyze, then:
kill -CONT <PID> # Resume
# or
kill -TERM <PID> # Terminate
# Deeper analysis with perf (lightweight, production-safe)
perf top -p <PID> # Live view of hot functions
perf record -p <PID> -g -- sleep 10 # Record 10s of profile data
perf report # Analyze the recording
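The context-switch counters are cumulative since process start, so a rate over a short interval is usually more telling than the raw numbers. A minimal sketch for Linux, with `ctxt_counts` and `ctxt_delta` as hypothetical helper names:

```shell
# Sketch: context switches of a PID over an interval (Linux /proc only).
ctxt_counts() {   # prints "<voluntary> <nonvoluntary>"
    awk '/^voluntary_ctxt_switches/    { v = $2 }
         /^nonvoluntary_ctxt_switches/ { n = $2 }
         END { print v + 0, n + 0 }' "/proc/$1/status"
}

ctxt_delta() {    # usage: ctxt_delta PID [seconds]
    local before after
    [ -r "/proc/$1/status" ] || return 1
    before=$(ctxt_counts "$1")
    sleep "${2:-1}"
    after=$(ctxt_counts "$1")
    echo "$before $after" |
        awk '{ printf "voluntary=%d nonvoluntary=%d\n", $3 - $1, $4 - $2 }'
}

# Self-contained example against an idle process:
sleep 5 & pid=$!
delta=$(ctxt_delta "$pid" 0.5)
echo "$delta"
kill "$pid"; wait "$pid" 2>/dev/null
```

A steadily high nonvoluntary delta suggests the process is being preempted while runnable (CPU-bound), while a high voluntary delta points to frequent blocking on I/O or locks.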
Priority Adjustment for Background Tasks
Keep production services responsive by deprioritizing batch work:
# Run backup at lowest priority
nice -n 19 tar czf /backups/daily.tar.gz /var/data/
# Run log compression at low priority with ionice too
nice -n 15 ionice -c 2 -n 7 gzip /var/log/app/access.log.1
# Adjust a running process
renice 10 -p $(pgrep -f "batch-processor")
# Set CPU affinity — pin batch work to specific cores
taskset -c 0,1 nice -n 15 ./batch-job.sh
# Production uses cores 2-7, batch is confined to 0-1
# ionice — control I/O scheduling priority
ionice -c 3 -p <PID> # Idle class: only gets I/O when nobody else wants it
ionice -c 2 -n 7 -p <PID> # Best-effort, lowest priority
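One detail worth verifying on your own system: `nice -n` applies an adjustment relative to the caller's niceness rather than setting an absolute value. The sketch below uses the fact that `nice` with no arguments prints the current niceness; the +5 adjustment is arbitrary.

```shell
# Sketch: nice -n is relative to the calling shell's niceness.
base=$(nice)                   # with no arguments, nice prints the current niceness
child=$(nice -n 5 nice)        # run `nice` itself at +5 to observe the result

expected=$((base + 5))
if [ "$expected" -gt 19 ]; then expected=19; fi   # niceness is capped at 19

echo "shell niceness: $base, child niceness: $child (expected $expected)"
```

This matters when a wrapper script is already running niced: a batch job launched from it with `nice -n 15` will land at the cap sooner than you might expect.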
strace for Understanding Process Behavior
strace intercepts system calls. It is the Swiss Army knife for understanding what a process is actually doing.
# What is this process doing RIGHT NOW?
strace -p <PID> -e trace=all 2>&1 | head -50
# Syscall summary (which calls dominate?)
timeout 10 strace -p <PID> -c
# What files is it opening?
strace -p <PID> -e trace=openat,open -f 2>&1 | head -20
# What network calls is it making?
strace -p <PID> -e trace=network -f 2>&1 | head -20
# Where is it spending time? (with timing)
strace -p <PID> -T -e trace=read,write 2>&1 | head -20
# The number in angle brackets is time spent in that syscall
# Trace a new command from start
strace -f -o /tmp/trace.log ./my-command
# -f follows forks (child processes too)
# Common patterns in output:
# Tight loop of futex() calls = lock contention
# Repeated poll()/epoll_wait() returning 0 = idle, waiting for events
# read() returning -1 EAGAIN = non-blocking I/O, normal
# connect() hanging = network connectivity issue
# openat() returning -1 ENOENT = missing config/data file
Warning: strace can slow the traced process severely (on the order of 10-100x for syscall-heavy workloads). Use timeout and keep traces short in production. For lighter-weight alternatives, consider perf trace or bpftrace.
Power One-Liners
The Magic SysRq emergency reboot
# Alt+SysRq+R-E-I-S-U-B
# Mnemonic: "Reboot Even If System Utterly Broken"
echo s > /proc/sysrq-trigger # sync disks first
echo b > /proc/sysrq-trigger # then reboot immediately (b does not sync or unmount on its own)
Breakdown: SysRq talks directly to the kernel, bypassing userspace. Each letter triggers a kernel function: R takes the keyboard out of raw mode, E sends SIGTERM to all processes, I sends SIGKILL, S syncs disks, U remounts filesystems read-only, and B reboots.
[!TIP] When to use: System completely unresponsive — no SSH, no shell, no GUI. Last resort before hard power cycle.
Quick Reference
| Task | Command |
|---|---|
| Graceful shutdown | kill -TERM <PID> then wait, then kill -9 |
| Find zombies | ps -eo pid,ppid,stat,comm \| awk '$3~/^Z/' |
| Process tree | pstree -p <PID> or ps auxf |
| Config reload | kill -HUP <PID> |
| Find by port | ss -tlnp \| grep :<port> |
| Freeze process | kill -STOP <PID>, resume with kill -CONT |
| Survive logout | nohup cmd & or tmux or systemd-run |
| D-state debug | cat /proc/<PID>/stack |
| CPU investigation | perf top -p <PID> |
| Deprioritize | renice 19 -p <PID> |