Linux Signals & Process Control - Street-Level Ops

Killing Stuck Processes: The Correct Sequence

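Never reach for kill -9 first: SIGKILL cannot be caught, so the process gets no chance to flush data, release locks, or clean up temporary files. Follow this escalation: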

PID=1234

# Step 1: Ask politely
kill -TERM $PID
echo "Sent SIGTERM, waiting..."

# Step 2: Wait for graceful shutdown (10-30 seconds depending on the service)
sleep 10

# Step 3: Check if it is still alive
if kill -0 $PID 2>/dev/null; then
    echo "Still alive after SIGTERM. Sending SIGKILL..."
    kill -9 $PID
else
    echo "Process exited cleanly."
fi

Script it for production use:

#!/bin/bash
# graceful-kill.sh — SIGTERM then SIGKILL with configurable timeout
PID=${1:?usage: graceful-kill.sh PID [TIMEOUT_SECONDS]}   # abort with a usage message if no PID is given
TIMEOUT=${2:-15}

if ! kill -0 "$PID" 2>/dev/null; then
    echo "PID $PID is not running"
    exit 0
fi

echo "Sending SIGTERM to $PID..."
kill -TERM "$PID"

for i in $(seq 1 "$TIMEOUT"); do
    if ! kill -0 "$PID" 2>/dev/null; then
        echo "Process exited after ${i}s"
        exit 0
    fi
    sleep 1
done

echo "Process still alive after ${TIMEOUT}s. Sending SIGKILL..."
kill -9 "$PID"
sleep 1

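if kill -0 "$PID" 2>/dev/null; then
    # kill -0 also succeeds for zombies, so check the state before blaming the kernel.
    echo "WARNING: PID $PID still present after SIGKILL. Check 'ps -o stat= -p $PID': D = uninterruptible I/O wait, Z = zombie not yet reaped by its parent."
    exit 1
fi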
echo "Process killed."
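
`kill -0` reports success for a zombie as well as for a live process, so a PID that is still present after SIGKILL is not necessarily stuck in the kernel. The following is a minimal sketch, assuming a Linux /proc filesystem, that reads the state letter directly. The names `proc_state` and `explain_state` are hypothetical helpers for illustration, not part of the script above.

```shell
# proc_state PID: print the one-letter state of PID
# (R running, S sleeping, D uninterruptible, Z zombie, T stopped).
# Assumes Linux, where /proc/PID/stat reads "pid (comm) state ppid ...".
proc_state() {
    [ -r "/proc/$1/stat" ] || return 1
    # comm may itself contain spaces or ")", so strip through the LAST ") ".
    sed 's/^.*) //' "/proc/$1/stat" | cut -d' ' -f1
}

# explain_state PID: turn the state letter into a hint for the operator.
explain_state() {
    state=$(proc_state "$1")
    case "$state" in
        "") echo "gone" ;;
        D)  echo "uninterruptible wait: investigate the I/O it is blocked on" ;;
        Z)  echo "zombie: already dead, waiting for its parent to reap it" ;;
        *)  echo "alive (state $state)" ;;
    esac
}
```

Calling `explain_state "$PID"` after the final `kill -0` check would tell a D-state hang apart from a zombie that only needs its parent fixed.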

Finding and Killing Zombie Processes

Zombies cannot be killed: they are already dead, and only their process-table entry remains until the parent calls wait(). You fix the parent.

# Step 1: Find zombies
ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /^Z/'
#   PID  PPID STAT USER     COMMAND
# 15432 12001 Z+   app      [worker] <defunct>
# 15438 12001 Z+   app      [worker] <defunct>

# Step 2: Identify the parent
ps -p 12001 -o pid,ppid,stat,comm,args
#   PID  PPID STAT COMMAND ARGS
# 12001  8900 S    python  python /app/main.py

# Step 3: Try sending SIGCHLD to the parent (only helps if the parent has a SIGCHLD handler that calls wait())
kill -CHLD 12001

# Step 4: If zombies persist, the parent is buggy. Kill the parent.
kill -TERM 12001
# PID 1 (init/systemd) adopts and reaps the zombies automatically.

# Step 5: Verify zombies are gone
ps -eo stat= | grep -c '^Z'   # prints 0 when no zombies remain
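
To see the mechanism in isolation, a zombie can be produced on purpose with a parent that never calls wait(). This is a minimal sketch assuming Linux /proc. `make_zombie` and `zombies_of` are hypothetical names introduced here for illustration.

```shell
# make_zombie [SECONDS]: start a throwaway parent that forks a short-lived
# child and then exec()s into sleep, which never calls wait().
# Prints the parent's PID; the child stays a zombie until the parent exits.
make_zombie() {
    ( sleep 0.2 & exec sleep "${1:-5}" ) >/dev/null 2>&1 &
    echo $!
}

# zombies_of PPID: print the PIDs of zombie children of PPID.
# After stripping "pid (comm) ", /proc/PID/stat reads "state ppid ...".
zombies_of() {
    for stat_file in /proc/[0-9]*/stat; do
        # A process can exit between glob expansion and the read; skip it.
        fields=$(sed 's/^.*) //' "$stat_file" 2>/dev/null) || continue
        state=${fields%% *}
        rest=${fields#* }
        ppid=${rest%% *}
        if [ "$state" = "Z" ] && [ "$ppid" = "$1" ]; then
            pid=${stat_file#/proc/}
            echo "${pid%/stat}"
        fi
    done
}
```

Running `parent=$(make_zombie 5); sleep 1; zombies_of "$parent"` should list one PID, and that entry disappears once the parent exits and the orphan is reaped.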

Process Tree Analysis

When investigating a service, start with the tree view to understand the process hierarchy:

# Full system tree with PIDs
pstree -p

# Tree for a specific service
pstree -p $(systemctl show -p MainPID nginx | cut -d= -f2)
# nginx(1200)─┬─nginx(1201)
#              ├─nginx(1202)
#              ├─nginx(1203)
#              └─nginx(1204)

# ps forest view with resource usage
ps auxf | grep -A 5 nginx

# Show thread count per process
ps -eo pid,nlwp,comm --sort=-nlwp | head -20
# nlwp = number of light-weight processes (threads)

# List the direct children of a PID (pgrep -P does not recurse into grandchildren)
pgrep -P 1200 --list-full
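
`pgrep -P` matches on the parent PID only, so it returns one level of the tree. A recursive walk collects the whole subtree. This is a minimal sketch, and `descendants` is a hypothetical helper name.

```shell
# descendants PID: print every PID below PID in the process tree,
# each child listed before its own children.
descendants() {
    # The word list is expanded once, before the loop body runs, so the
    # recursive call reusing the variable name does not disturb iteration.
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}
```

For example, `descendants 1200` prints one PID per line. The tree can change while it is being walked, so treat the result as a snapshot.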

Sending SIGHUP for Config Reload

Many daemons reload configuration on SIGHUP without restarting. This avoids dropping active connections.

# nginx — reload config, gracefully shut down old workers
kill -HUP $(cat /var/run/nginx.pid)
# Equivalent:
systemctl reload nginx
# nginx starts new workers with new config, old workers finish existing requests

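# HAProxy: in master-worker mode (-W / -Ws) the reload signal is SIGUSR2, sent to the master process, not SIGHUP
kill -USR2 $(cat /var/run/haproxy.pid)
# Or:
systemctl reload haproxy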

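# sshd: reread sshd_config. Signal only the listener: pidof sshd also returns the
# per-session processes, and SIGHUP would terminate those sessions (including yours).
kill -HUP $(cat /var/run/sshd.pid)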

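# rsyslog: close and reopen log files (what logrotate relies on). Modern rsyslog (v5 and later)
# does not reread rsyslog.conf on HUP; restart the service to apply config changes.
kill -HUP $(pidof rsyslogd)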

# PostgreSQL — reload pg_hba.conf and postgresql.conf
kill -HUP $(head -1 /var/lib/postgresql/14/main/postmaster.pid)
# Or:
pg_ctl reload -D /var/lib/postgresql/14/main

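# Verify reload worked: the master PID should be unchanged (worker PIDs will change)
echo "Master PID before: $(cat /var/run/nginx.pid)"
kill -HUP $(cat /var/run/nginx.pid)
echo "Master PID after: $(cat /var/run/nginx.pid)"
# Should be the same PID. pidof nginx is unsuitable here because it also lists the workers, which are replaced.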
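
On the daemon's side, reload-on-SIGHUP is just a signal handler that rereads the configuration while the main loop keeps running. The sketch below shows the pattern in shell. `CONFIG_FILE`, `load_config`, and `run_daemon` are hypothetical names, and the one-line config format is an assumption made for the demonstration.

```shell
# Hypothetical config location; override by setting CONFIG_FILE.
CONFIG_FILE=${CONFIG_FILE:-/tmp/demo-daemon.conf}

# Reread the config file; fall back to a default when it is missing.
load_config() {
    GREETING=$(cat "$CONFIG_FILE" 2>/dev/null || echo "default")
    echo "config loaded: $GREETING"
}

run_daemon() {
    trap 'load_config' HUP                      # reload without exiting
    trap 'echo "shutting down"; exit 0' TERM    # graceful shutdown
    load_config
    while :; do
        # Sleep in the background and wait on it: wait returns as soon as a
        # trapped signal arrives, so the handler runs promptly.
        sleep 1 &
        wait $!
    done
}
```

Started with `run_daemon &`, the loop keeps its PID across `kill -HUP`, which is the property the PID check above relies on.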

Debugging D-State (Uninterruptible Sleep) Processes

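A process in D-state is blocked inside the kernel, usually waiting for an I/O operation. Signals, including SIGKILL, stay pending until that wait ends, so the process cannot be killed in the usual sense. The exception is a wait the kernel marks as killable (for example nfs_wait_bit_killable in wchan): it shows as D but does respond to SIGKILL.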

# Find D-state processes
ps aux | awk '$8 ~ /^D/ {print $2, $11}'

# What is it waiting on?
cat /proc/<PID>/wchan
# Example output: nfs_wait_bit_killable, blkdev_issue_flush

# Full kernel stack trace (root required)
cat /proc/<PID>/stack
# Example output for NFS hang:
# [<0>] nfs4_run_open_task+0x5c/0xa0
# [<0>] __rpc_execute+0x7e/0x3b0
# [<0>] rpc_wait_bit_killable+0x20/0xf0

# Common causes and fixes:
# 1. NFS server unreachable
mount | grep nfs    # Check NFS mounts
showmount -e nfs-server  # Test NFS server
# Fix: restore NFS server connectivity, or umount -f -l /mnt/nfs

# 2. Failing disk
dmesg | grep -i "error\|fail\|reset\|timeout" | tail -20
smartctl -a /dev/sda | grep -E "Reallocated|Current_Pending|Offline_Uncorrectable"
# Fix: replace the disk

# 3. Hung FUSE filesystem (sshfs, s3fs, rclone mount)
mount | grep fuse
fusermount -u /mnt/sshfs   # Try clean unmount
fusermount -uz /mnt/sshfs  # Lazy unmount as last resort

# 4. iSCSI target gone
iscsiadm -m session        # List active sessions
dmesg | grep iscsi         # Check for timeouts
# Fix: restore iSCSI connectivity or log out stale sessions

A process in a truly uninterruptible wait cannot be killed; the signal is delivered only once the kernel operation returns. Fix the underlying I/O issue or reboot.
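
Brief D-state is normal during ordinary disk I/O, so a single snapshot can mislead. A rough sketch that reports only PIDs seen in D-state in two samples taken some seconds apart; `dstate_pids` and `persistent_dstate` are hypothetical helper names, and PID reuse between samples is ignored.

```shell
# dstate_pids: print the PIDs currently in D-state, one per line.
dstate_pids() {
    ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'
}

# persistent_dstate [SECONDS]: print PIDs that were in D-state both now and
# SECONDS later (default 5). Each sample lists a PID at most once, so any
# PID that appears twice in the combined stream was present in both.
persistent_dstate() {
    { dstate_pids; sleep "${1:-5}"; dstate_pids; } | sort | uniq -d
}
```

Feeding the result into `cat /proc/<PID>/wchan` narrows the investigation to processes that are actually stuck.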

Finding a Process by Port

# Which process is listening on port 8080?
ss -tlnp | grep :8080
# LISTEN  0  128  *:8080  *:*  users:(("nginx",pid=1200,fd=6))

# Using lsof
lsof -i :8080
# COMMAND  PID  USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# nginx   1200  root    6u  IPv4  24680      0t0  TCP *:8080 (LISTEN)

# Find all listening ports for a process
ss -tlnp | grep "pid=1200"

# Find which process is connected to a remote host
ss -tnp | grep "10.0.1.50"

# Find processes with established connections to a database
ss -tnp dst :5432
# or
lsof -i :5432 -sTCP:ESTABLISHED
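
For scripting, the PID has to be pulled out of the `users:(...)` column. A small sketch that does this with a filter expression instead of grep on the port; `pids_from_ss` and `pids_on_port` are hypothetical helper names, and `pids_on_port` assumes an iproute2 `ss` that supports `-H` (no header).

```shell
# pids_from_ss: read ss -p output on stdin, print the unique PIDs it mentions.
pids_from_ss() {
    grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}

# pids_on_port PORT: PIDs of processes listening on TCP PORT.
# The filter matches the port exactly, so :8080 does not also match :18080.
pids_on_port() {
    ss -Htlnp "sport = :$1" | pids_from_ss
}
```

Run as root to see processes owned by other users; without privileges `ss` omits the `users:` column for them.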

Surviving SSH Disconnect

When your SSH session drops, the login shell receives SIGHUP and forwards it to its jobs, which die unless they ignore or handle the signal. Several ways to prevent this:

# Option 1: nohup (simple, one-off)
nohup ./long-migration.sh > /var/log/migration.log 2>&1 &
echo "PID: $!"

# Option 2: disown (detach an already-running job)
./long-migration.sh > /var/log/migration.log 2>&1 &
disown %1

# Option 3: tmux (best for interactive work)
tmux new-session -d -s migration './long-migration.sh'
# Detach: Ctrl+B then D
# Reattach after reconnect:
tmux attach -t migration

# Option 4: screen (older, still widely available)
screen -dmS migration ./long-migration.sh
# Reattach:
screen -r migration

# Option 5: systemd-run (best for production tasks)
systemd-run --unit=db-migration --remain-after-exit \
    /opt/scripts/run-migration.sh
# Monitor:
journalctl -u db-migration -f
systemctl status db-migration
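
At its core, nohup works by ignoring SIGHUP before starting the command, and ignored signals stay ignored in child processes. A minimal sketch of the same idea inside a script; `run_hup_immune` is a hypothetical wrapper name.

```shell
# run_hup_immune CMD [ARGS...]: run CMD with SIGHUP ignored.
# An ignored signal is inherited across fork and exec, so CMD and its
# children survive the hangup that follows a dropped session.
run_hup_immune() {
    trap '' HUP
    "$@"
}
```

Unlike nohup, this does not redirect output, so pair it with explicit redirection, for example `run_hup_immune ./long-migration.sh > /var/log/migration.log 2>&1 &`.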

CPU-Hogging Process Investigation

When a process is consuming excessive CPU:

# Identify the process
ps aux --sort=-%cpu | head -10

# Check if it is a single thread or many
ps -o nlwp= -p <PID>      # Thread count (ps -T | wc -l would also count the header line)
ps -T -p <PID> -o spid,%cpu | sort -k2 -rn | head  # CPU per thread

# What is it doing? (Quick peek with strace; keep it short, because strace slows the target while attached)
timeout 5 strace -p <PID> -c 2>&1
# Shows syscall summary — is it in a tight loop? Doing excessive I/O?

# Check for a spin loop (high nonvoluntary context switches)
grep ctxt /proc/<PID>/status
# nonvoluntary_ctxt_switches: 450000  ← preempted constantly = CPU-bound loop

# Reduce its priority while you investigate
renice 19 -p <PID>
# Now it gets only a small share of CPU whenever other processes are runnable

# Freeze it entirely while you analyze
kill -STOP <PID>
# Take your time, analyze, then:
kill -CONT <PID>   # Resume
# or
kill -TERM <PID>   # Terminate

# Deeper analysis with perf (lightweight, production-safe)
perf top -p <PID>                # Live view of hot functions
perf record -p <PID> -g -- sleep 10  # Record 10s of profile data
perf report                           # Analyze the recording
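
The context-switch counters in /proc/<PID>/status are totals since the process started, so the rate over a short window is more telling than the absolute number. A minimal sketch that samples twice and prints the difference; `read_ctxt` and `ctxt_delta` are hypothetical helper names.

```shell
# read_ctxt PID: print "voluntary nonvoluntary" context-switch totals.
read_ctxt() {
    [ -r "/proc/$1/status" ] || return 1
    awk '/^voluntary_ctxt_switches/    {v = $2}
         /^nonvoluntary_ctxt_switches/ {n = $2}
         END {print v, n}' "/proc/$1/status"
}

# ctxt_delta PID [SECONDS]: context switches during the window (default 5s).
ctxt_delta() {
    before=$(read_ctxt "$1") || return 1
    sleep "${2:-5}"
    after=$(read_ctxt "$1") || return 1
    # Reuse the positional parameters: $1 $2 = before, $3 $4 = after.
    set -- $before $after
    echo "voluntary=$(( $3 - $1 )) nonvoluntary=$(( $4 - $2 ))"
}
```

A large nonvoluntary delta with almost no voluntary switches suggests a CPU-bound loop; the reverse suggests a process that mostly blocks on I/O or locks.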

Priority Adjustment for Background Tasks

Keep production services responsive by deprioritizing batch work:

# Run backup at lowest priority
nice -n 19 tar czf /backups/daily.tar.gz /var/data/

# Run log compression at low priority with ionice too
nice -n 15 ionice -c 2 -n 7 gzip /var/log/app/access.log.1

# Adjust a running process
renice 10 -p $(pgrep -f "batch-processor")

# Set CPU affinity — pin batch work to specific cores
taskset -c 0,1 nice -n 15 ./batch-job.sh
# Production uses cores 2-7, batch is confined to 0-1

# ionice — control I/O scheduling priority
ionice -c 3 -p <PID>  # Idle class: only gets I/O when nobody else wants it (needs an I/O scheduler that honors priorities, such as BFQ)
ionice -c 2 -n 7 -p <PID>  # Best-effort, lowest priority
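
The CPU and I/O settings are often wanted together, so they can be wrapped once and reused for every batch job. A minimal sketch; `run_low_priority` is a hypothetical wrapper name, and the fallback covers hosts without util-linux `ionice`.

```shell
# run_low_priority CMD [ARGS...]: run CMD at the lowest CPU priority and,
# where ionice is available, in the idle I/O class.
run_low_priority() {
    if command -v ionice >/dev/null 2>&1; then
        # -t tells ionice to carry on if the I/O priority cannot be set.
        nice -n 19 ionice -t -c 3 "$@"
    else
        nice -n 19 "$@"
    fi
}
```

Usage mirrors the examples above: `run_low_priority tar czf /backups/daily.tar.gz /var/data/`.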

strace for Understanding Process Behavior

strace intercepts system calls. It is the Swiss Army knife for understanding what a process is actually doing.

# What is this process doing RIGHT NOW?
strace -p <PID> -e trace=all 2>&1 | head -50

# Syscall summary (which calls dominate?)
timeout 10 strace -p <PID> -c

# What files is it opening?
strace -p <PID> -e trace=openat,open -f 2>&1 | head -20

# What network calls is it making?
strace -p <PID> -e trace=network -f 2>&1 | head -20

# Where is it spending time? (with timing)
strace -p <PID> -T -e trace=read,write 2>&1 | head -20
# The number in angle brackets is time spent in that syscall

# Trace a new command from start
strace -f -o /tmp/trace.log ./my-command
# -f follows forks (child processes too)

# Common patterns in output:
# Tight loop of futex() calls = lock contention
# Repeated poll()/epoll_wait() returning 0 = idle, waiting for events
# read() returning -1 EAGAIN = non-blocking I/O, normal
# connect() hanging = network connectivity issue
# openat() returning -1 ENOENT = missing config/data file

Warning: strace can slow the traced process by 10-100x, and the penalty grows with the syscall rate. Use timeout and keep traces short in production. For lightweight alternatives, consider perf trace or bpftrace.

Power One-Liners

The Magic SysRq emergency reboot

# Alt+SysRq+R-E-I-S-U-B
# Mnemonic: "Reboot Even If System Utterly Broken"
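# Programmatic equivalent (as root). Order matters: b reboots immediately, so it must come last.
echo s > /proc/sysrq-trigger   # sync disks
echo u > /proc/sysrq-trigger   # remount filesystems read-only
echo b > /proc/sysrq-trigger   # reboot without any further sync or unmount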

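Breakdown: SysRq talks directly to the kernel, bypassing userspace. Each letter triggers a kernel function: R takes the keyboard out of raw mode, E sends SIGTERM to all processes except init, I sends SIGKILL to all processes except init, S syncs disks, U remounts filesystems read-only, and B reboots immediately.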

[!TIP] When to use: System completely unresponsive — no SSH, no shell, no GUI. Last resort before hard power cycle.

Quick Reference

Task	Command
Graceful shutdown	`kill -TERM <PID>` then wait, then `kill -9`
Find zombies	`ps -eo pid,ppid,stat,comm | awk '$3~/^Z/'`
Process tree	`pstree -p <PID>` or `ps auxf`
Config reload	`kill -HUP <PID>`
Find by port	`ss -tlnp | grep :<port>`
Freeze process	`kill -STOP <PID>`, resume with `kill -CONT`
Survive logout	`nohup cmd &` or `tmux` or `systemd-run`
D-state debug	`cat /proc/<PID>/stack`
CPU investigation	`perf top -p <PID>`
Deprioritize	`renice 19 -p <PID>`