
Portal | Level: L1: Foundations | Topics: Process Management, Linux Fundamentals, Bash / Shell Scripting | Domain: Linux

Process Management - Primer

Why This Matters

Every service you deploy, every container you run, every script you fire off — it is a process. When things go wrong in production, the answer is almost always hiding in process behavior: a zombie consuming a PID slot, a D-state process blocking a mount, an orphan leaking file descriptors. If you cannot read process state, you cannot debug Linux systems. Period.

Understanding process management is not about memorizing signal numbers. It is about knowing how the kernel manages work, how parent-child relationships define cleanup responsibility, and how to intervene surgically when something goes sideways.


Process Lifecycle: Fork, Exec, Wait

Every process in Linux begins the same way. There is no single "create a process running this program" system call; process creation and program loading are separate steps. Instead:

Parent Process (PID 100)
    ├── fork()  ──────▶  Child Process (PID 101)
    │                     [exact copy of parent]
    │                          │
    │                          ├── exec()
    │                          │   [replaces memory with new program]
    │                          │
    │                          ├── ... does work ...
    │                          │
    │                          └── exit(status)
    │                                │
    └── wait(&status)  ◀─────────────┘
        [collects exit code, reaps child]

Under the hood: Modern Linux does not actually copy the parent's memory on fork(). It uses copy-on-write (COW): parent and child share the same physical pages, marked read-only. Only when one process writes to a page does the kernel create a private copy. This makes fork() fast even for processes using gigabytes of RAM. Redis exploits this for background saves -- fork() creates a snapshot without doubling memory usage (unless the dataset is heavily modified during the save).

  1. fork(): Parent creates a child. Child is an almost-exact copy (same memory, file descriptors, environment). Child gets a new PID.
  2. exec(): Child replaces itself with a new program. The PID stays the same.
  3. exit(): Child terminates, becomes a zombie until parent calls wait().
  4. wait(): Parent collects exit status. Zombie is reaped. PID is freed.
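The cycle is directly observable from the shell, which fork()s a child for every external command and exposes wait() as a builtin. A minimal sketch:

```shell
# The shell fork()s a child, which exec()s /bin/sleep;
# $! is the child's PID, and the `wait` builtin reaps it
sleep 1 &
child=$!

wait "$child"                      # parent blocks until the child exits
echo "reaped $child, status $?"    # prints: reaped <PID>, status 0
```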

This is why every process has a parent. Check with:

ps -ef --forest
# or
pstree -p

The exceptions are PID 1 (init/systemd), which has no parent and adopts orphans, and kernel threads, which are children of kthreadd (PID 2).


Signals

Signals are the kernel's way of poking a process. They are software interrupts.

The Signals That Matter

Signal    Number  Default Action  Can Catch?  Use Case
SIGHUP         1  Terminate       Yes         Reload config (daemons)
SIGINT         2  Terminate       Yes         Ctrl+C
SIGQUIT        3  Core dump       Yes         Ctrl+\ (with core dump)
SIGKILL        9  Terminate       No          Unconditional kill
SIGSEGV       11  Core dump       Yes         Segmentation fault
SIGTERM       15  Terminate       Yes         Polite shutdown request
SIGSTOP       19  Stop            No          Unconditional pause
SIGCONT       18  Continue        Yes         Resume stopped process
SIGCHLD       17  Ignore          Yes         Child state change
SIGUSR1       10  Terminate       Yes         Application-defined
SIGUSR2       12  Terminate       Yes         Application-defined

SIGTERM vs SIGKILL — This Is Not Optional Knowledge

SIGTERM (15):
  "Please shut down gracefully."
  - Process CAN catch it
  - Process CAN clean up (flush buffers, close connections, remove PID files)
  - Process CAN ignore it (badly behaved, but possible)

SIGKILL (9):
  "You are dead. The kernel is removing you. Now."
  - Process CANNOT catch it
  - Process CANNOT clean up
  - Kernel terminates the process immediately
  - Shared memory, temp files, locks — all left behind

Always send SIGTERM first. Wait. Only send SIGKILL if the process does not respond. This is what docker stop does (SIGTERM, then SIGKILL after 10s) and what Kubernetes does during pod termination.

Mnemonic: "TERM asks, KILL takes." SIGTERM (15) is a polite request the process can handle. SIGKILL (9) is the kernel forcibly removing the process. Only two signals cannot be caught, blocked, or ignored: SIGKILL (9) and SIGSTOP (19). Everything else is advisory.

# Correct shutdown sequence
kill $PID            # Sends SIGTERM (default)
sleep 5
kill -0 $PID 2>/dev/null && kill -9 $PID   # SIGKILL only if still alive
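On the receiving side, a SIGTERM handler is one `trap` in shell. A sketch of a graceful-shutdown skeleton (the PID-file path is made up):

```shell
#!/bin/bash
# Graceful shutdown: catch SIGTERM, clean up, exit deliberately
cleanup() {
  echo "caught SIGTERM, cleaning up"
  rm -f /tmp/myapp.pid        # hypothetical PID file
  exit 0
}
trap cleanup TERM

echo $$ > /tmp/myapp.pid
while :; do sleep 1; done     # a SIGKILL here would skip cleanup entirely
```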

Sending Signals

kill -SIGTERM 1234         # By name
kill -15 1234              # By number
kill -TERM 1234            # Short name
killall -TERM nginx        # By process name (all matching)
pkill -TERM -f "python app.py"  # By command pattern
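`kill -l` translates between names and numbers, which is useful for sanity-checking the table above on your own platform (some signal numbers vary between architectures):

```shell
kill -l 15       # prints: TERM
kill -l TERM     # prints: 15
kill -l          # every signal this platform defines
```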

Process States

Every process is in one of these states at any given moment:

┌─────────┐    fork()    ┌─────────┐
│ Created │─────────────▶│  Ready  │
└─────────┘              │   (R)   │
                         └────┬────┘
                              │ scheduled
                              ▼
                         ┌─────────┐
               ┌────────▶│ Running │◀────────┐
               │         │   (R)   │         │
               │         └────┬────┘         │
               │              │              │
          wake │    ┌─────────┼─────────┐    │ continued
               │    │         │         │    │ (SIGCONT)
               │    ▼         ▼         ▼    │
           ┌─────────┐  ┌─────────┐  ┌─────────┐
           │Sleeping │  │ Zombie  │  │ Stopped │
           │  (S/D)  │  │   (Z)   │  │   (T)   │
           └─────────┘  └─────────┘  └─────────┘
State                       Code  Meaning                                You Care Because...
Running                     R     On CPU or ready to run                 Normal, healthy
Sleeping (interruptible)    S     Waiting for an event, can be signaled  Normal; most processes live here
Sleeping (uninterruptible)  D     Waiting for I/O, CANNOT be signaled    Danger: unkillable, usually disk/NFS
Stopped                     T     Paused by signal (SIGSTOP/SIGTSTP)     Job control, debugging
Zombie                      Z     Exited, waiting for parent to reap     PID leak if parent never waits
Dead                        X     Being removed                          Transient, rarely seen
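A quick census of states across the whole box, reading /proc directly so it works even where ps is missing (busybox containers, rescue shells):

```shell
# Count processes per state via the State: line of /proc/<PID>/status
for d in /proc/[0-9]*; do
  awk '/^State:/ {print $2}' "$d/status" 2>/dev/null
done | sort | uniq -c | sort -rn
```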

D-State: The Unkillable Process

A process in D-state (uninterruptible sleep) cannot be killed — not even with SIGKILL. It is waiting for a kernel-level I/O operation to complete. Common causes:

  • NFS server is unreachable
  • Disk is failing
  • FUSE filesystem is hung
  • iSCSI target is gone
# Find D-state processes
ps aux | awk '$8 ~ /D/'

# Check what they are waiting on
cat /proc/<PID>/wchan
cat /proc/<PID>/stack

You cannot kill D-state processes. You fix the I/O subsystem they are waiting on, or you reboot.

Debug clue: A sudden spike in D-state processes is almost always a storage problem: NFS server down, SAN path failure, disk dying, or FUSE filesystem stuck. Check dmesg for I/O errors and /proc/<PID>/wchan to see which kernel function the process is blocked on. Common wchan values: nfs_wait_bit_killable (NFS), blkdev_issue_flush (disk), fuse_request_wait (FUSE).
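The ps and wchan checks above combine into a single /proc walk (a sketch; reading wchan for other users' processes may require root, in which case the field shows up empty):

```shell
# Print PID and the blocking kernel function for every D-state process
for d in /proc/[0-9]*; do
  state=$(awk '/^State:/ {print $2}' "$d/status" 2>/dev/null)
  if [ "$state" = "D" ]; then
    printf '%s\t%s\n' "${d#/proc/}" "$(cat "$d/wchan" 2>/dev/null)"
  fi
done
```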


Zombies and Orphans

Zombies

A zombie is a process that has exited but whose parent has not yet called wait(). The zombie holds a slot in the process table (PID, exit status) but consumes no CPU and almost no memory.

# Find zombies
ps aux | awk '$8 == "Z"'

# See who the parent is
ps -o pid,ppid,stat,comm -p $(ps aux | awk '$8 == "Z" {print $2}')

You cannot kill a zombie. It is already dead. You kill its parent (or fix the parent so it reaps children properly). When the parent dies, the zombie is adopted by PID 1, which reaps it.

Orphans

An orphan is a running process whose parent has died. The kernel reparents orphans to PID 1 (init/systemd), which will eventually reap them when they exit.

Orphans are not inherently bad, but they can indicate:

  • A supervisor crashed without stopping its children
  • A script spawned background processes and exited
  • A container init process is not handling adoption
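Reparenting is easy to watch: launch a background process from a subshell, let the subshell exit immediately, then read the orphan's new PPid from /proc. It is typically 1, though on some systems a subreaper (such as a user systemd instance) adopts it instead:

```shell
# The subshell is the sleep's parent and exits right after echoing,
# so the kernel reparents the orphaned sleep
pid=$( (sleep 2 >/dev/null 2>&1 & echo $!) )
sleep 0.2                                        # give reparenting a moment
awk '/^PPid:/ {print $2}' "/proc/$pid/status"    # typically prints 1
```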


Job Control

Job control lets you manage processes from a shell session.

# Run in background
long_task &

# List jobs
jobs -l

# Suspend foreground process
# Press Ctrl+Z  (sends SIGTSTP)

# Resume in background
bg %1

# Bring to foreground
fg %1

# Kill by job number
kill %1

nohup — Surviving Logout

When you close a terminal, the kernel sends SIGHUP to the controlling shell, and the shell in turn sends SIGHUP to its jobs. By default, they die.

# This survives logout
nohup ./long_script.sh > /var/log/script.log 2>&1 &

# Modern alternative: use tmux or screen
tmux new-session -d -s mytask './long_script.sh'

# Or use systemd for anything that should be permanent
systemd-run --unit=my-task --remain-after-exit /path/to/script.sh
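The nohup effect is verifiable: it sets SIGHUP to "ignore" before exec'ing the command, so a later HUP is simply dropped (a sketch, using sleep as a stand-in for a real task):

```shell
nohup sleep 30 >/dev/null 2>&1 &
pid=$!
sleep 0.5                              # let nohup exec the command
kill -HUP "$pid"                       # dropped: SIGHUP disposition is SIG_IGN
kill -0 "$pid" && echo "still alive"   # prints: still alive
kill "$pid"                            # tidy up with SIGTERM
```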

The /proc Filesystem

/proc is a virtual filesystem. It does not exist on disk. It is the kernel exposing process state as files.

Per-Process Information

# Command line that started the process
cat /proc/1234/cmdline | tr '\0' ' '

# Environment variables
cat /proc/1234/environ | tr '\0' '\n'

# Current working directory
readlink /proc/1234/cwd

# Executable path
readlink /proc/1234/exe

# Open file descriptors
ls -la /proc/1234/fd/

# File descriptor count
ls /proc/1234/fd/ | wc -l

# Memory map
cat /proc/1234/maps

# Memory usage summary
cat /proc/1234/status | grep -E 'VmSize|VmRSS|VmSwap|Threads'

# Process state
cat /proc/1234/stat | awk '{print $3}'

# Network connections (per process)
cat /proc/1234/net/tcp

# Limits
cat /proc/1234/limits

# What the process is waiting on (kernel function)
cat /proc/1234/wchan
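These one-liners compose into a tiny triage helper; `procsum` is a made-up name, not a standard tool:

```shell
# One-shot summary of a PID, straight from /proc
procsum() {
  local pid=$1
  printf 'cmd:   %s\n' "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
  printf 'cwd:   %s\n' "$(readlink "/proc/$pid/cwd")"
  printf 'state: %s\n' "$(awk '/^State:/ {print $2, $3}' "/proc/$pid/status")"
  printf 'fds:   %s\n' "$(ls "/proc/$pid/fd" | wc -l)"
}

procsum $$    # summarize the current shell
```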

System-Wide Information

# CPU info
cat /proc/cpuinfo | grep "model name" | head -1

# Memory
cat /proc/meminfo | head -10

# Load average
cat /proc/loadavg

# Uptime
cat /proc/uptime

# All mounted filesystems
cat /proc/mounts

# Kernel parameters
cat /proc/sys/kernel/pid_max
cat /proc/sys/fs/file-max
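pid_max is worth knowing because zombies and fork bombs consume PID slots. A rough headroom check (a sketch; threads occupy PID space too, so counting /proc entries undercounts):

```shell
# Rough PID headroom: /proc entries vs. the kernel's ceiling
used=$(ls -d /proc/[0-9]* | wc -l)
max=$(cat /proc/sys/kernel/pid_max)
echo "$used of $max PIDs in use"
```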

Process Trees

Understanding parent-child relationships is critical for debugging:

# Full process tree
pstree -p

# Process tree for a specific PID
pstree -p 1234

# Show process tree with threads
pstree -pt 1234

# ps with hierarchy
ps auxf

# Find all descendants of a process
ps --ppid 1234 --forest
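When ps is unavailable, the same ancestry can be walked recursively from /proc alone (a sketch; `descendants` is a made-up helper name):

```shell
# Print every descendant PID of $1, depth-first, by matching PPid:
descendants() {
  local d child
  for d in /proc/[0-9]*; do
    child=${d#/proc/}
    if [ "$(awk '/^PPid:/ {print $2}' "$d/status" 2>/dev/null)" = "$1" ]; then
      echo "$child"
      descendants "$child"
    fi
  done
}

descendants $$    # everything spawned under the current shell
```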

In containers, PID 1 is the entrypoint. If PID 1 is not a proper init (does not reap zombies, does not forward signals), you get zombie accumulation. This is why tini and dumb-init exist:

# Use tini as PID 1 in containers
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["python", "app.py"]

Process Resource Limits

Every process has resource limits (ulimits):

# View limits for current shell
ulimit -a

# View limits for a running process
cat /proc/1234/limits

# Key limits:
# Max open files:       ulimit -n
# Max processes:        ulimit -u
# Max memory (KB):      ulimit -v
# Core file size:       ulimit -c

Common production issues:

  • Too many open files: increase the nofile limit
  • Cannot fork: hit the max processes limit (nproc)
  • No core dump generated: core file size is 0

Set in /etc/security/limits.conf or systemd unit files:

# /etc/security/limits.conf
appuser  soft  nofile  65536
appuser  hard  nofile  65536

# systemd unit
[Service]
LimitNOFILE=65536
LimitNPROC=4096
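The soft/hard distinction is visible in any shell: an unprivileged process may freely move its soft limit anywhere up to the hard limit, and children inherit the result. A subshell keeps the experiment contained:

```shell
# The subshell lowers its own soft nofile limit; the parent is untouched
( ulimit -Sn 256; echo "subshell soft nofile: $(ulimit -Sn)" )
echo "parent soft nofile:   $(ulimit -Sn)"
```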

Key Takeaways

  1. Every process starts with fork/exec and ends with exit/wait — there are no shortcuts
  2. Always SIGTERM before SIGKILL — give processes a chance to clean up
  3. D-state processes cannot be killed — fix the underlying I/O problem
  4. Zombies are not the problem — the parent that is not reaping them is the problem
  5. /proc is the source of truth for process state — learn to read it directly
  6. Containers need a proper init process (tini, dumb-init) or zombies accumulate
  7. Resource limits (ulimits) cause silent failures — check them early in any debugging session
