Level: L2: Operations | Topics: Process Management, Linux Signals & Process Control | Domain: Linux

Runbook: Zombie Processes Accumulating

Field	Value
Domain	Linux
Alert	`node_processes_state{state="Z"} > 10` sustained
Severity	P3
Est. Resolution Time	20-40 minutes
Escalation Timeout	60 minutes — page if not resolved
Last Tested	2026-03-19
Prerequisites	SSH access to the node, sudo access

Quick Assessment (30 seconds)

# Run this first: it shows the scope of the problem
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

If output shows: Fewer than 10 zombie processes AND the system is otherwise stable → Low urgency; schedule an investigation rather than an emergency response.
If output shows: Dozens or hundreds of zombies → The parent process is leaking children rapidly; continue urgently from Step 1.

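Step 1: Count Zombie Processes and Check System Health

Why: The number of zombies tells you severity. A handful (< 10) has no real system impact. Each zombie still holds a PID, so a parent that leaks them steadily can eventually exhaust the PID space (kernel.pid_max) or a per-user or per-cgroup process limit, at which point new processes fail to start.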

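# Count total zombie processes (STAT can be Z, Zs, Z+ and so on, so match the prefix)
ps aux | awk '$8 ~ /^Z/' | wc -l

# Show all zombie processes with their PID and PPID
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

# Check if the PID limit is being approached
# (threads consume PIDs too, so count tasks rather than processes)
cat /proc/sys/kernel/pid_max
ps -eL --no-headers | wc -l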

# Check system load (zombies themselves use no CPU, but their parent may)
uptime

Expected output:

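# Each zombie line looks like this (columns: PID, PPID, STAT, COMMAND).
# awk drops the header row; the command name below is only an example.
12345  5678   Z    my-worker <defunct>
12346  5678   Z    my-worker <defunct>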

If this fails: If ps itself hangs or is very slow, the system is under extreme process pressure. Run cat /proc/loadavg instead for a quick load check.
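
The pid_max comparison above is easier to judge as a percentage. The following is a minimal sketch rather than part of the tested runbook: the function name pid_headroom is made up for illustration, and it assumes procps ps and a readable /proc.

```shell
# pid_headroom USED MAX: print how much of the PID space is in use.
# (Hypothetical helper name; USED and MAX are plain integers.)
pid_headroom() {
    awk -v used="$1" -v max="$2" \
        'BEGIN { printf "%.1f%% of pid_max in use (%d of %d)\n", used * 100 / max, used, max }'
}

# Live values: every thread consumes a PID, so count tasks (-L), not processes.
if [ -r /proc/sys/kernel/pid_max ]; then
    pid_headroom "$(ps -eL --no-headers 2>/dev/null | wc -l)" \
                 "$(cat /proc/sys/kernel/pid_max)"
fi
```

A low percentage with a high zombie count suggests the alert is about a leaking parent, not imminent PID exhaustion.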

Step 2: Identify Parent PIDs

Why: Zombies cannot be killed — they are already dead. The only fix is to make their parent process collect them using the wait() system call. You must find the parent first.

# For each zombie, find its parent PID (PPID)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print $2}' | sort -u

# Look up what process owns each PPID
ps -p <PPID> -o pid,ppid,stat,comm,args

# Show full process tree to understand the parent-child relationship
pstree -p <PPID>

# Or use /proc to look up the parent (anchored so TracerPid is not matched)
grep -E '^(Name|State|Pid|PPid):' /proc/<ZOMBIE_PID>/status

Expected output:

# The parent process is still running and accumulating zombie children:
PID   PPID  STAT  COMMAND
5678  1234   S    my-worker-daemon

If this fails: If the parent PID is 1 (init/systemd), the original parent has already exited and systemd has been assigned as foster parent. Systemd should reap these automatically — if it is not, check for a systemd bug or resource exhaustion.
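
When several parents are involved, it helps to see how many zombies each one holds. This is a small sketch under the assumption that procps ps is available; zombies_by_parent is a hypothetical helper name, not an existing tool.

```shell
# zombies_by_parent: read "PPID STAT" pairs on stdin and print
# "<zombie count> <PPID>", worst offender first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# The trailing "=" suppresses the ps header for each column.
ps -eo ppid=,stat= | zombies_by_parent
```

The parent at the top of the list is the one to inspect first in the next step.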

Step 3: Check Parent Process Status

Why: The parent may be stuck (hung, deadlocked, or CPU-starved) and unable to call wait() to reap its children. Understanding the parent's state guides the fix.

# Check parent process state
ps -p <PPID> -o pid,stat,wchan,comm,args

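# Check if parent is in uninterruptible sleep (D state), which is a problem
grep '^State:' /proc/<PPID>/status

# Show the kernel function the parent is sleeping in (the file has no trailing newline)
sudo cat /proc/<PPID>/wchan; echo

# Check parent's open file descriptors (a very large number can indicate a leak)
sudo ls /proc/<PPID>/fd | wc -l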

# Check parent logs for errors
sudo journalctl -u <SERVICE_NAME_FOR_PPID> --since "30 minutes ago" | tail -50

Expected output:

# Healthy parent: in S (sleeping) or R (running) state, not D (uninterruptible)
PID   STAT  WCHAN            COMMAND
5678  S     poll_schedule    my-worker-daemon

If this fails: If the parent is in D (uninterruptible sleep), it is waiting on I/O or a kernel resource and cannot reap children. This is more serious — check for disk I/O issues (iostat -x 1 5) or kernel deadlocks (sudo dmesg | tail -30).
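
The state check above can be turned into a quick triage helper. This is a sketch only: classify_state is a hypothetical name, and the stopped (T) case is not covered in this step but is added here on the assumption that a stopped parent cannot run its signal handlers either.

```shell
# classify_state STATE: map a process state code (from ps or /proc) to a hint.
classify_state() {
    case "$1" in
        R*|S*|I*) echo "schedulable: parent can run, suspect an application bug" ;;
        D*)       echo "blocked: uninterruptible sleep, check I/O and dmesg" ;;
        T*|t*)    echo "stopped: parent cannot reap until it is resumed" ;;
        *)        echo "unknown: inspect the parent manually" ;;
    esac
}

# Example against a live process; replace $$ with the parent's PID.
classify_state "$(awk '/^State:/ { print $2 }' "/proc/$$/status" 2>/dev/null)"
```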

Step 4: Send SIGCHLD to Parent to Trigger Reaping

Why: SIGCHLD is the signal that tells a parent process one of its children has changed state. Sending it manually can prompt a well-written parent to call wait() and clean up zombie children.

# Send SIGCHLD to the parent process
sudo kill -SIGCHLD <PPID>

# Wait a few seconds and check if zombies decreased
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l

# If the parent is a well-written daemon, zombies should decrease
# Send again if count dropped — indicates SIGCHLD is working but there are many children
sudo kill -SIGCHLD <PPID>
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l

Expected output:

# Zombie count should decrease after each SIGCHLD (example counts; the commands print only the number)
Before: 45 zombies
After SIGCHLD: 12 zombies
After second SIGCHLD: 0 zombies

If this fails: If SIGCHLD has no effect, the parent process has a bug — it is not calling wait() when it receives the signal. Proceed to restart the parent in Step 5.
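
The send-and-recount cycle above can be wrapped in a loop that stops as soon as the count no longer drops. This is a sketch under assumptions: zombie_children and nudge_parent are hypothetical helper names, the 2-second pause is arbitrary, and the caller needs permission to signal the parent (run it as root if the parent belongs to another user).

```shell
# zombie_children PPID: count zombie children of PPID from "PPID STAT" lines on stdin.
zombie_children() {
    awk -v p="$1" '$1 == p && $2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# nudge_parent PPID [TRIES]: send SIGCHLD until no zombies remain.
# Returns non-zero when the count stops dropping or TRIES runs out.
nudge_parent() {
    ppid=$1
    tries=${2:-5}
    before=$(ps -eo ppid=,stat= | zombie_children "$ppid")
    while [ "$tries" -gt 0 ] && [ "$before" -gt 0 ]; do
        kill -s CHLD "$ppid" || return 1
        sleep 2
        after=$(ps -eo ppid=,stat= | zombie_children "$ppid")
        echo "zombies under $ppid: $before -> $after"
        [ "$after" -lt "$before" ] || return 1   # no progress: parent is not reaping
        before=$after
        tries=$((tries - 1))
    done
    [ "$before" -eq 0 ]
}
```

Usage would look like nudge_parent <PPID> 5; a non-zero exit status is the cue to move on to the restart step.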

Step 5: If Parent Is Buggy, Restart It

Why: A parent that never reaps its children has a bug. Stopping it causes all of its children (including zombies) to be reparented to PID 1 (init/systemd) or to the nearest subreaper, which reaps them.

# Check if the service is managed by systemd
systemctl status <SERVICE_NAME>

# Gracefully restart via systemd (preferred — handles dependency ordering)
sudo systemctl restart <SERVICE_NAME>

# Verify zombies were cleaned up after parent restart
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l

# If not managed by systemd, stop the process gracefully, then start it again with its usual launcher
sudo kill -SIGTERM <PPID>
sleep 10
ps -p <PPID>   # Should be gone; if no supervisor restarts it, start it manually

Expected output:

# After restart: 0 zombie processes
0

If this fails: If restarting the service would cause an outage (it is not replicated), coordinate a maintenance window with the application team. For a replicated service, move traffic off this node before restarting: kubectl drain <NODE_NAME> --ignore-daemonsets, or shift load away from this node.
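
For the non-systemd path, a fixed sleep can be replaced by polling until the old parent is really gone. The following is a sketch, assuming a Linux /proc; wait_gone is a hypothetical helper name. It treats a parent that is itself a zombie as gone, because a bare kill -0 check would still succeed on a zombie.

```shell
# wait_gone PID TIMEOUT_SECONDS: succeed once PID has exited
# (no /proc entry, or only a zombie entry left); fail on timeout.
wait_gone() {
    pid=$1
    left=$2
    while [ "$left" -gt 0 ]; do
        state=$(awk '/^State:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
        case "$state" in
            ''|Z) return 0 ;;
        esac
        sleep 1
        left=$((left - 1))
    done
    return 1
}
```

After sudo kill -SIGTERM <PPID>, something like wait_gone <PPID> 30 || echo "parent still running" gives a clear signal before any further action. PIDs can be reused, so keep the timeout short.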

Step 6: If Systemd Process, Check the Unit File

Why: If the parent is a systemd-managed service, an incorrect KillMode or ExecStop setting can stop systemd from terminating and cleaning up all of the service's child processes when the service stops.

# Check the unit file for child process handling
sudo systemctl cat <SERVICE_NAME>

# Key settings to check:
# KillMode=control-group  (the default, recommended: kills all processes in the cgroup)
# KillMode=process        (only kills main process; may leave children)
# TimeoutStopSec=<N>      (how long systemd waits for graceful stop)

# Check if the cgroup still has processes after stop
# (this stops the service; run it only when an interruption is acceptable)
sudo systemctl stop <SERVICE_NAME>
sudo systemd-cgls /system.slice/<SERVICE_NAME>.service

# Fix KillMode if needed (create an override)
sudo systemctl edit <SERVICE_NAME>
# Add:
# [Service]
# KillMode=control-group

Expected output:

# After fixing KillMode and restarting:
# systemctl stop <SERVICE_NAME> should kill all child processes
# No zombies should remain after stop

If this fails: If the service design intentionally forks long-running children that should outlive the parent, this is an architectural issue. The service needs to be rewritten to track and wait on its own children, or to use a proper subreaper.
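
Where an interactive systemctl edit is not practical (for example over automation), the same override can be written as a drop-in file. This is a sketch, not the runbook's tested procedure: write_killmode_override is a hypothetical helper name, and the directory passed to it is expected to be the unit's drop-in directory.

```shell
# write_killmode_override DIR: create DIR/override.conf setting KillMode
# back to control-group. DIR is normally
# /etc/systemd/system/<SERVICE_NAME>.service.d and must be writable.
write_killmode_override() {
    mkdir -p "$1" && cat > "$1/override.conf" <<'EOF'
[Service]
KillMode=control-group
EOF
}
```

After writing the file as root, run sudo systemctl daemon-reload and restart the service; systemctl show -p KillMode <SERVICE_NAME> should then report the effective value.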

Verification

# Confirm the issue is resolved
ps aux | awk '$8 ~ /^Z/' | wc -l

Success looks like: Zombie count is below 5 (a few transient zombies are normal on a busy system) and the alert has cleared.
If still broken: Escalate (see below).
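
The same check can be expressed against the alert threshold so that it returns a pass or fail status. This is a sketch; check_zombies is a hypothetical helper name, and the threshold of 10 is taken from the alert expression at the top of this runbook.

```shell
# check_zombies THRESHOLD: read STAT values on stdin, print the zombie count,
# and exit non-zero when the count is above THRESHOLD.
check_zombies() {
    awk -v limit="$1" '
        $1 ~ /^Z/ { n++ }
        END {
            printf "%d zombies (alert threshold %d)\n", n + 0, limit
            exit (n + 0 > limit)
        }'
}

ps -eo stat= | check_zombies 10 || echo "still above the alert threshold"
```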

Escalation

Condition	Who to Page	What to Say
Not resolved in 60 min	Application team on-call	"Zombie processes accumulating from parent PID (), process table at risk of exhaustion, service restart has not resolved"
Data loss suspected	Application team lead	"Parent process killed to clear zombies, in-flight work may have been lost"
Scope expanding	SRE lead	"Zombie accumulation pattern affecting multiple services/nodes, possible kernel or init system regression"

Post-Incident

Update monitoring if alert was noisy or missing
File postmortem if P1/P2
Update this runbook if steps were wrong or incomplete

Common Mistakes

Trying to kill zombie processes directly: This never works. A zombie has already exited and released its memory and other resources, so there is nothing left for a signal to act on. What remains is a process table entry holding the exit status until the parent collects it. Only the parent (or init, after the parent exits) can remove it.
Not finding the parent: Sending signals to the zombie PID itself does nothing. The entire fix lives with the parent process. Always identify the PPID before taking action.
Treating low zombie counts as an emergency (< 10 is often fine): A single-digit zombie count is not worth an emergency response; transient zombies appear and disappear on healthy systems. Page only when the count is growing rapidly or approaching hundreds.
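
The first mistake can be reproduced safely on a test machine. The sketch below is illustrative only and assumes a Linux /proc and a sleep that accepts fractional seconds; the scratch file and variable names are made up for the demonstration. It creates one short-lived zombie, sends it SIGKILL, and shows that its state does not change.

```shell
# Scratch file (hypothetical) used to pass the child's PID back to this script.
pidfile=$(mktemp)

# The inner shell forks a short-lived child, records its PID, then replaces
# itself with `sleep`, which never calls wait(). The child becomes a zombie.
sh -c 'sleep 0.2 & echo "$!" > "$1"; exec sleep 3' sh "$pidfile" &
parent=$!
sleep 1
zombie=$(cat "$pidfile")

# proc_state PID: print the one-letter state from /proc/PID/status.
proc_state() { awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null; }

echo "before kill -9: $(proc_state "$zombie")"
kill -9 "$zombie"
sleep 0.2
state_after_kill=$(proc_state "$zombie")
echo "after kill -9:  $state_after_kill"

# Ending the parent is what releases the entry: the zombie is reparented
# and reaped by init or a subreaper.
kill "$parent"
wait "$parent" 2>/dev/null || true
rm -f "$pidfile"
```

On a typical Linux host both lines report Z, which is why the fix always targets the parent.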

Cross-References

Topic Pack: Linux Process Management (deep background)
Related Runbook: Systemd Service Crash Loop

Linux Processes Flashcards (CLI) (flashcard_deck, L1) — Process Management
Linux Signals & Process Control (Topic Pack, L1) — Linux Signals & Process Control
Process Management (Topic Pack, L1) — Process Management
Runbook: High CPU (Runaway Process) (Runbook, L1) — Process Management
