Runbook: Zombie Processes Accumulating

| Field | Value |
|---|---|
| Domain | Linux |
| Alert | node_processes_state{state="Z"} > 10 sustained |
| Severity | P3 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 60 minutes (page if not resolved) |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo access |

Quick Assessment (30 seconds)

# Run this first: it shows the scope of the problem
# (filter on the STAT column; a plain "grep Z" also matches any command or user containing a Z)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
If output shows: Fewer than 10 zombie processes AND system otherwise stable → Low urgency, schedule investigation rather than emergency response
If output shows: Dozens or hundreds of zombies → Parent process is leaking children rapidly, continue urgently from Step 1
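The triage rule above can be sketched as two small shell helpers. This is a minimal sketch rather than part of the tested runbook: `count_zombies` and `triage` are hypothetical names, and the cut-off of 10 is an assumption borrowed from the alert threshold.

```shell
# Count zombie rows in `ps -eo pid,ppid,stat,comm` output read from stdin.
# STAT can carry modifiers (Z+, Zs, ...), so match on the first letter only.
count_zombies() {
    awk '$3 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Map a zombie count to the two urgency levels used in this runbook.
# The boundary of 10 is an assumption taken from the alert threshold.
triage() {
    if [ "$1" -lt 10 ]; then
        echo "low: schedule investigation"
    else
        echo "urgent: continue from Step 1"
    fi
}
```

On a live node this would be called as `triage "$(ps -eo pid,ppid,stat,comm | count_zombies)"`.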

Step 1: Count Zombie Processes and Check System Health

Why: The number of zombies tells you severity. A handful (< 10) has no real system impact. A sustained leak can eventually exhaust the PID space (kernel.pid_max) or a per-user or per-cgroup process limit, after which no new processes can start.

# Count total zombie processes (STAT may be Z, Z+, Zs, ..., so match the first letter)
ps aux | awk '$8 ~ /^Z/' | wc -l

# Show all zombie processes with their PID and PPID
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

# Check if the PID limit is being approached (every thread consumes a PID, so count threads)
cat /proc/sys/kernel/pid_max
ps -eL --no-headers | wc -l

# Check system load (zombies themselves use no CPU, but their parent may)
uptime
Expected output:
# Each zombie line looks like:
PID    PPID  STAT  COMMAND
12345  5678   Z    defunct
12346  5678   Z    defunct
If this fails: If ps itself hangs or is very slow, the system is under extreme process pressure. Run cat /proc/loadavg instead for a quick load check.
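If ps is unusable, zombies can still be counted by reading /proc directly. The following is a minimal sketch under that assumption; `stat_state` and `count_zombies_proc` are hypothetical helper names, and the optional directory argument exists only so the logic can be exercised against a copy of the files.

```shell
# Extract the state letter from the contents of a /proc/<pid>/stat file.
# The command name sits in parentheses and may itself contain spaces or ")",
# so strip everything up to the LAST ") " before reading the next field.
stat_state() {
    local rest
    rest=${1##*') '}
    echo "${rest%% *}"
}

# Count zombies under a proc-style directory (default: /proc).
count_zombies_proc() {
    local root n f line
    root=${1:-/proc}
    n=0
    for f in "$root"/[0-9]*/stat; do
        # A process can exit between the glob and the read; skip it.
        line=$(cat "$f" 2>/dev/null) || continue
        if [ "$(stat_state "$line")" = "Z" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Calling `count_zombies_proc` with no argument scans the live /proc tree.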

Step 2: Identify Parent PIDs

Why: Zombies cannot be killed because they are already dead. A zombie disappears only when its parent collects it with the wait() system call, or when the parent exits and PID 1 reaps it. Either way, you must find the parent first.

# For each zombie, find its parent PID (PPID)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print $2}' | sort -u

# Look up what process owns each PPID
ps -p <PPID> -o pid,ppid,stat,comm,args

# Show full process tree to understand the parent-child relationship
pstree -p <PPID>

# Or read the zombie's parent PID directly from /proc
grep -E '^(Name|State|Pid|PPid):' /proc/<ZOMBIE_PID>/status
Expected output:
# The parent process is still running and accumulating zombie children:
PID   PPID  STAT  COMMAND
5678  1234   S    my-worker-daemon
If this fails: If the parent PID is 1 (init/systemd), the original parent has already exited and systemd has been assigned as foster parent. Systemd should reap these automatically — if it is not, check for a systemd bug or resource exhaustion.
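When several parents are involved, it helps to rank them by how many zombies each one holds. A minimal sketch, assuming the same four-column ps output used in this step; `zombies_by_parent` is a hypothetical helper name.

```shell
# Read `ps -eo pid,ppid,stat,comm` output on stdin and print
# "<zombie_count> <ppid>" per parent, busiest parent first.
zombies_by_parent() {
    awk '$3 ~ /^Z/ { count[$2]++ }
         END { for (p in count) print count[p], p }' | sort -k1,1nr -k2,2n
}
```

On a live node: `ps -eo pid,ppid,stat,comm | zombies_by_parent`. The first line names the parent to investigate first.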

Step 3: Check Parent Process Status

Why: The parent may be stuck (hung, deadlocked, or CPU-starved) and unable to call wait() to reap its children. Understanding the parent's state guides the fix.

# Check parent process state
ps -p <PPID> -o pid,stat,wchan,comm,args

# If STAT shows D (uninterruptible sleep), see which kernel function the parent is blocked in
sudo cat /proc/<PPID>/wchan; echo

# Count the parent's open file descriptors (a very large number can indicate a leak)
sudo ls /proc/<PPID>/fd | wc -l

# Check parent logs for errors
sudo journalctl -u <SERVICE_NAME_FOR_PPID> --since "30 minutes ago" | tail -50
Expected output:
# Healthy parent: in S (sleeping) or R (running) state, not D (uninterruptible)
PID   STAT  WCHAN            COMMAND
5678  S     poll_schedule    my-worker-daemon
If this fails: If the parent is in D (uninterruptible sleep), it is waiting on I/O or a kernel resource and cannot reap children. This is more serious — check for disk I/O issues (iostat -x 1 5) or kernel deadlocks (sudo dmesg | tail -30).
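The decision this step leads to can be written down as a small lookup from the parent's STAT value to the next action. This is a sketch only: `parent_state_hint` is a hypothetical name, and the T (stopped) branch is an addition not covered above, included because a stopped process cannot run its signal handlers until it is continued.

```shell
# Map a ps STAT value (for example "Ssl", "D", "T") to a suggested next action.
# Only the first letter matters; the rest are modifiers.
parent_state_hint() {
    case $1 in
        D*)    echo "blocked in kernel: check iostat and dmesg" ;;
        T*|t*) echo "stopped: cannot reap until it receives SIGCONT" ;;
        R*|S*) echo "schedulable: try SIGCHLD (Step 4)" ;;
        *)     echo "unknown state: inspect manually" ;;
    esac
}
```

For example, `parent_state_hint "$(ps -o stat= -p <PPID>)"` prints the hint for a live parent.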

Step 4: Send SIGCHLD to Parent to Trigger Reaping

Why: SIGCHLD is the signal that tells a parent process one of its children has changed state. Sending it manually can prompt a well-written parent to call wait() and clean up zombie children.

# Send SIGCHLD to the parent process
sudo kill -SIGCHLD <PPID>

# Wait a few seconds and check if zombies decreased
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l

# If the parent is a well-written daemon, zombies should decrease
# Send again if count dropped — indicates SIGCHLD is working but there are many children
sudo kill -SIGCHLD <PPID>
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l
Expected output:
# Zombie count should decrease after each SIGCHLD
Before: 45 zombies
After SIGCHLD: 12 zombies
After second SIGCHLD: 0 zombies
If this fails: If SIGCHLD has no effect, the parent process has a bug — it is not calling wait() when it receives the signal. Proceed to restart the parent in Step 5.
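The send-and-recheck cycle can be wrapped in a loop that stops as soon as the count no longer drops. This is a minimal sketch, not a tested part of the runbook: the function names, the default of 5 attempts, and the SLEEP_SECS variable are all assumptions, and `--ppid` is a procps-ng option that other ps implementations may lack. Run it as root if the parent belongs to another user.

```shell
# Count zombie children of one parent (procps-ng ps assumed).
count_zombie_children() {
    ps -o stat= --ppid "$1" | awk '/^Z/ { n++ } END { print n + 0 }'
}

# Send SIGCHLD to one parent.
nudge_parent() {
    kill -s CHLD "$1"
}

# Nudge the parent while the zombie count keeps dropping.
# Returns 0 once no zombie children remain, 1 if progress stalls
# or the attempt limit is reached with zombies still present.
reap_loop() {
    local ppid max_tries prev now tries
    ppid=$1
    max_tries=${2:-5}
    tries=0
    prev=$(count_zombie_children "$ppid")
    while [ "$prev" -gt 0 ] && [ "$tries" -lt "$max_tries" ]; do
        nudge_parent "$ppid"
        sleep "${SLEEP_SECS:-5}"
        now=$(count_zombie_children "$ppid")
        echo "zombie children of $ppid: $prev -> $now"
        if [ "$now" -ge "$prev" ]; then
            return 1    # no progress: restart the parent (Step 5)
        fi
        prev=$now
        tries=$((tries + 1))
    done
    [ "$prev" -eq 0 ]
}
```

A non-zero exit status from `reap_loop <PPID>` means the signal is not helping and a restart is the next step.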

Step 5: If Parent Is Buggy — Restart It

Why: A parent that refuses to reap its children has a bug. Restarting it causes all its children (including zombies) to be reparented to PID 1 (init/systemd), which will reap them.

# Check if the service is managed by systemd
systemctl status <SERVICE_NAME>

# Gracefully restart via systemd (preferred — handles dependency ordering)
sudo systemctl restart <SERVICE_NAME>

# Verify zombies were cleaned up after parent restart
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l

# If not managed by systemd, stop the process gracefully
sudo kill -SIGTERM <PPID>
sleep 10
ps -p <PPID>   # Should list no process; start it again with whatever launches it (supervisor, init script)
Expected output:
# After restart: 0 zombie processes
0
If this fails: If the service is not replicated and a restart would cause an outage, coordinate a maintenance window with the application team. For a replicated service, drain traffic before restarting: on a Kubernetes node run kubectl drain <NODE_NAME> --ignore-daemonsets, otherwise shift load away from this node.

Step 6: If Systemd Process — Check the Unit File

Why: If the parent is a systemd-managed service, incorrect KillMode or ExecStop settings can prevent systemd from properly waiting on child processes.

# Check the unit file for child process handling
sudo systemctl cat <SERVICE_NAME>

# Key settings to check:
# KillMode=control-group  (recommended — kills all processes in the cgroup)
# KillMode=process        (only kills main process — may leave children)
# TimeoutStopSec=<N>      (how long systemd waits for graceful stop)

# Check if the cgroup still has processes after stop
sudo systemctl stop <SERVICE_NAME>
sudo systemd-cgls /system.slice/<SERVICE_NAME>.service

# Fix KillMode if needed (create an override)
sudo systemctl edit <SERVICE_NAME>
# Add:
# [Service]
# KillMode=control-group
Expected output:
# After fixing KillMode and restarting:
# systemctl stop <SERVICE_NAME> should kill all child processes
# No zombies should remain after stop
If this fails: If the service design intentionally forks long-running children that should outlive the parent, this is an architectural issue. The service needs to be rewritten to track and wait on its own children, or to use a proper subreaper.
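When a unit has drop-in overrides, the KillMode that applies is the last one assigned, and systemd falls back to control-group when none is set. The sketch below applies that rule to the text printed by systemctl cat; `effective_killmode` is a hypothetical helper, and it deliberately ignores section headers, so treat its answer as a hint rather than the final word.

```shell
# Print the effective KillMode from unit-file text on stdin.
# Later assignments override earlier ones; an empty assignment resets
# to the default, which is control-group.
effective_killmode() {
    awk -F= '/^[ \t]*KillMode[ \t]*=/ { gsub(/[ \t]/, "", $2); mode = $2 }
             END { print (mode == "" ? "control-group" : mode) }'
}
```

Usage: `sudo systemctl cat <SERVICE_NAME> | effective_killmode`. The value systemd itself has loaded can be read with `systemctl show -p KillMode --value <SERVICE_NAME>`.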

Verification

# Confirm the issue is resolved (match Z, Z+, Zs, ... in the STAT column)
ps aux | awk '$8 ~ /^Z/' | wc -l
Success looks like: Zombie count is 0 or below 5 (a tiny number of transient zombies is normal on a busy system). Alert has cleared.
If still broken: Escalate (see below).

Escalation

| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 60 min | Application team on-call | "Zombie processes accumulating from parent PID (), process table at risk of exhaustion, service restart has not resolved" |
| Data loss suspected | Application team lead | "Parent process killed to clear zombies, in-flight work may have been lost" |
| Scope expanding | SRE lead | "Zombie accumulation pattern affecting multiple services/nodes, possible kernel or init system regression" |

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Trying to kill zombie processes directly: This never works. A zombie has already exited and released its memory and file descriptors, so there is nothing left for a signal to act on, not even SIGKILL. What remains is a process-table entry holding the exit status until the parent collects it. Only the parent (or init after the parent exits) can remove it.
  2. Not finding the parent: Sending signals to the zombie PID itself does nothing. The entire fix lives with the parent process. Always identify the PPID before taking action.
  3. Overreacting to low zombie counts (< 10 is often fine): A single-digit zombie count is not worth an emergency response, because transient zombies appear and disappear on healthy systems. Page only when the count is growing rapidly or approaching hundreds.
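The first two mistakes can be reproduced safely on a test machine. The sketch below is an illustration under stated assumptions, not a production tool: it needs a Linux /proc, a sleep that accepts fractional seconds, and an environment where SIGCHLD is not set to be ignored. `proc_state` and `zombie_demo` are hypothetical names. The trick is that the inner shell forks a child and then replaces itself with a sleep that never calls wait(), so the child is left as a zombie.

```shell
# Print the state letter of a PID from /proc, or "gone" if it no longer exists.
proc_state() {
    local line
    line=$(cat "/proc/$1/stat" 2>/dev/null) || { echo gone; return; }
    line=${line##*') '}     # drop "pid (comm) "; comm may contain spaces
    echo "${line%% *}"
}

zombie_demo() {
    local pidfile parent child
    pidfile=$(mktemp)
    # Fork a short-lived child, record its PID, then exec into a parent
    # that never reaps it.
    sh -c 'sleep 0.2 & echo $! > "$0"; exec sleep 30' "$pidfile" &
    parent=$!
    sleep 1
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    echo "child state: $(proc_state "$child")"
    kill -9 "$child"        # signalling the zombie itself changes nothing
    sleep 1
    echo "after kill -9: $(proc_state "$child")"
    kill "$parent"          # removing the parent hands the zombie to PID 1
    wait "$parent" 2>/dev/null
    sleep 1
    echo "after parent exit: $(proc_state "$child")"
}
```

Running `zombie_demo` should report the child in state Z both before and after kill -9. The last line depends on PID 1 (or a subreaper) reaping orphans, which may not happen inside a container whose PID 1 is an ordinary application.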
