Runbook: Zombie Processes Accumulating
| Field | Value |
|---|---|
| Domain | Linux |
| Alert | node_processes_state{state="Z"} > 10 sustained |
| Severity | P3 |
| Est. Resolution Time | 20-40 minutes |
| Escalation Timeout | 60 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo access |
Quick Assessment (30 seconds)
If the zombie count (first command in Step 1) shows: Fewer than 10 zombie processes AND system otherwise stable → Low urgency; schedule an investigation rather than an emergency response.
If the zombie count shows: Dozens or hundreds of zombies → The parent process is leaking children rapidly; continue urgently from Step 1.
Step 1: Count Zombie Processes and Check System Health
Why: The number of zombies tells you severity. A handful (< 10) has no real system impact. Hundreds can exhaust the process table and prevent any new processes from starting.
# Count total zombie processes (STAT can carry modifiers such as Zs or Z+, so match the prefix)
ps aux | awk '$8 ~ /^Z/' | wc -l
# Show all zombie processes with their PID and PPID
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# Check if PID limit is being approached
cat /proc/sys/kernel/pid_max
# Count tasks in use (threads consume PIDs too; --no-headers drops the header line)
ps -eL --no-headers | wc -l
# Check system load (zombies themselves use no CPU, but their parent may)
uptime
If ps itself hangs or is very slow, the system is under extreme process pressure. Run cat /proc/loadavg instead for a quick load check.
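To put the two numbers from the PID-limit check side by side, a small helper can express usage as a percentage. This is a minimal sketch: the function name `pid_usage_pct` is hypothetical, and counting threads with `ps -eL` is an assumption about how you want to measure usage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how much of the PID space is in use, as a whole percent.
# Arguments: <tasks_in_use> <pid_max>
pid_usage_pct() {
  local used=$1 max=$2
  # Integer arithmetic is precise enough for a triage decision.
  echo $(( used * 100 / max ))
}

# Example invocation on a live node (threads consume PIDs too, hence -L):
#   pid_usage_pct "$(ps -eL --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

A result in the single digits suggests the process table is not the immediate risk, even if the zombie count looks alarming.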
Step 2: Identify Parent PIDs
Why: Zombies cannot be killed because they are already dead. They disappear only when their parent collects their exit status with the wait() system call, or when the parent exits and init adopts and reaps them. Either way, you must find the parent first.
# For each zombie, find its parent PID (PPID)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print $2}' | sort -u
# Look up what process owns each PPID
ps -p <PPID> -o pid,ppid,stat,comm,args
# Show full process tree to understand the parent-child relationship
pstree -p <PPID>
# Or use /proc to look up the parent (anchored pattern avoids matching TracerPid and similar fields)
grep -E "^(Name|State|Pid|PPid):" /proc/<ZOMBIE_PID>/status
# The parent process running, accumulating zombie children:
PID PPID STAT COMMAND
5678 1234 S my-worker-daemon
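When several parents are involved, it helps to rank them by how many zombies each one holds. The following is a minimal sketch; the function name `zombies_by_parent` is hypothetical, and it assumes input in the `pid,ppid,stat,comm` column order used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -eo pid,ppid,stat,comm` output on stdin and
# print "<zombie_count> <ppid>" per parent, worst offender first.
zombies_by_parent() {
  awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Live usage:
#   ps -eo pid,ppid,stat,comm | zombies_by_parent
```

The first line of output is the parent to investigate in Step 3.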
Step 3: Check Parent Process Status
Why: The parent may be stuck (hung, deadlocked, or CPU-starved) and unable to call wait() to reap its children. Understanding the parent's state guides the fix.
# Check parent process state
ps -p <PPID> -o pid,stat,wchan,comm,args
# Check if parent is in uninterruptible sleep (D state); this is a problem
grep '^State:' /proc/<PPID>/status
# If it is sleeping, see which kernel function it is waiting in
cat /proc/<PPID>/wchan
# Count parent's open file descriptors (a very large number can indicate a leak)
sudo ls /proc/<PPID>/fd | wc -l
# Check parent logs for errors
sudo journalctl -u <SERVICE_NAME_FOR_PPID> --since "30 minutes ago" | tail -50
# Healthy parent: in S (sleeping) or R (running) state, not D (uninterruptible)
PID STAT WCHAN COMMAND
5678 S poll_schedule my-worker-daemon
If the parent is in D (uninterruptible sleep), it is waiting on I/O or a kernel resource and cannot reap children. This is more serious: check for disk I/O issues (iostat -x 1 5) or kernel deadlocks (sudo dmesg | tail -30).
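The state check above can be condensed into a small triage helper. This is a sketch only: the function name `parent_state_hint` and its messages are hypothetical, and the stopped (T) and zombie (Z) branches go slightly beyond the cases this runbook describes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a ps STAT value (e.g. "Ssl", "D", "T") to a triage hint.
parent_state_hint() {
  local s=${1//[[:space:]]/}   # drop any padding from ps output
  case "${s:0:1}" in
    R|S) echo "healthy: parent is schedulable, try SIGCHLD (Step 4)" ;;
    D)   echo "blocked: uninterruptible sleep, check disk I/O and dmesg" ;;
    T|t) echo "stopped: parent is paused and cannot reap until resumed" ;;
    Z)   echo "dead: parent is itself a zombie, inspect its parent instead" ;;
    *)   echo "unknown state: $s" ;;
  esac
}

# Live usage:
#   parent_state_hint "$(ps -o stat= -p <PPID>)"
```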
Step 4: Send SIGCHLD to Parent to Trigger Reaping
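Why: SIGCHLD is the signal that tells a parent process one of its children has changed state. Sending it manually can prompt a well-written parent to call wait() and clean up zombie children. This only helps if the parent has installed a SIGCHLD handler; the default action for SIGCHLD is to ignore it.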
# Send SIGCHLD to the parent process
sudo kill -SIGCHLD <PPID>
# Wait a few seconds and check if zombies decreased
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l
# If the parent is a well-written daemon, zombies should decrease
# Send again if count dropped — indicates SIGCHLD is working but there are many children
sudo kill -SIGCHLD <PPID>
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l
# Zombie count should decrease after each SIGCHLD
Before: 45 zombies
After SIGCHLD: 12 zombies
After second SIGCHLD: 0 zombies
If the zombie count does not decrease, the parent is not calling wait() when it receives the signal. Proceed to restart the parent in Step 5.
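The decision made after each SIGCHLD can be written down explicitly. The helper below is a minimal sketch; the name `next_action` and its three output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide the next step from zombie counts taken
# before and after sending SIGCHLD to the parent.
next_action() {
  local before=$1 after=$2
  if [ "$after" -eq 0 ]; then
    echo "done"              # all zombies reaped
  elif [ "$after" -lt "$before" ]; then
    echo "resend"            # reaping works, but more children remain
  else
    echo "restart-parent"    # no progress: continue with Step 5
  fi
}
```

With the example counts above, `next_action 45 12` suggests sending the signal again, and `next_action 12 0` reports that reaping is complete.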
Step 5: If Parent Is Buggy, Restart It
Why: A parent that never reaps its children has a bug. Restarting it orphans all of its children, including the zombies. They are then reparented to PID 1 (init/systemd) or to the nearest subreaper, which normally reaps them.
# Check if the service is managed by systemd
systemctl status <SERVICE_NAME>
# Gracefully restart via systemd (preferred — handles dependency ordering)
sudo systemctl restart <SERVICE_NAME>
# Verify zombies were cleaned up after parent restart
sleep 5
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l
# If not managed by systemd, stop the process gracefully
sudo kill -SIGTERM <PPID>
sleep 10
ps -p <PPID> # Should be gone; its supervisor restarts it if one is configured, otherwise start it manually
If the restart fails or zombies persist, run kubectl drain <NODE_NAME> --ignore-daemonsets or shift load away from this node.
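To find the `<SERVICE_NAME>` for a given parent PID, one option is to read the unit name out of the process's cgroup path. This is a sketch under the assumption that the service runs in a systemd-style cgroup whose path ends in a `.service` component; the function name `unit_from_cgroup` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the systemd unit name from /proc/<PID>/cgroup
# content on stdin. Prints nothing if no .service component is present.
unit_from_cgroup() {
  # Each path component ending in ".service" is a candidate; the last one
  # is the innermost unit that actually contains the process.
  grep -oE '[^/]+\.service' | tail -n 1
}

# Live usage:
#   unit_from_cgroup < /proc/<PPID>/cgroup
```

`systemctl status <PPID>` should report the same unit; the helper is mainly useful in scripts.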
Step 6: If Systemd Process, Check the Unit File
Why: If the parent is a systemd-managed service, incorrect KillMode or ExecStop settings can cause systemd to stop only the main process and leave its children behind.
# Check the unit file for child process handling
sudo systemctl cat <SERVICE_NAME>
# Key settings to check:
# KillMode=control-group (recommended — kills all processes in the cgroup)
# KillMode=process (only kills main process — may leave children)
# TimeoutStopSec=<N> (how long systemd waits for graceful stop)
# Check if the cgroup still has processes after stop
sudo systemctl stop <SERVICE_NAME>
sudo systemd-cgls /system.slice/<SERVICE_NAME>.service
# Fix KillMode if needed (create an override)
sudo systemctl edit <SERVICE_NAME>
# Add:
# [Service]
# KillMode=control-group
# After fixing KillMode and restarting:
# systemctl stop <SERVICE_NAME> should kill all child processes
# No zombies should remain after stop
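The KillMode check can also be scripted. The sketch below parses the `KillMode=<value>` line that `systemctl show -p KillMode <SERVICE_NAME>` prints; the function name `check_killmode` and its messages are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one "KillMode=<value>" line on stdin and flag
# modes that can leave child processes running after the service stops.
check_killmode() {
  local line mode
  read -r line
  mode=${line#KillMode=}
  case "$mode" in
    control-group) echo "ok: $mode" ;;
    mixed)         echo "ok: $mode (SIGTERM to main process, SIGKILL to the rest)" ;;
    process|none)  echo "warn: $mode may leave child processes running" ;;
    *)             echo "unknown: $mode" ;;
  esac
}

# Live usage:
#   systemctl show -p KillMode <SERVICE_NAME> | check_killmode
```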
Verification
Success looks like: Zombie count is 0 or below 5 (a tiny number of transient zombies is normal on a busy system). Alert has cleared.
If still broken: Escalate (see below).
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 60 min | Application team on-call | "Zombie processes accumulating from parent PID |
| Data loss suspected | Application team lead | "Parent process |
| Scope expanding | SRE lead | "Zombie accumulation pattern affecting multiple services/nodes, possible kernel or init system regression" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes
- Trying to kill zombie processes directly: This never works. A zombie has already exited and released its memory and file descriptors; what remains is a process table entry holding the exit status until the parent collects it. Only the parent (or init, after the parent exits) can remove it.
- Not finding the parent: Sending signals to the zombie PID itself does nothing. The entire fix lives with the parent process. Always identify the PPID before taking action.
- Overreacting to low zombie counts (< 10 is often fine): A single-digit zombie count is not worth an emergency response, because transient zombies appear and disappear on healthy systems. Page only when the count is growing rapidly or approaching hundreds.
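The first mistake is easy to reproduce safely. The demo below is an illustrative sketch, not part of the original runbook: it creates one short-lived zombie, sends it SIGKILL, and prints its state before and after. The function name `demo_zombie_survives_kill` is hypothetical, and the script assumes bash and a mounted /proc.

```shell
#!/usr/bin/env bash
# Illustrative demo: create one zombie, SIGKILL it, and show that its state
# does not change. Prints the state letter before and after, one per line.
demo_zombie_survives_kill() {
  local pidfile parent zombie
  pidfile=$(mktemp)
  # The subshell records its own PID, lingers briefly, then exits.
  # Meanwhile the outer bash execs into `sleep`, which never calls wait(),
  # so the exited subshell stays a zombie until `sleep` itself ends.
  bash -c '(echo "$BASHPID" > "$1"; sleep 0.3) & exec sleep 2' _ "$pidfile" &
  parent=$!
  sleep 1
  zombie=$(cat "$pidfile")
  rm -f "$pidfile"
  awk '/^State:/ {print $2}' "/proc/$zombie/status"   # state before SIGKILL
  kill -9 "$zombie"
  sleep 0.2
  awk '/^State:/ {print $2}' "/proc/$zombie/status"   # state after SIGKILL
  wait "$parent"   # once the parent exits, the zombie is reparented and can be reaped
}

demo_zombie_survives_kill
```

The state letter is read straight from /proc/<PID>/status, so the demo does not depend on any particular ps implementation.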
Cross-References
- Topic Pack: Linux Process Management (deep background)
- Related Runbook: Systemd Service Crash Loop
Wiki Navigation
Related Content
- Linux Processes Flashcards (CLI) (flashcard_deck, L1) — Process Management
- Linux Signals & Process Control (Topic Pack, L1) — Linux Signals & Process Control
- Process Management (Topic Pack, L1) — Process Management
- Runbook: High CPU (Runaway Process) (Runbook, L1) — Process Management