Pattern: Zombie Process Accumulation
ID: FP-028 | Family: Silent Corruption | Frequency: Uncommon | Blast Radius: Single Host | Detection Difficulty: Subtle
The Shape
Zombie (defunct) processes accumulate over time when parent processes don't call
waitpid() to reap their exited children. The accumulation is slow and easy to miss:
a zombie uses no CPU and almost no memory, only a PID and a process-table slot.
Over hours or days, the count grows. Eventually, new process creation fails across
the entire host, including unrelated services, SSH sessions, and system commands.
The host appears healthy until it suddenly isn't.
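The mechanism can be reproduced safely with a single zombie. The sketch below is illustrative only: it assumes a Linux host with the procps version of ps (for the --ppid option), and the sleep durations and the variable names parent and child_state are arbitrary choices. A subshell starts a short-lived child and then uses exec to become a long sleep, which never calls wait(), so the exited child should linger in state Z until its parent goes away.

```shell
#!/bin/sh
# Parent: the subshell backgrounds a 1-second child, then exec replaces the
# subshell with "sleep 30", a program that never reaps its children.
( sleep 1 & exec sleep 30 ) &
parent=$!

sleep 2   # let the child exit; its parent will not collect the exit status

# Look up the parent's only child: its STAT value should start with "Z".
child_state=$(ps -o stat= --ppid "$parent")
echo "child state: $child_state"

# Clean up: once the parent exits, the orphaned zombie is handed to init
# (or a subreaper), which is expected to reap it.
kill "$parent"
wait "$parent" 2>/dev/null || true
```

A leaking service does the same thing at scale: every child it starts and never waits for leaves one more entry behind.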
How You'll See It
In Linux/Infrastructure
$ ps aux | grep defunct
user 12345 0.0 0.0 0 0 ? Z 09:15 0:00 [worker] <defunct>
...
$ ps aux | grep defunct | wc -l
4187
$ cat /proc/sys/kernel/pid_max
32768
$ cat /proc/sys/kernel/pid_max; ps aux | wc -l
32768
31900 # approaching ceiling
top starts returning errors. System logs:
fork: retry: Resource temporarily unavailable.
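A raw zombie count does not say which process is leaking them. As a minimal sketch (the function name zombies_by_parent is hypothetical, and the live call assumes a ps that accepts -o ppid= and -o stat=), the zombies can be grouped by parent PID so the worst offender appears first:

```shell
#!/bin/sh
# Count zombie processes per parent PID, largest count first.
# Reads "PPID STAT" pairs on stdin, so it works on live or saved ps output.
# Output format: "<zombie count> <parent pid>".
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Live usage: the empty "=" headers suppress the ps header line.
ps -e -o ppid= -o stat= | zombies_by_parent | head -n 5
```

Matching on the STAT column rather than grepping for "defunct" also avoids counting the grep command itself.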
In Kubernetes
The container runtime (containerd, dockerd) fails to launch new containers. Kubelet
reports: "failed to create containerd task" or "failed to fork." The node appears
Ready but cannot start new pods. Existing pods run fine.
In CI/CD
A build agent processes jobs, and each job spawns subprocesses. After 500 jobs without an agent restart, the PID table is nearly full and new job steps fail with fork errors. The agent appears healthy (it is running and accepting jobs) but cannot execute them.
The Tell
- ps aux | grep defunct | wc -l returns a large number (hundreds to thousands).
- New process creation fails: fork: retry: Resource temporarily unavailable.
- ps aux | wc -l approaches /proc/sys/kernel/pid_max.
- CPU and memory are normal; only new fork/exec operations fail.
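The comparison against pid_max can be turned into a single percentage. This is a rough sketch, not a definitive measure: the function name pid_table_percent is hypothetical, and the live reading counts processes only. On Linux, threads draw from the same ID space, so actual usage may be higher than reported here.

```shell
#!/bin/sh
# Rounded percentage of the PID space in use.
# $1 = IDs in use, $2 = ceiling (kernel.pid_max).
pid_table_percent() {
    echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# Live reading, guarded so the sketch does nothing on hosts without /proc.
if [ -r /proc/sys/kernel/pid_max ]; then
    used=$(ps -e -o pid= | wc -l)
    max=$(cat /proc/sys/kernel/pid_max)
    echo "PID table: $used of $max in use ($(pid_table_percent "$used" "$max")%)"
fi
```

With the figures used in this pattern, pid_table_percent 4187 32768 prints 13 and pid_table_percent 31900 32768 prints 97.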
Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| OOM | PID exhaustion | free -h shows memory available; ps shows zombies |
| Network failure | SSH can't fork handler | Network is fine; SSH handshake completes but login process can't start |
| Random service failures | PID table full | All new process creation fails; not specific to one service |
The Fix (Generic)
- Immediate: Terminate the zombies' parent process. Zombies cannot be killed directly; once the parent exits, they are reparented to init/systemd, which reaps them.
- Short-term: Restart the service that is leaking zombies.
- Long-term: Fix the parent to call waitpid() or set SA_NOCLDWAIT for SIGCHLD; monitor the zombie count with ps aux | grep defunct | wc -l and alert at 100.
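The monitoring step could be wired into a scheduled check along these lines. This is a sketch under stated assumptions: the function name check_zombies is hypothetical, the threshold of 100 comes from the guidance above, and the exit code 2 for the critical state follows the common Nagios-style convention, which may not match the monitoring system in use. It counts by the ps STAT column rather than grepping for "defunct", which can also match the grep command itself.

```shell
#!/bin/sh
# Report OK or CRITICAL for the zombie count found in a ps STAT column.
# stdin = one STAT value per line, $1 = alert threshold.
# Returns 0 when below the threshold and 2 otherwise (assumed convention).
check_zombies() {
    count=$(awk '$1 ~ /^Z/ { n++ } END { print n + 0 }')
    if [ "$count" -ge "$1" ]; then
        echo "CRITICAL: $count zombie processes (threshold $1)"
        return 2
    fi
    echo "OK: $count zombie processes (threshold $1)"
}

# Live usage with the alert threshold suggested in this pattern.
ps -e -o stat= | check_zombies 100
```

Alerting on a low absolute count gives days of warning, since the table fills slowly.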
Real-World Examples
- Example 1: CI/CD agent accumulated 4,187 zombies over 72 hours. Agent was processing aborted jobs and not waiting for child processes to exit. PID table was 13% full. Restarting the agent cleared all zombies in 5 seconds.
- Example 2: Monitoring agent that ran plugin scripts in subprocesses. After 14 days without restart, 8,000 zombies. New monitoring checks could not be launched (fork failed). Monitoring of the host became unreliable.
War Story
We noticed SSH logins to our CI node were hanging after the welcome banner. Nothing in the auth logs. We tried ps aux: the command itself hung for 30 seconds, then returned 31,000 lines of output full of <defunct>. The PID table was 96% full. We couldn't even kill the zombies directly (you can't kill a zombie). We sent SIGTERM to the parent process (the CI agent), which caused it to exit; the 31,000 zombies were then reparented to init, which reaped them immediately. The PID table cleared in 2 seconds, and SSH worked again right away.
Cross-References
- Topic Packs: linux-ops, cicd
- Case Studies: linux_ops/zombie-processes-accumulating/
- Related Patterns: FP-006 (PID exhaustion zombies — same pattern described from resource-exhaustion angle), FP-004 (OOM without swap — shares "new process creation fails" symptom)