
Pattern: Zombie Process Accumulation

ID: FP-028
Family: Silent Corruption
Frequency: Uncommon
Blast Radius: Single Host
Detection Difficulty: Subtle

The Shape

Zombie (defunct) processes accumulate when a parent process never calls wait() or waitpid() to reap its exited children. The accumulation is slow and easy to miss: a zombie uses no CPU and almost no memory, only a slot in the PID table. Over hours or days the count grows until new process creation fails across the entire host, including unrelated services, SSH sessions, and system commands. The host appears healthy until it suddenly isn't.
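To see the mechanism in isolation, the sketch below creates a single zombie on purpose. It assumes a Linux host with /proc mounted and a POSIX shell; make_zombie and proc_state are hypothetical helper names, and sleep stands in for both the leaking parent and its child.

```shell
#!/bin/sh
# Sketch only: create one zombie on purpose so it can be observed.
# make_zombie HOLD_SECONDS prints "PARENT_PID CHILD_PID".
make_zombie() {
    hold_seconds=$1
    pidfile=$(mktemp) || return 1
    # The subshell starts a child (sleep 1), records its PID, then exec
    # replaces the subshell with `sleep HOLD_SECONDS`. That process never
    # calls wait(), so the exited child stays <defunct> until it exits too.
    ( sleep 1 & echo "$!" > "$pidfile"; exec sleep "$hold_seconds" ) >/dev/null 2>&1 &
    parent=$!
    sleep 2    # give the child time to exit
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    echo "$parent $child"
}

# proc_state PID prints the one-letter process state (R, S, Z, ...) from /proc.
proc_state() {
    [ -r "/proc/$1/stat" ] || return 1
    read -r line < "/proc/$1/stat"
    rest=${line##*") "}    # strip "PID (comm) "; comm may contain spaces
    echo "${rest%% *}"
}
```

For example, make_zombie 30 prints two PIDs; proc_state on the second one should report Z for as long as the first is alive, and killing the first hands the child to init, which normally reaps it.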

How You'll See It

In Linux/Infrastructure

$ ps aux | grep '[d]efunct'
user   12345  0.0  0.0      0     0 ?    Z    09:15   0:00 [worker] <defunct>
...
$ ps aux | grep -c '[d]efunct'
4187

$ cat /proc/sys/kernel/pid_max
32768
$ ps aux | wc -l
31900   # approaching ceiling

SSH logins fail because sshd cannot fork, and top starts returning errors. Shells and logs report: fork: retry: Resource temporarily unavailable.
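Matching on the word "defunct" can also count the grep process itself or any command line that happens to contain that word, and on Linux threads draw their IDs from the same space as processes. A slightly more careful sketch, assuming a procps-style ps; count_zombies and pid_usage_percent are hypothetical helper names:

```shell
#!/bin/sh
# Count zombies from `ps -eo stat=` style input (one STAT value per line).
# A STAT value starting with Z marks a zombie.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# pid_usage_percent USED MAX prints the integer percentage of PID space in use.
pid_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Typical use on a live host (not executed here):
#   ps -eo stat= | count_zombies
#   pid_usage_percent "$(ps -eL -o lwp= | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

With the figures from the session above, pid_usage_percent 31900 32768 prints 97.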

In Kubernetes

The container runtime (containerd, dockerd) fails to launch new containers. Kubelet reports: "failed to create containerd task" or "failed to fork." The node appears Ready but cannot start new pods. Existing pods run fine.

In CI/CD

A build agent processes jobs, and each job spawns subprocesses. After 500 jobs without an agent restart, the PID table is filling up and new job steps start to fail with fork errors. The agent appears healthy (it is running and accepting jobs) but cannot execute them.

The Tell

The zombie count (ps aux | grep -c '[d]efunct') is large, in the hundreds to thousands. New process creation fails with fork: retry: Resource temporarily unavailable. ps aux | wc -l approaches /proc/sys/kernel/pid_max. CPU and memory look normal; only new fork/exec operations fail.

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| OOM | PID exhaustion | free -h shows memory available; ps shows zombies |
| Network failure | SSH can't fork handler | Network is fine; SSH handshake completes but login process can't start |
| Random service failures | PID table full | All new process creation fails; not specific to one service |

The Fix (Generic)

  1. Immediate: Terminate the parent process of the zombies. Zombies cannot be killed directly; once their parent exits, they are reparented to init/systemd, which reaps them.
  2. Short-term: Restart the service that contains the zombie-leaking bug.
  3. Long-term: Fix the parent to call waitpid() for every child, or set SA_NOCLDWAIT on its SIGCHLD disposition so exited children never become zombies; monitor the zombie count with ps -eo stat= | grep -c '^Z' and alert at 100.
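The immediate step needs the PID of the leaking parent, and the long-term step needs a check that can drive an alert. A minimal sketch of both, assuming procps-style ps output fed on stdin; zombies_by_parent and check_zombies are hypothetical helper names, and the exit code 2 follows the common Nagios-style convention rather than anything prescribed here.

```shell
#!/bin/sh
# Group zombies by parent from `ps -eo ppid=,stat=` style input.
# Prints "COUNT PPID" lines, largest count first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { c[$1]++ } END { for (p in c) print c[p], p }' | sort -rn
}

# check_zombies COUNT [THRESHOLD]
# Returns 0 below the threshold, 2 at or above it (default threshold: 100).
check_zombies() {
    count=$1
    threshold=${2:-100}
    if [ "$count" -ge "$threshold" ]; then
        echo "CRITICAL: $count zombie processes (threshold $threshold)"
        return 2
    fi
    echo "OK: $count zombie processes"
}

# Typical use on a live host (not executed here):
#   ps -eo ppid=,stat= | zombies_by_parent | head -n 5
#   check_zombies "$(ps -eo stat= | grep -c '^Z')"
```

The PPID at the top of the first listing is the process to restart or fix; the second command is meant to be run from whatever scheduler or monitoring agent is already in place.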

Real-World Examples

  • Example 1: A CI/CD agent accumulated 4,187 zombies over 72 hours. The agent was processing aborted jobs without waiting for their child processes to exit. The PID table was 13% full. Restarting the agent cleared all zombies in 5 seconds.
  • Example 2: A monitoring agent ran plugin scripts in subprocesses. After 14 days without a restart it had accumulated 8,000 zombies. New monitoring checks could not be launched (fork failed), and monitoring of the host became unreliable.
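The numbers in Example 1 can be turned into a rough time-to-exhaustion estimate. The sketch below assumes a pid_max of 32768 (the value shown earlier), a constant leak rate, and ignores the PIDs held by live processes, so the real margin would be smaller; project_exhaustion is a hypothetical helper name.

```shell
#!/bin/sh
# project_exhaustion ZOMBIES HOURS PID_MAX
# Prints the share of PID space used by zombies, the leak rate, and the
# hours left until PID_MAX is reached at that rate (linear assumption).
project_exhaustion() {
    awk -v z="$1" -v h="$2" -v max="$3" 'BEGIN {
        rate = z / h
        printf "used=%.1f%% rate=%.1f/h hours_left=%.1f\n", z * 100 / max, rate, (max - z) / rate
    }'
}

project_exhaustion 4187 72 32768
# prints: used=12.8% rate=58.2/h hours_left=491.5
```

At roughly 58 zombies per hour the host in Example 1 had about 20 more days before fork failures, which is why this pattern tends to go unnoticed until it is critical.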

War Story

We noticed SSH logins to our CI node were hanging after the welcome banner. Nothing in auth logs. Tried ps aux: the command itself hung for 30 seconds, then returned a 31,000-line output full of <defunct>. The PID table was 96% full. We couldn't even kill the zombies directly (you can't kill a zombie). We sent SIGTERM to the parent process (the CI agent). Once it exited, init adopted all 31,000 zombies and reaped them immediately. PID table cleared in 2 seconds. SSH worked again immediately.

Cross-References

  • Topic Packs: linux-ops, cicd
  • Case Studies: linux_ops/zombie-processes-accumulating/
  • Related Patterns: FP-006 (PID exhaustion zombies — same pattern described from resource-exhaustion angle), FP-004 (OOM without swap — shares "new process creation fails" symptom)