Pattern: PID Exhaustion via Zombies

ID: FP-006
Family: Resource Exhaustion
Frequency: Uncommon
Blast Radius: Single Host/Node
Detection Difficulty: Subtle

The Shape

A parent process forks children but never calls waitpid() to reap them. After the child exits, the OS keeps its process table entry as a zombie (defunct) until the parent reaps it. Zombies consume a PID but no CPU or meaningful memory. When the PID table fills up (/proc/sys/kernel/pid_max), no new processes can be created on the host — not even SSH sessions, not ls, not anything.
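The mechanism above can be demonstrated directly. A minimal sketch (Linux-only, since it reads /proc): fork a child, let it exit, and observe that its process table entry lingers in state 'Z' until the parent calls waitpid().

```python
import os
import time

# Fork a child that exits immediately. The parent deliberately
# does not reap it right away.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits; its PID is not yet released

time.sleep(0.2)          # give the child time to exit

# The parent hasn't called waitpid(), so the child's process table
# entry is still present, in state 'Z' (zombie/defunct).
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)             # 'Z'

os.waitpid(pid, 0)       # reaping releases the entry and frees the PID
```

A buggy parent is simply one that never reaches that final waitpid() call; each forked-and-exited child permanently holds one slot in the PID table.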

How You'll See It

In Linux/Infrastructure

$ ps aux | grep defunct | wc -l
4187

$ cat /proc/sys/kernel/pid_max
32768

$ ls
-bash: fork: retry: Resource temporarily unavailable

The host looks healthy: CPU low, memory fine, no disk issues. But fork() fails. SSH connections are refused not because of network problems but because sshd cannot fork a handler process. top shows thousands of <defunct> entries.

In Kubernetes

Kubernetes does not expose a per-container PID limit in resources.limits; the per-pod limit is set at the kubelet level (podPidsLimit in the kubelet configuration, or the --pod-max-pids flag) and is typically hit well before the host pid_max. New exec commands to containers fail: kubectl exec returns an error even though the container appears healthy.
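As a reference point, a per-pod PID ceiling is configured in the kubelet, not the pod spec. A hedged fragment (field names per the kubelet v1beta1 configuration API; the value is illustrative):

```yaml
# KubeletConfiguration fragment -- applies to every pod on the node
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096   # max PIDs (tasks) per pod; -1 disables the limit
```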

In CI/CD

A CI agent that runs test subprocesses without properly reaping them accumulates zombies over the course of a day. After several hundred builds, the agent cannot fork new processes. Jobs hang at "starting test runner" — the runner cannot be forked.

The Tell

ps aux | grep defunct | wc -l returns a large number. New process creation (fork(), exec()) fails with "Resource temporarily unavailable" (EAGAIN). The total process count is approaching the value of /proc/sys/kernel/pid_max.
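The zombie count can also be taken without shelling out to ps, which matters when fork() itself is failing. A minimal sketch that scans /proc directly (Linux-only; the function name is illustrative):

```python
import os

def count_zombies():
    """Count processes in state 'Z' by scanning /proc directly,
    equivalent to `ps aux | grep defunct | wc -l` but fork-free."""
    n = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Format: "pid (comm) state ..."; comm may contain spaces,
                # so split on the last ')' before reading the state field.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue  # process exited while we were scanning
        if state == "Z":
            n += 1
    return n

print(count_zombies())
```

Because this only opens files in /proc, it keeps working even when the PID table is already full and ps cannot be forked.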

Common Misdiagnosis

Looks Like              But Actually         How to Tell the Difference
OOM                     PID exhaustion       free -h shows memory available; ps shows thousands of defunct entries
Network failure         SSH fork failure     SSH connects but sshd cannot fork a handler; dmesg shows fork errors
Application deadlock    PID table full       Application is healthy; only new process creation fails

The Fix (Generic)

  1. Immediate: Kill the parent process; its zombies are adopted by init/systemd, which reaps them immediately. Alternatively, send SIGCHLD to the parent, though this prompts reaping only if the parent has a handler that calls waitpid() and merely missed an earlier signal.
  2. Short-term: Restart the parent process/service with the bug; zombies clear on restart.
  3. Long-term: Fix the parent to call waitpid() (or signal(SIGCHLD, SIG_IGN) to auto-reap); add zombie count monitoring (ps aux | grep defunct | wc -l) with an alert at 100.
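The long-term fix in step 3 can be sketched as a SIGCHLD handler that reaps every exited child non-blockingly, so no zombie outlives its parent's next signal delivery (a minimal sketch; the handler name is illustrative):

```python
import os
import signal
import time

def reap(signum, frame):
    """Reap all exited children without blocking. Looping matters:
    multiple SIGCHLDs can coalesce into a single delivery."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # remaining children are still running

signal.signal(signal.SIGCHLD, reap)

# Demo: fork a child that exits; the handler reaps it automatically.
child = os.fork()
if child == 0:
    os._exit(0)
time.sleep(0.2)  # handler runs when SIGCHLD is delivered

# The child is gone from the process table: waitpid finds no trace.
try:
    os.waitpid(child, os.WNOHANG)
    reaped = False
except ChildProcessError:
    reaped = True
print(reaped)
```

On POSIX systems, signal(SIGCHLD, SIG_IGN) achieves the same end with no handler at all: the kernel discards children's exit status and never creates the zombie, at the cost of losing access to exit codes.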

Real-World Examples

  • Example 1: CI/CD build agent accumulated 4,187 zombies over 3 days (FP-006 canonical case). The agent's build runner forked processes for each step but didn't wait for them after a job abort. PID table wasn't exhausted but was approaching 13,000/32,768. Restart cleared it; bug was fixed in next agent release.
  • Example 2: Monitoring agent that executed plugin scripts (nagios-style) didn't reap them on timeout. After 2 weeks, 8,000 zombies. The monitoring agent itself became unable to launch new checks.

War Story

A developer reported that SSH to the CI agent was "broken" — telnet worked, but SSH sessions would connect and then immediately drop. I was confused: the SSH handshake completed, the banner appeared, then nothing. dmesg showed a flood of "fork: retry: Resource temporarily unavailable". Checked ps aux | grep defunct: 4,187 zombie processes. The agent had been running for 72 hours without a restart. Killing the agent process (PID 1 of its cgroup) cleared all zombies instantly; SSH worked again in 10 seconds.

Cross-References