Pattern: PID Exhaustion via Zombies¶
ID: FP-006 | Family: Resource Exhaustion | Frequency: Uncommon | Blast Radius: Single Host/Node | Detection Difficulty: Subtle
The Shape¶
A parent process forks children but never calls wait() or waitpid() to reap them. After a child
exits, the kernel keeps its process table entry as a zombie (defunct) until the parent reaps it.
A zombie consumes a PID but no CPU and almost no memory. When the number of PIDs in use reaches
the limit in /proc/sys/kernel/pid_max, no new process can be created on the host: not an SSH
session, not ls, not anything.
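The lifecycle described above can be reproduced in a few lines. This is a minimal sketch, assuming a Linux host with /proc mounted; the helper names (`proc_state`, `fork_without_reaping`, `wait_until_zombie`) are made up for illustration and do not come from any tool mentioned in this pattern.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may itself contain ')',
    # so the state is the character two positions after the last ')'.
    return data[data.rfind(")") + 2]


def fork_without_reaping():
    """Fork a child that exits at once; the parent deliberately skips waitpid()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    return pid


def wait_until_zombie(pid, timeout=2.0):
    """Poll until the child shows state 'Z', or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) == "Z":
            return True
        time.sleep(0.01)
    return False


if __name__ == "__main__":
    child = fork_without_reaping()
    print("zombie:", wait_until_zombie(child))  # entry persists while unreaped
    os.waitpid(child, 0)                        # reaping releases the PID
    print("still listed:", os.path.exists(f"/proc/{child}"))
```

The child stays in state `Z` for as long as the parent is alive and has not called `waitpid()`; the single `os.waitpid()` call is what frees the PID.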
How You'll See It¶
In Linux/Infrastructure¶
$ ps aux | grep defunct | wc -l
4187
$ cat /proc/sys/kernel/pid_max
32768
bash: fork: retry: Resource temporarily unavailable
Every new fork() fails with EAGAIN ("Resource temporarily unavailable").
SSH logins fail not because of network problems but because sshd cannot
fork a handler process. top shows thousands of <defunct> entries.
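Counting zombies tells you there is a leak; grouping them by parent PID tells you which process is leaking. The sketch below reads /proc directly, so it can run inside an already-started interpreter without forking `ps`. It assumes Linux, and the function names are hypothetical.

```python
import glob
from collections import Counter


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    pid_text, _, rest = line.partition(" (")
    # comm may contain spaces or ')', so split on the LAST ") ".
    comm, _, tail = rest.rpartition(") ")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def zombies_by_parent(stat_lines):
    """Count processes in state 'Z', keyed by their parent PID."""
    counts = Counter()
    for line in stat_lines:
        _pid, _comm, state, ppid = parse_stat(line)
        if state == "Z":
            counts[ppid] += 1
    return counts


def read_stat_lines():
    """Read every /proc/<pid>/stat, skipping processes that vanish mid-scan."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            continue  # process exited between listing and reading
    return lines


if __name__ == "__main__":
    for ppid, n in zombies_by_parent(read_stat_lines()).most_common(5):
        print(f"parent {ppid}: {n} zombies")
```

The parent with the largest count is the process to restart or fix.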
In Kubernetes¶
The per-pod PID limit (configured on the kubelet with podPidsLimit or --pod-max-pids, not in
the pod spec) is hit before the host pid_max. New exec commands to containers fail. kubectl exec
returns an error even though the container appears healthy.
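Inside a container, the PID limit is enforced by the cgroup pids controller, so the remaining headroom can be read from two files. This is a rough sketch that assumes cgroup v2 with the container's own cgroup visible at /sys/fs/cgroup; that path and the function names are assumptions, and cgroup v1 hosts lay these files out differently.

```python
from pathlib import Path


def parse_pids_max(text):
    """cgroup v2 pids.max holds either a number or the literal 'max' (no limit)."""
    text = text.strip()
    return None if text == "max" else int(text)


def pids_headroom(current_text, max_text):
    """Return how many PIDs are left before the cgroup limit, or None if unlimited."""
    limit = parse_pids_max(max_text)
    if limit is None:
        return None
    return limit - int(current_text.strip())


def read_cgroup_headroom(cgroup_dir="/sys/fs/cgroup"):
    """Read pids.current and pids.max from the (assumed) cgroup v2 directory."""
    base = Path(cgroup_dir)
    return pids_headroom((base / "pids.current").read_text(),
                         (base / "pids.max").read_text())
```

A headroom near zero with a healthy-looking container is consistent with the exec failures described above. The check has to run from a process that already exists in the container, since starting a new one is exactly what fails.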
In CI/CD¶
A CI agent that runs test subprocesses without properly reaping them accumulates zombies over the course of a day. After several hundred builds, the agent cannot fork new processes. Jobs hang at "starting test runner" — the runner cannot be forked.
The Tell¶
ps aux | grep defunct | wc -l returns a large number. New process creation fails at fork(), before any exec(), with "Resource temporarily unavailable." The number of PIDs in use is close to the value in /proc/sys/kernel/pid_max.
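The "PIDs in use versus pid_max" comparison can be approximated without spawning any process. A minimal sketch, assuming Linux: the fourth field of /proc/loadavg reports runnable and total scheduling entities (processes and threads), and the total is used here as a stand-in for PIDs in use. The function names are hypothetical, and the figure is an approximation rather than an exact PID count.

```python
def tasks_in_use(loadavg_text):
    """Fourth field of /proc/loadavg is 'runnable/total' scheduling entities."""
    return int(loadavg_text.split()[3].split("/")[1])


def pid_usage(loadavg_text, pid_max_text):
    """Fraction of pid_max currently consumed (threads count too)."""
    return tasks_in_use(loadavg_text) / int(pid_max_text.strip())


def read_pid_usage():
    """Read both files from /proc and return the usage fraction."""
    with open("/proc/loadavg") as f:
        loadavg = f.read()
    with open("/proc/sys/kernel/pid_max") as f:
        pid_max = f.read()
    return pid_usage(loadavg, pid_max)
```

Because it only opens two files, a long-running monitoring process can keep evaluating `read_pid_usage()` even after fork() has started failing.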
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| OOM | PID exhaustion | free -h shows memory available; ps shows thousands of defunct entries |
| Network failure | SSH fork failure | SSH connects but sshd cannot fork handler; dmesg shows fork errors |
| Application deadlock | PID table full | Application is healthy; only new process creation fails |
The Fix (Generic)¶
- Immediate: Kill the parent process; its zombies are then adopted by init/systemd, which reaps them. Alternatively, send SIGCHLD to the parent to prompt reaping, which works only if the parent has a SIGCHLD handler that calls waitpid().
- Short-term: Restart the parent process/service with the bug; zombies clear on restart.
- Long-term: Fix the parent to call waitpid() (or signal(SIGCHLD, SIG_IGN) to auto-reap); add zombie count monitoring (ps aux | grep defunct | wc -l) with an alert at 100.
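The long-term fix can be sketched as follows: a SIGCHLD handler that loops over `waitpid(-1, WNOHANG)`, plus the `SIG_IGN` alternative. This is an illustration under assumptions, not the code of any agent described here; the function names are made up. The loop matters because several child exits can be coalesced into one SIGCHLD delivery.

```python
import os
import signal


def reap_children(signum=None, frame=None):
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_reaper():
    """Run reap_children whenever the kernel delivers SIGCHLD."""
    signal.signal(signal.SIGCHLD, reap_children)


def auto_reap():
    """Alternative: ignore SIGCHLD so exited children never become zombies."""
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
```

Both variants discard exit statuses for every child, so they fit parents that do not need them. Code that later waits on a specific child (for example to read its exit code) will find it already reaped.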
Real-World Examples¶
- Example 1: A CI/CD build agent accumulated 4,187 zombies over 3 days (the canonical FP-006 case). The agent's build runner forked a process for each step but did not wait for them after a job abort. The PID table was not exhausted, but usage was approaching 13,000 of 32,768. A restart cleared it; the bug was fixed in the next agent release.
- Example 2: Monitoring agent that executed plugin scripts (nagios-style) didn't reap them on timeout. After 2 weeks, 8,000 zombies. The monitoring agent itself became unable to launch new checks.
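Both examples share one gap: the child is abandoned on the abnormal path (job abort, plugin timeout). A minimal sketch of a step runner that reaps on that path too is shown below; `run_step` is a hypothetical helper, not the API of either agent.

```python
import subprocess


def run_step(cmd, timeout):
    """Run one step and always reap it, even when it overruns."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()         # SIGKILL the overrunning step
        return proc.wait()  # reap it so it does not linger as a zombie
```

The second `proc.wait()` is the part that is easy to forget: killing a child ends it, but only waiting on it removes its process table entry. Note that this handles the direct child only; grandchildren spawned by the step are outside the scope of this sketch.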
War Story¶
A developer reported that SSH to the CI agent was "broken" — telnet worked, but SSH sessions would connect and then immediately drop. I was confused: the SSH handshake completed, the banner appeared, then nothing.
dmesg showed a flood of "fork: retry: Resource temporarily unavailable". Checked ps aux | grep defunct: 4,187 zombie processes. The agent had been running for 72 hours without a restart. Killing the agent process (PID 1 of its cgroup) cleared all zombies instantly; SSH worked again in 10 seconds.
Cross-References¶
- Topic Packs: linux-ops, cicd
- Case Studies: linux_ops/zombie-processes-accumulating/
- Footguns: linux-ops/footguns.md — "Zombie processes accumulating"
- Related Patterns: FP-028 (zombie PID exhaustion — same pattern, different scope lens), FP-004 (OOM — shares "process creation fails" symptom)