Pattern: PID Exhaustion via Zombies

ID: FP-006
Family: Resource Exhaustion
Frequency: Uncommon
Blast Radius: Single Host/Node
Detection Difficulty: Subtle

The Shape

A parent process forks children but never calls waitpid() to reap them. After the child exits, the OS keeps its process table entry as a zombie (defunct) until the parent reaps it. Zombies consume a PID but no CPU or meaningful memory. When the PID table fills up (/proc/sys/kernel/pid_max), no new processes can be created on the host — not even SSH sessions, not ls, not anything.
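The mechanism above can be demonstrated directly. A minimal sketch (Linux-only, since it reads /proc): fork a child, let it exit, and observe that its process table entry lingers in state 'Z' until the parent calls waitpid().

```python
import os
import time

# Fork a child that exits immediately. The parent deliberately
# does not reap it right away.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits; its PID is not yet released

time.sleep(0.2)          # give the child time to exit

# The parent hasn't called waitpid(), so the child's process table
# entry is still present, in state 'Z' (zombie/defunct).
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)             # 'Z'

os.waitpid(pid, 0)       # reaping releases the entry and frees the PID
```

A buggy parent is simply one that never reaches that final waitpid() call; each forked-and-exited child permanently holds one slot in the PID table.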

How You'll See It

In Linux/Infrastructure

$ ps aux | grep defunct | wc -l
4187

$ cat /proc/sys/kernel/pid_max
32768

$ ls
-bash: fork: retry: Resource temporarily unavailable

The host looks healthy: CPU low, memory fine, no disk issues. But fork() fails. SSH connections are refused not because of network problems but because sshd cannot fork a handler process. top shows thousands of <defunct> entries.

In Kubernetes

Kubernetes does not expose a per-container PID limit in resources.limits; the per-pod limit is set at the kubelet level (podPidsLimit in the kubelet configuration, or the --pod-max-pids flag) and is typically hit well before the host pid_max. New exec commands to containers fail: kubectl exec returns an error even though the container appears healthy.
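As a reference point, a per-pod PID ceiling is configured in the kubelet, not the pod spec. A hedged fragment (field names per the kubelet v1beta1 configuration API; the value is illustrative):

```yaml
# KubeletConfiguration fragment -- applies to every pod on the node
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096   # max PIDs (tasks) per pod; -1 disables the limit
```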

In CI/CD

A CI agent that runs test subprocesses without properly reaping them accumulates zombies over the course of a day. After several hundred builds, the agent cannot fork new processes. Jobs hang at "starting test runner" — the runner cannot be forked.

The Tell

ps aux | grep defunct | wc -l returns a large number. New process creation (fork(), exec()) fails with "Resource temporarily unavailable" (EAGAIN). The total process count is approaching the value of /proc/sys/kernel/pid_max.
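The zombie count can also be taken without shelling out to ps, which matters when fork() itself is failing. A minimal sketch that scans /proc directly (Linux-only; the function name is illustrative):

```python
import os

def count_zombies():
    """Count processes in state 'Z' by scanning /proc directly,
    equivalent to `ps aux | grep defunct | wc -l` but fork-free."""
    n = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat") as f:
                # Format: "pid (comm) state ..."; comm may contain spaces,
                # so split on the last ')' before reading the state field.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue  # process exited while we were scanning
        if state == "Z":
            n += 1
    return n

print(count_zombies())
```

Because this only opens files in /proc, it keeps working even when the PID table is already full and ps cannot be forked.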

Common Misdiagnosis

Looks Like              But Actually         How to Tell the Difference
OOM                     PID exhaustion       free -h shows memory available; ps shows thousands of defunct entries
Network failure         SSH fork failure     SSH connects but sshd cannot fork a handler; dmesg shows fork errors
Application deadlock    PID table full       Application is healthy; only new process creation fails

The Fix (Generic)

  1. Immediate: Kill the parent process; its zombies are adopted by init/systemd, which reaps them immediately. Alternatively, send SIGCHLD to the parent, though this prompts reaping only if the parent has a handler that calls waitpid() and merely missed an earlier signal.
  2. Short-term: Restart the parent process/service with the bug; zombies clear on restart.
  3. Long-term: Fix the parent to call waitpid() (or signal(SIGCHLD, SIG_IGN) to auto-reap); add zombie count monitoring (ps aux | grep defunct | wc -l) with an alert at 100.
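The long-term fix in step 3 can be sketched as a SIGCHLD handler that reaps every exited child non-blockingly, so no zombie outlives its parent's next signal delivery (a minimal sketch; the handler name is illustrative):

```python
import os
import signal
import time

def reap(signum, frame):
    """Reap all exited children without blocking. Looping matters:
    multiple SIGCHLDs can coalesce into a single delivery."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # remaining children are still running

signal.signal(signal.SIGCHLD, reap)

# Demo: fork a child that exits; the handler reaps it automatically.
child = os.fork()
if child == 0:
    os._exit(0)
time.sleep(0.2)  # handler runs when SIGCHLD is delivered

# The child is gone from the process table: waitpid finds no trace.
try:
    os.waitpid(child, os.WNOHANG)
    reaped = False
except ChildProcessError:
    reaped = True
print(reaped)
```

On POSIX systems, signal(SIGCHLD, SIG_IGN) achieves the same end with no handler at all: the kernel discards children's exit status and never creates the zombie, at the cost of losing access to exit codes.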

Real-World Examples

  • Example 1: CI/CD build agent accumulated 4,187 zombies over 3 days (FP-006 canonical case). The agent's build runner forked processes for each step but didn't wait for them after a job abort. PID table wasn't exhausted but was approaching 13,000/32,768. Restart cleared it; bug was fixed in next agent release.
  • Example 2: Monitoring agent that executed plugin scripts (nagios-style) didn't reap them on timeout. After 2 weeks, 8,000 zombies. The monitoring agent itself became unable to launch new checks.

War Story

A developer reported that SSH to the CI agent was "broken" — telnet worked, but SSH sessions would connect and then immediately drop. I was confused: the SSH handshake completed, the banner appeared, then nothing. dmesg showed a flood of "fork: retry: Resource temporarily unavailable". Checked ps aux | grep defunct: 4,187 zombie processes. The agent had been running for 72 hours without a restart. Killing the agent process (PID 1 of its cgroup) cleared all zombies instantly; SSH worked again in 10 seconds.

Cross-References