Solution
Triage
- Count zombie processes:
- Identify the parent of the zombies:
- Verify PID space consumption:
- Check if the parent handles SIGCHLD:
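The four triage steps above can be sketched as follows. This is a minimal sketch, assuming a Linux host with procps `ps` and a readable /proc. The function names are illustrative, 1842 is the build agent PID from this incident, and the SIGCHLD bit position (signal 17, mask 0x10000) holds on x86 and ARM Linux but not on every architecture.

```shell
#!/bin/sh
# Triage sketch. Assumes Linux with procps `ps`; function names are illustrative.

# 1. Count zombie processes (state code starts with Z).
count_zombies() {
    ps -eo stat= | awk '$1 ~ /^Z/ {n++} END {print n + 0}'
}

# 2. Identify the parent of the zombies.
#    Prints "<zombie count> <PPID>", busiest parent first.
zombie_parents() {
    ps -eo ppid=,stat= \
        | awk '$2 ~ /^Z/ {c[$1]++} END {for (p in c) print c[p], p}' \
        | sort -rn
}

# 3. Verify PID space consumption: processes in use versus kernel.pid_max.
#    Threads draw IDs from the same space, so this undercounts on threaded hosts.
pid_usage() {
    used=$(ps -e -o pid= | wc -l)
    max=$(cat /proc/sys/kernel/pid_max)
    echo "$used of $max PIDs in use"
}

# 4. Check if the parent handles SIGCHLD.
#    SigCgt and SigIgn in /proc/<pid>/status are hex bitmasks.
#    Assumption: SIGCHLD is signal 17 (bit 0x10000), as on x86 and ARM Linux.
mask_has_sigchld() {
    low=$(printf '%s' "$1" | tail -c 8)   # low 32 bits are enough for signal 17
    [ $(( 0x$low & 0x10000 )) -ne 0 ]
}

sigchld_disposition() {
    pid=${1:-1842}   # 1842 is the build agent PID from this incident
    [ -r "/proc/$pid/status" ] || { echo "no such process: $pid" >&2; return 1; }
    cgt=$(awk '/^SigCgt:/ {print $2}' "/proc/$pid/status")
    ign=$(awk '/^SigIgn:/ {print $2}' "/proc/$pid/status")
    if mask_has_sigchld "$cgt"; then
        echo "caught (handler installed)"
    elif mask_has_sigchld "$ign"; then
        echo "ignored (children are auto-reaped)"
    else
        echo "default (children stay zombies until waited for)"
    fi
}

# Typical use:
#   count_zombies
#   zombie_parents | head -n 5
#   pid_usage
#   sigchld_disposition 1842
```

A parent that reports "default" and appears at the top of `zombie_parents` matches the pattern described under Root Cause.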
Root Cause
The CI/CD build agent (PID 1842) spawns child processes using fork()/exec() for each build step (compilation, testing, artifact packaging). After each child completes, the parent does not call waitpid() to collect the exit status. The parent also does not set SIGCHLD to SIG_IGN (which would auto-reap on Linux) and has no signal handler for SIGCHLD.
Each completed child becomes a zombie: it retains its entry in the process table (with its PID and exit status) waiting for the parent to read the exit code. Over 3 days of continuous builds, 4,187 zombies have accumulated.
Zombies consume essentially no CPU or memory, but each occupies one PID. The default pid_max is 32768. At the observed rate (4,187 zombies in 3 days, roughly 1,400 per day), the remaining PID space would be exhausted in about three weeks, sooner if build volume grows. After that, no new processes can be created on the system.
Fix
Immediate (clear the zombies):
- Try sending SIGCHLD to the parent to trigger reaping (only works if a handler exists):
- If that does not help, restart the build agent:
  When the parent exits, all zombies are reparented to PID 1 (systemd), which automatically reaps them.
- Verify zombies are cleared:
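The immediate steps can be sketched as below. This is a sketch under assumptions: the PID 1842 comes from this incident, while the systemd unit name `build-agent.service` is hypothetical and must be replaced with the real service name. The function names are illustrative.

```shell
#!/bin/sh
# Immediate-fix sketch. AGENT_UNIT is a hypothetical unit name, not taken
# from the incident; AGENT_PID is the build agent PID from Root Cause.
AGENT_PID=${AGENT_PID:-1842}
AGENT_UNIT=${AGENT_UNIT:-build-agent.service}

# Step 1: nudge the parent with SIGCHLD. Harmless if it has no handler.
nudge_parent() {
    kill -s CHLD "$1"
}

# Step 2: restart the agent. This terminates in-progress builds.
restart_agent() {
    systemctl restart "$1"
}

# Step 3: count zombies still parented by the given PID (0 means cleared).
zombies_of() {
    ps -eo ppid=,stat= \
        | awk -v p="$1" '$1 == p && $2 ~ /^Z/ {n++} END {print n + 0}'
}

# Typical sequence:
#   nudge_parent "$AGENT_PID"; sleep 2; zombies_of "$AGENT_PID"
#   restart_agent "$AGENT_UNIT"     # only if the count did not drop
#   ps -eo stat= | grep -c '^Z'     # system-wide count after the restart
```

After a restart the agent runs under a new PID, so the system-wide count is the meaningful check at that point.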
Permanent fix:
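- Fix the application code. The parent must either:
  - Call waitpid(-1, &status, WNOHANG) in a loop, for example after each build step, until it returns 0 or -1.
  - Install a SIGCHLD handler that calls waitpid() with WNOHANG in a loop, because several child exits can be delivered as a single signal.
- Set signal(SIGCHLD, SIG_IGN) to auto-reap children (if exit codes are not needed).
- If the application cannot be modified, wrap it with a process that acts as a subreaper. A subreaper only reaps descendants that are reparented to it when their own parent exits, so this covers orphaned grandchildren, not children that the agent itself fails to wait for.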
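One way to sketch the subreaper wrapper is with tini, whose `-s` flag registers it as a child subreaper. The choice of tini is an assumption, since no tool is named here, and `run_under_subreaper` is an illustrative helper name. The sketch falls back to running the command directly when tini is not installed.

```shell
#!/bin/sh
# Subreaper wrapper sketch. tini is an assumed tool choice; any init-style
# wrapper that sets PR_SET_CHILD_SUBREAPER would serve the same purpose.

# run_under_subreaper CMD [ARGS...]
run_under_subreaper() {
    if command -v tini >/dev/null 2>&1; then
        # -s registers tini as a subreaper even when it is not PID 1.
        tini -s -- "$@"
    else
        echo "tini not found; running without a subreaper" >&2
        "$@"
    fi
}

# Example (the agent path is hypothetical):
#   run_under_subreaper /opt/build-agent/bin/agent
```

Orphaned descendants of the wrapped command are then reparented to tini rather than to PID 1, and tini reaps them when they exit.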
Rollback / Safety
- Restarting the build agent will terminate any in-progress builds. Schedule during a maintenance window or wait for builds to complete.
- kill -SIGCHLD is non-destructive; it only delivers a signal the parent may or may not handle.
- Increasing pid_max is a temporary workaround, not a fix: sysctl kernel.pid_max=65536.
Common Traps
- Trying to kill -9 zombie processes. Zombies are already dead. SIGKILL has no effect. The parent must reap them.
- Killing the wrong parent. Ensure you identify the correct PPID. If the parent is PID 1 (init/systemd), zombies should already be reaped automatically; the issue is elsewhere.
- Confusing zombies with orphans. An orphan is a running process whose parent died (reparented to PID 1). A zombie is a dead process whose parent has not collected its exit status.
- Ignoring the problem because zombies use no resources. While true for CPU/memory, PID exhaustion is a real risk on busy systems.
- Not fixing the code. Restarting the parent periodically is a workaround, not a solution.
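The first trap can be reproduced safely on a test machine. The sketch below is Linux-only (it reads /proc) and its function names are illustrative. It creates a short-lived zombie, sends it SIGKILL, and prints its state afterwards.

```shell
#!/bin/sh
# Demonstration sketch: SIGKILL does not remove a zombie. Linux-only.

# Print the one-letter state (R, S, Z, ...) of a PID from /proc.
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Print the PID of the first zombie whose parent is $1, if any.
zombie_child_of() {
    for f in /proc/[0-9]*/status; do
        awk -v want="$1" '
            /^Pid:/   {pid = $2}
            /^PPid:/  {ppid = $2}
            /^State:/ {state = $2}
            END {if (ppid == want && state == "Z") print pid}
        ' "$f" 2>/dev/null
    done | head -n 1
}

demo_kill9_on_zombie() {
    # The inner shell backgrounds `true`, then execs `sleep`, which becomes
    # the parent of `true` and never waits for it, so `true` turns into a zombie.
    sh -c 'true & exec sleep 5' >/dev/null 2>&1 &
    parent=$!
    sleep 1
    zombie=$(zombie_child_of "$parent")
    if [ -z "$zombie" ]; then
        echo "no zombie found" >&2
        kill "$parent" 2>/dev/null
        return 1
    fi
    kill -9 "$zombie"       # the signal is accepted but changes nothing
    sleep 1
    proc_state "$zombie"    # state of the zombie after SIGKILL
    # Clean up: once the parent dies, the zombie is reparented to PID 1
    # (or the nearest subreaper), which normally reaps it.
    kill "$parent"
    wait "$parent" 2>/dev/null || true
}
```

Running `demo_kill9_on_zombie` should print the state of the child after the SIGKILL; it remains a zombie until its parent exits or reaps it.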