Solution

Triage

  1. Count zombie processes:
    ps aux | awk '$8 ~ /^Z/' | wc -l    # matches Z, Z+, Zs, ZN, ...
    
  2. Identify the parent of the zombies:
    ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/' | head -10
    ps -p 1842 -o pid,cmd    # check the common PPID
    
  3. Verify PID space consumption (threads draw from the same ID space as processes):
    cat /proc/sys/kernel/pid_max
    ps -eL --no-headers | wc -l
    
  4. Check if the parent handles SIGCHLD:
    grep SigCgt /proc/1842/status
    # SigCgt is a hex bitmask of caught signals; signal n is bit n-1,
    # so SIGCHLD (signal 17 on x86/ARM Linux) corresponds to mask 0x10000
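
The mask check in step 4 can be decoded with a few lines of Python. This is a minimal sketch, not part of the original procedure: the helper names catches_signal and read_sigcgt are hypothetical, and it assumes the Linux convention that signal n is stored at bit n-1 of the mask, with SIGCHLD being signal 17 as on x86 and ARM.

```python
SIGCHLD_LINUX = 17  # assumption: x86/ARM Linux numbering


def catches_signal(sigcgt_hex: str, signum: int = SIGCHLD_LINUX) -> bool:
    """Return True if the SigCgt hex mask has the bit for signum set."""
    mask = int(sigcgt_hex, 16)
    # Signal n occupies bit n-1 (signal 1 is the least significant bit).
    return bool(mask & (1 << (signum - 1)))


def read_sigcgt(pid: int) -> str:
    """Read the SigCgt field from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            if line.startswith("SigCgt:"):
                return line.split()[1]
    raise ValueError(f"no SigCgt line found for PID {pid}")
```

For example, catches_signal(read_sigcgt(1842)) reports whether the build agent has a SIGCHLD handler installed. A process that ignores SIGCHLD instead of handling it shows the bit in the SigIgn field rather than in SigCgt.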
    

Root Cause

The CI/CD build agent (PID 1842) spawns child processes using fork()/exec() for each build step (compilation, testing, artifact packaging). After each child completes, the parent does not call waitpid() to collect the exit status. The parent also does not set SIGCHLD to SIG_IGN (which would auto-reap on Linux) and has no signal handler for SIGCHLD.

Each completed child becomes a zombie: it retains its entry in the process table (with its PID and exit status) waiting for the parent to read the exit code. Over 3 days of continuous builds, 4,187 zombies have accumulated.
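
The mechanism can be reproduced in isolation. The following is a minimal, Linux-only sketch (the function names are hypothetical and this is not the build agent's code): it forks a child that exits immediately, deliberately does not wait on it, and reads the child's state from /proc to observe the zombie before finally reaping it.

```python
import os
import time


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie(timeout: float = 5.0) -> int:
    """Fork a child that exits at once and leave it unreaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup
    deadline = time.monotonic() + timeout
    # Poll until the kernel has marked the exited child as a zombie.
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    return pid


def reap(pid: int) -> bool:
    """Collect the exit status; return True once the PID entry is gone."""
    os.waitpid(pid, 0)
    return not os.path.exists(f"/proc/{pid}")
```

Calling make_zombie() leaves a process in state Z that stays in the process table until reap() is called on it, which is exactly the step the build agent skips.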

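Zombies consume no CPU and almost no memory, but each occupies one PID. The kernel default for pid_max is 32768 (some distributions raise it; triage step 3 shows the actual value). At the current rate of roughly 1,400 new zombies per day, the remaining PID space would be exhausted in about three weeks, or sooner on a host running many other processes and threads. Once it is exhausted, no new processes can be created on the system.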
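
As a rough check of the timeline, the figures quoted above (4,187 zombies in 3 days, pid_max of 32768) can be turned into an estimate. This is a back-of-the-envelope sketch: the function name is hypothetical, the accumulation rate is assumed to be constant, and other_pids (PIDs held by live processes and threads) is an assumed input you would take from triage step 3.

```python
def days_until_pid_exhaustion(zombies: int, days_elapsed: float,
                              pid_max: int, other_pids: int = 0) -> float:
    """Estimate the days left before the PID space is full."""
    rate_per_day = zombies / days_elapsed          # zombies created per day
    remaining = pid_max - zombies - other_pids     # PIDs still available
    return remaining / rate_per_day


# Figures from this incident, ignoring PIDs used by other processes.
estimate = days_until_pid_exhaustion(4187, 3, 32768)
print(f"about {estimate:.0f} days of headroom")
```

With these inputs the estimate is roughly 20 days; passing a realistic other_pids value shortens it accordingly.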

Fix

Immediate (clear the zombies):

  1. Try sending SIGCHLD to the parent to trigger reaping (only works if a handler exists):

    kill -SIGCHLD 1842
    

  2. If that does not help, restart the build agent:

    systemctl restart build-agent
    
    When the parent exits, all zombies are reparented to PID 1 (systemd), which automatically reaps them.

  3. Verify zombies are cleared:

    ps aux | awk '$8 ~ /^Z/' | wc -l
    

Permanent fix:

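  1. Fix the application code. The parent must do one of the following:
    • Call waitpid(-1, &status, WNOHANG) in a loop at regular points (for example, between build steps) to reap any children that have exited.
    • Install a SIGCHLD handler that calls waitpid() in a loop until no exited children remain.
    • Set signal(SIGCHLD, SIG_IGN) to auto-reap children (if exit codes are not needed).

  2. If the application cannot be modified, run it under a small init wrapper such as tini or dumb-init acting as a subreaper. A subreaper only reaps processes that are reparented to it after their own parent exits; it does not reap direct children of an agent that is still running, so this helps only with orphaned descendants:

    # Under systemd the wrapper is not PID 1, so tini needs -s to register as a subreaper
    ExecStart=/usr/bin/tini -s -- /opt/build-agent/bin/agent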
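
The application-level options in item 1 can be sketched as follows. The build agent's implementation language is not stated, so this is an illustration in Python of the same system calls rather than the agent's actual code; reap_children, install_reaper, spawn_step and the reaped dictionary are hypothetical names.

```python
import os
import signal

reaped = {}  # pid -> exit code (negative signal number if killed by a signal)


def reap_children(signum=None, frame=None):
    """Collect every child that has exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)


def install_reaper():
    """Run reap_children whenever a child exits (call from the main thread)."""
    signal.signal(signal.SIGCHLD, reap_children)


def spawn_step(argv):
    """Fork and exec one build step; return the child's PID."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed in the child
    return pid
```

The loop matters because several children can exit before a single SIGCHLD is delivered, so one waitpid() call per signal may leave zombies behind. The same reap_children function also works without a handler if it is called periodically, for example between build steps.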
    

Rollback / Safety

  • Restarting the build agent will terminate any in-progress builds. Schedule during a maintenance window or wait for builds to complete.
  • kill -SIGCHLD is non-destructive; it only delivers a signal the parent may or may not handle.
  • Increasing pid_max is a temporary workaround, not a fix: sysctl kernel.pid_max=65536.

Common Traps

  • Trying to kill -9 zombie processes. Zombies are already dead. SIGKILL has no effect. The parent must reap them.
  • Killing the wrong parent. Ensure you identify the correct PPID. If the parent is PID 1 (init/systemd), zombies should already be reaped automatically; the issue is elsewhere.
  • Confusing zombies with orphans. An orphan is a running process whose parent died (reparented to PID 1). A zombie is a dead process whose parent has not collected its exit status.
  • Ignoring the problem because zombies use no resources. While true for CPU/memory, PID exhaustion is a real risk on busy systems.
  • Not fixing the code. Restarting the parent periodically is a workaround, not a solution.