
Incident Replay: Zombie Processes Accumulating

Setup

  • System context: Production job runner server that spawns child processes to execute tasks. Over time, the PID table is filling with zombie processes, approaching the system PID limit.
  • Time: Wednesday 16:00 UTC
  • Your role: On-call SRE / Linux engineer

Round 1: Alert Fires

[Pressure cue: "Monitoring alert — server job-runner-01 has 28,000 zombie processes. PID limit is 32,768. New processes are starting to fail with 'Resource temporarily unavailable.'"]

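What you see: ps -eo stat= | grep -c '^Z' shows 28,147 zombie processes. (A plain ps aux | grep -c Z overcounts, because it also matches the VSZ column header and any command line containing a Z.) ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | head shows they are all defunct children of the job runner process (PID 1847). The job runner is still running.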
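The same triage can be scripted so that zombies are grouped by parent PID, which points straight at the process that is failing to reap. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names parse_stat and zombies_by_parent are illustrative and not part of any job runner code.

```python
import os
from collections import Counter


def parse_stat(text):
    """Return (pid, state, ppid) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' rather than on whitespace.
    close = text.rindex(")")
    pid = int(text[: text.index("(")])
    state, ppid = text[close + 1 :].split()[:2]
    return pid, state, int(ppid)


def zombies_by_parent(proc_root="/proc"):
    """Count zombie (state 'Z') processes, grouped by parent PID."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                _, state, ppid = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # process vanished between listdir() and open()
        if state == "Z":
            counts[ppid] += 1
    return counts


if __name__ == "__main__":
    for ppid, n in zombies_by_parent().most_common(5):
        print(f"parent {ppid}: {n} zombies")
```

In the scenario above, nearly all zombies would be expected to show up under parent 1847.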

Choose your action:
  • A) Kill the zombie processes with kill -9
  • B) Identify why the parent process is not reaping its children
  • C) Increase the PID limit with sysctl -w kernel.pid_max=<higher value>
  • D) Restart the job runner service

[Result: You cannot kill a zombie (it is already dead). The parent (PID 1847) must call wait() or waitpid() to reap its children. The job runner source shows that it spawns children with fork(), but the SIGCHLD handler was removed in a code change 3 days ago, so children exit and are never waited on. Proceed to Round 2.]

If you chose A:

[Result: kill -9 on a zombie does nothing. A zombie has already exited; what remains is a process-table entry that holds its exit status until the parent reaps it.]
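The behaviour described above is easy to reproduce on a scratch machine. This is a small sketch, assuming Linux (it reads /proc/<pid>/stat) and a default SIGCHLD disposition in the calling process; the helper names are made up for illustration. It creates one zombie, sends it SIGKILL, and shows that only waitpid() in the parent removes it.

```python
import os
import signal
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (e.g. 'R', 'S', 'Z')."""
    with open(f"/proc/{pid}/stat") as f:
        text = f.read()
    return text[text.rindex(")") + 1 :].split()[0]


def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately; the parent has not waited yet

    # Parent: wait until the kernel marks the exited child as a zombie.
    deadline = time.monotonic() + 5
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before_kill = proc_state(pid)

    os.kill(pid, signal.SIGKILL)  # delivered without error, but changes nothing
    time.sleep(0.05)
    after_kill = proc_state(pid)

    reaped_pid, _status = os.waitpid(pid, 0)  # only this removes the entry
    gone = not os.path.exists(f"/proc/{pid}")
    return before_kill, after_kill, reaped_pid == pid, gone


if __name__ == "__main__":
    print(zombie_demo())
```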

If you chose C:

[Result: Increasing pid_max delays the crisis but zombies continue accumulating. Eventually you hit the higher limit too.]
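How much time a higher limit buys can be estimated with simple arithmetic. The sketch below assumes the 28,147 zombies accumulated at a constant rate over exactly 72 hours, which the incident data does not confirm, and it ignores PIDs held by live processes and threads. The function name is illustrative; 4,194,304 is the largest pid_max value that 64-bit Linux accepts.

```python
def hours_until_exhaustion(pids_in_use, pid_max, leak_per_hour):
    """Hours until the leak fills the PID space, assuming a constant leak rate."""
    if leak_per_hour <= 0:
        raise ValueError("leak_per_hour must be positive")
    return max(pid_max - pids_in_use, 0) / leak_per_hour


if __name__ == "__main__":
    zombies = 28_147
    rate = zombies / 72  # ASSUMPTION: linear accumulation over "3 days"
    for limit in (32_768, 65_536, 4_194_304):
        hours = hours_until_exhaustion(zombies, limit, rate)
        print(f"pid_max={limit}: about {hours:.1f} hours of headroom")
```

Whatever the limit, the result is finite as long as the leak rate is positive, which is why option C only postpones the failure.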

If you chose D:

[Result: Restarting the job runner kills the parent. The zombies are reparented to init (PID 1) or to the nearest subreaper, which reaps them automatically. But you lose all in-flight jobs and do not fix the code bug, so zombies start accumulating again.]

Round 2: First Triage Data

[Pressure cue: "PID table at 87%. New task spawning is failing. Job queue backing up."]

What you see: The code change 3 days ago replaced the SIGCHLD signal handler (which called waitpid()) with an asynchronous notification system that was never fully implemented. Child processes exit but waitpid() is never called.

Choose your action:
  • A) Send SIGCHLD to the parent to trigger a reap attempt
  • B) Kill the parent process to let init reap the zombies, then restart with the fix
  • C) Deploy the fix (restore SIGCHLD handler) and gracefully restart
  • D) Write a helper script that calls waitpid() for the parent

[Result: The SIGCHLD handler fix is a one-line code change. Deploy it. Graceful restart of the job runner: in-flight jobs complete, parent exits (init reaps zombies), new parent starts with the fix. Zombie count drops to 0. Proceed to Round 3.]
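For reference, a restored handler typically looks like the sketch below. It is written in Python purely for illustration, since the job runner's language and its actual handler are not shown; reap_children, install_reaper, spawn_task and exit_status are hypothetical names. The important detail is the loop: several children can exit while a single SIGCHLD is pending, so the handler keeps calling waitpid() with WNOHANG until nothing is left to reap.

```python
import os
import signal
import time

exit_status = {}  # pid -> raw wait status of every reaped child


def reap_children(signum, frame):
    """SIGCHLD handler: reap every child that has exited so far."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        exit_status[pid] = status


def install_reaper():
    signal.signal(signal.SIGCHLD, reap_children)


def spawn_task():
    """Hypothetical stand-in for the runner's task launch: fork a no-op child."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid


if __name__ == "__main__":
    install_reaper()
    pids = [spawn_task() for _ in range(5)]
    deadline = time.monotonic() + 5
    while not all(p in exit_status for p in pids) and time.monotonic() < deadline:
        time.sleep(0.01)
    print(f"reaped {sum(p in exit_status for p in pids)} of {len(pids)} children")
```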

If you chose A:

[Result: The handler was removed, so the parent has no SIGCHLD handler installed. The default disposition of SIGCHLD is to ignore it, so sending the signal does nothing.]

If you chose B:

[Result: Works but kills all in-flight jobs. Graceful restart is better.]

If you chose D:

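[Result: You cannot call waitpid() from outside the parent process. waitpid() only works on the caller's own children and fails with ECHILD for any other PID, so the reaping must be done by the parent itself.]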
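The restriction is enforced by the kernel and is quick to confirm. In this sketch (try_reap is an illustrative name), waiting on a PID that is not a child of the caller raises ChildProcessError, which is Python's mapping of ECHILD, while waiting on the caller's own exited child succeeds.

```python
import os


def try_reap(pid):
    """Attempt a non-blocking waitpid() on pid and report what happened."""
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:  # errno ECHILD: pid is not a child of this process
        return "not our child"
    return "reaped" if reaped_pid == pid else "still running"


if __name__ == "__main__":
    # PID 1 is never a child of an ordinary process, so this cannot succeed.
    print("waitpid(1):", try_reap(1))
```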

Round 3: Root Cause Identification

[Pressure cue: "Zombies cleared. Why did this change go to production?"]

What you see: Root cause: A developer removed the SIGCHLD handler while refactoring the job runner's event loop. The async notification replacement was incomplete. Code review did not catch the missing waitpid() call. The staging test suite runs single-threaded and never forks, so the bug was not exercised.

Choose your action:
  • A) Add a test that exercises the fork/wait lifecycle
  • B) Add zombie process monitoring and alerting
  • C) Add a code review checklist item for process lifecycle management
  • D) All of the above

[Result: Integration test added that forks 100 children and verifies reaping. Zombie count monitoring with alert at 100. Code review checklist updated. Proceed to Round 4.]
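One possible shape for such an integration test is sketched below, assuming Linux and a test process that can call the runner's spawn function directly. fork_noop_task is a stand-in for that function, and both helper names are hypothetical. The check passes only if every forked child has disappeared from /proc as a child of the test process within the timeout, so a missing waitpid() shows up as a non-empty list of leaked PIDs.

```python
import os
import time


def unreaped(pids):
    """Return the PIDs that still exist as children of this process."""
    me = os.getpid()
    left = []
    for pid in pids:
        try:
            with open(f"/proc/{pid}/stat") as f:
                text = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # entry is gone: the child was reaped
        ppid = int(text[text.rindex(")") + 1 :].split()[1])
        if ppid == me:  # guards against the PID having been reused elsewhere
            left.append(pid)
    return left


def leaked_children(spawn, n=100, timeout=10.0):
    """Spawn n children via spawn() and return those never reaped in time."""
    pids = [spawn() for _ in range(n)]
    deadline = time.monotonic() + timeout
    while unreaped(pids) and time.monotonic() < deadline:
        time.sleep(0.05)
    return unreaped(pids)


def fork_noop_task():
    """Stand-in for the job runner's spawn call: fork a child that exits at once."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid
```

A CI test would then assert that leaked_children(fork_noop_task, n=100) returns an empty list while the runner's reaping logic is active.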

If you chose A:

[Result: Test catches the regression in CI but does not detect it in production if it slips through.]

If you chose B:

[Result: Monitoring detects the problem earlier but does not prevent it.]

If you chose C:

[Result: Checklist helps but is only as good as the reviewer's diligence.]

Round 4: Remediation

[Pressure cue: "Job runner healthy. Verify and close."]

Actions:
  1. Verify zero zombies: ps -eo stat= | grep -c '^Z' returns 0 (or near 0)
  2. Verify job processing is normal: check job queue depth
  3. Verify PID usage is healthy: ls /proc | grep -c '^[0-9]'
  4. Deploy the SIGCHLD handler fix permanently
  5. Add zombie count monitoring to the server fleet
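The PID usage check in step 3 can be turned into a percentage of the limit, which is easier to alert on. This is a minimal sketch, assuming Linux; pid_usage is an illustrative name. It counts the numeric entries in /proc and compares them with /proc/sys/kernel/pid_max. Threads also consume PIDs but do not appear as top-level /proc entries, so the figure is a lower bound.

```python
import os


def pid_usage(proc_root="/proc"):
    """Return (processes, pid_max, percent_used) based on entries under /proc."""
    in_use = sum(1 for name in os.listdir(proc_root) if name.isdigit())
    with open(os.path.join(proc_root, "sys", "kernel", "pid_max")) as f:
        pid_max = int(f.read())
    return in_use, pid_max, 100.0 * in_use / pid_max


if __name__ == "__main__":
    in_use, pid_max, pct = pid_usage()
    print(f"{in_use} of {pid_max} PIDs in use ({pct:.1f}%)")
```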

Damage Report

  • Total downtime: 0 (job runner was running but unable to spawn new tasks)
  • Blast radius: Job processing degraded for 3 days (zombies accumulated gradually); critical failures in last 2 hours
  • Optimal resolution time: 15 minutes (identify missing waitpid -> fix SIGCHLD handler -> restart)
  • If every wrong choice was made: 2+ hours with failed kill attempts and PID exhaustion

Cross-References