---
title: The Hanging Deploy
tags:
  - lesson
  - processes
  - signals
  - systemd
  - bash-job-control
  - containers
  - cgroups
---

# The Hanging Deploy
Topics: processes, signals, systemd, bash job control, containers, cgroups
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's 2pm on a Thursday. You ran a deploy script and it just... stopped. The terminal is frozen. No output. No error. Ctrl+C does nothing. Your colleague asks "is the deploy done?" and you don't know.
This happens more than anyone admits. And the fix depends on understanding something most people never learn properly: what a process actually is, how processes talk to each other, and what happens when they don't.
By the end of this lesson you'll understand:

- What your shell actually does when you run a command
- Why processes hang, and the systematic way to unstick them
- How signals work (and why Ctrl+C sometimes doesn't)
- What zombies are, why they exist, and why they're usually harmless
- How systemd wraps all of this, and how containers change the rules
We'll start at the bottom — what the kernel actually does when you type a command — and work our way up through signals, job control, systemd, and containers. Each layer builds on the one before it.
Part 1: What Just Happened When You Ran That Script¶
When you typed ./deploy.sh and hit enter, your shell did something beautifully simple
that's been the same since 1969: it called fork(), then exec().
# This is what your shell does internally (simplified):
pid = fork()              # Create an exact copy of the shell process
if pid == 0:              # In the child process:
    exec("./deploy.sh")   #   Replace yourself with the deploy script
else:                     # In the parent (your shell):
    waitpid(pid)          #   Sit here until the child exits
That's it. That's the whole model. Every command you've ever run in a terminal works this way.
Under the Hood:
`fork()` doesn't actually copy all the memory — it uses copy-on-write (COW). The parent and child share the same physical memory pages until one of them writes to a page, at which point the kernel copies just that page. This makes `fork()` fast even for processes using gigabytes of RAM. Redis exploits this for background saves — it forks, the child writes the snapshot to disk, and because most pages are read-only, almost no memory is copied.

Name Origin: `fork()` was adapted from Multics for the original Unix at Bell Labs. The Unix twist: the child is an exact duplicate of the parent, distinguished only by the return value. In the child, `fork()` returns 0. In the parent, it returns the child's PID. This elegant split means the child can set up redirections, change directories, or drop privileges between `fork()` and `exec()` — which is why shell I/O redirection works at all.
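You can watch `exec()` replace a process image using the shell's own `exec` builtin. A quick sketch (the echo commands are just stand-ins for real programs):

```shell
# exec replaces the current process image; code after it never runs.
OUT=$(bash -c 'echo "child PID $$ about to exec"; exec echo "same PID, new program"; echo "never reached"')
echo "$OUT"
```

The final `echo "never reached"` never prints, because by then the bash process has been replaced wholesale by `echo`.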
Why this matters for your stuck deploy¶
Your shell is sitting in waitpid() — it's blocked, waiting for the child process (your
deploy script) to exit. The script hasn't exited. That's why your terminal is frozen.
But here's the thing: deploy.sh probably isn't one process. It's a tree of processes. The
script calls ssh, which calls apt, which calls dpkg, which forks workers. Any one of
those descendants could be the one that's stuck.
Let's find out which one. Open another terminal and look at the process tree hanging off your shell — `pstree -p <shell-pid>` shows it at a glance.
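A runnable sketch of the idea (the throwaway `sleep` stands in for the deploy's children; in the real incident you'd point this at your terminal shell's PID instead of `$$`):

```shell
# List the children of a given shell (here: the current shell, $$).
# pstree -p <shell-pid> gives the same picture in tree form.
sleep 30 &                     # stand-in for deploy.sh's child processes
ps -o pid,ppid,stat,comm --ppid $$
kill "$!"                      # clean up the stand-in
```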
Now you can see the whole family. The stuck process is probably apt-get or dpkg —
waiting for a lock, waiting for a prompt, or waiting for network I/O.
# What is the stuck process doing right now?
cat /proc/14601/wchan
# → "do_wait" or "poll_schedule_timeout" or something else revealing
# What state is it in?
ps -o pid,stat,wchan,comm -p 14601
# → 14601 S poll_schedule_timeout dpkg
The stat column tells you the process state:
| State | Meaning | Can you kill it? |
|---|---|---|
| `R` | Running or runnable | Yes |
| `S` | Sleeping (waiting for something — I/O, timer, signal) | Yes |
| `D` | Uninterruptible sleep (waiting for disk/NFS I/O) | No — not even SIGKILL |
| `Z` | Zombie (exited but parent hasn't read the exit code) | Already dead — kill the parent instead |
| `T` | Stopped (SIGSTOP, or Ctrl+Z's SIGTSTP) | Yes, resume with SIGCONT |
Gotcha: If you see `D` state, you're in trouble. The process is stuck in a kernel I/O operation — usually NFS, a flaky disk, or a FUSE mount. No signal can reach it because the kernel won't deliver signals during uninterruptible I/O. You have to fix the I/O subsystem (restart NFS, fix the disk) or reboot. This is the only process state where `kill -9` genuinely does nothing.
Part 2: Talking to Processes — Signals¶
Your deploy is stuck. Time to communicate with it. The oldest Unix mechanism for this is the signal: a small numbered message that the kernel delivers to a process.
When you hit Ctrl+C, you're sending signal 2 (SIGINT). When you run kill <pid>, you're
sending signal 15 (SIGTERM). These aren't magic — they're just integers with conventions
attached.
The ones you'll actually use:
| Signal | Number | Keyboard | What it means |
|---|---|---|---|
| SIGHUP | 1 | — | "Hang up" — originally meant the modem disconnected. Now means "reload your config" for most daemons |
| SIGINT | 2 | Ctrl+C | "Interrupt" — polite request to stop |
| SIGKILL | 9 | — | "Die now" — kernel terminates the process, no handler possible |
| SIGTERM | 15 | — | "Terminate" — polite request to exit cleanly. This is the default kill signal |
| SIGSTOP | 19 | — | "Freeze" — suspend unconditionally; cannot be caught or ignored |
| SIGTSTP | 20 | Ctrl+Z | "Terminal stop" — like SIGSTOP but catchable; this is what Ctrl+Z actually sends |
| SIGCONT | 18 | — | "Continue" — resume a stopped process |
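The name↔number mappings are queryable from the shell, so you never have to memorize them:

```shell
kill -l            # list every signal name with its number
kill -l 15         # number → name: prints TERM
kill -l TERM       # name → number: prints 15
```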
Name Origin: SIGHUP stands for "Signal Hang Up" — from the days of telephone modems. When the physical phone line disconnected, the terminal driver sent SIGHUP to all processes on that terminal. Today, `nohup` literally means "immune to hangup" — it blocks SIGHUP so your process survives when you close the terminal. Many daemons reinterpret SIGHUP as "please reload your configuration file," which is why sending SIGHUP to the nginx master process reloads nginx without dropping connections.
Why Ctrl+C didn't work¶
Here's the thing most people miss: Ctrl+C doesn't send SIGINT to the process you're looking at. It sends SIGINT to the entire foreground process group.
# See what process group your deploy script is in
ps -o pid,pgid,comm -p 14523
# → 14523 14523 deploy.sh
# See all processes in that group
ps -o pid,pgid,comm -g 14523
But there's a problem: if deploy.sh launched ssh, and ssh launched something on a
remote server, the remote process is NOT in your local process group. Ctrl+C kills the local
ssh client, but the remote apt-get keeps running — headless, parentless, doing whatever
it was doing.
This is why Ctrl+C "didn't work" — the actual stuck process was remote, and killing the SSH client just disconnected you from it.
Mental Model: Think of signals like phone calls. SIGTERM is a call saying "please hang up." The process can answer and handle it gracefully — flush buffers, close connections, write a final log line. SIGKILL is the phone company cutting the line — no negotiation, no cleanup, no last words. Only two signals can't be caught or ignored: SIGKILL (9) and SIGSTOP (19). Everything else is negotiable.
The correct kill sequence¶
Never start with kill -9. Always:
# Step 1: Ask politely (SIGTERM, the default)
kill $PID
# Step 2: Wait for it to clean up (10-30 seconds depending on the service)
sleep 10
# Step 3: Check if it's still alive (kill -0 sends no signal, just checks)
kill -0 $PID 2>/dev/null && echo "still alive" || echo "dead"
# Step 4: Only if still alive, force it
kill -0 $PID 2>/dev/null && kill -9 $PID
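The same four steps, wrapped as a reusable function (a sketch: the function name and default timeout are mine, not a standard utility):

```shell
# TERM first, wait up to $2 seconds, KILL only as a last resort.
graceful_kill() {
    local pid=$1 timeout=${2:-10}
    kill "$pid" 2>/dev/null || return 0        # already gone
    for _ in $(seq "$timeout"); do
        kill -0 "$pid" 2>/dev/null || return 0 # it exited cleanly
        sleep 1
    done
    echo "PID $pid ignored SIGTERM, escalating to SIGKILL" >&2
    kill -9 "$pid" 2>/dev/null
}
```

Usage: `graceful_kill "$PID" 30`. One caveat: `kill -0` succeeds on zombies too (the process-table entry still exists), so when the target is your own child, prefer `wait` to confirm it's really gone.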
Remember: "TERM asks, KILL takes." SIGTERM (15) = polite, SIGKILL (9) = forced. Exit code = 128 + signal number, so SIGTERM death = exit 143, SIGKILL death = exit 137. When you see exit code 137 in your CI logs, that's the OOM killer or a `kill -9`.

War Story: A team had a deploy script that called `kill -9` as the very first step when restarting their Java service. The JVM never got a chance to flush write-ahead logs or close database connections cleanly. For months they had intermittent data corruption — transactions would half-complete, with money debited but not credited. The fix was changing `kill -9` to `kill -TERM`, adding a 30-second timeout, and only escalating to SIGKILL if the graceful shutdown failed.
Part 3: Bash Job Control — Background, Foreground, and Traps¶
Back to your stuck deploy. You've identified the stuck process and killed it. But now you want to prevent this from happening again. Let's look at how bash manages processes and how to write scripts that don't get stuck.
Running things in the background¶
# Run a command in the background
long-running-task &
# $! is the PID of the last backgrounded process
echo "Started task with PID $!"
# List background jobs
jobs
# → [1]+ Running long-running-task &
# Bring it back to the foreground
fg %1
# Or suspend whatever's in the foreground
# Ctrl+Z
# And resume it in the background
bg %1
Under the Hood: When you press Ctrl+Z, the terminal driver sends SIGTSTP (not SIGSTOP) to the foreground process group. SIGTSTP is catchable, but its default action is the same as SIGSTOP's: the process stops. When you type `bg`, the shell sends SIGCONT, and the process resumes running but is no longer in the foreground group. This is how job control has worked since BSD Unix added it in the early 1980s.
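You can drive the same stop/continue cycle by hand with `kill`, which is handy when the stuck process isn't attached to your terminal (the `sleep` here is a throwaway stand-in):

```shell
sleep 300 &
PID=$!
kill -STOP "$PID"; sleep 0.2
ps -o stat= -p "$PID"     # state starts with T: stopped
kill -CONT "$PID"; sleep 0.2
ps -o stat= -p "$PID"     # back to S: sleeping
kill "$PID"               # clean up the stand-in
```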
The trap that saves your deploy¶
The real fix is writing deploy scripts that clean up after themselves. The trap command
registers a function to run when a signal arrives:
#!/bin/bash
set -euo pipefail
# Track child processes we need to clean up
CHILDREN=()
cleanup() {
    echo "Cleaning up..."
    for pid in "${CHILDREN[@]}"; do
        kill "$pid" 2>/dev/null || true
    done
}

# Register cleanup for normal exit (preserving the exit code) AND signals
trap cleanup EXIT
trap 'cleanup; trap - EXIT; exit 1' SIGTERM SIGINT
# Now launch child processes and track them
ssh target-server "apt-get update" &
CHILDREN+=($!)
ssh target-server "systemctl restart app" &
CHILDREN+=($!)
# Wait for all children — THIS IS CRITICAL
# Without 'wait', the script exits immediately after backgrounding
wait
echo "Deploy complete"
Gotcha: If you run a child process in the foreground (without `&`), trap handlers don't fire until that child exits. That means if `ssh` hangs for 20 minutes, your trap handler sits dormant for 20 minutes. The pattern above — background the children with `&`, track their PIDs, and use `wait` — lets trap handlers fire while the script is waiting.

Gotcha: `set -euo pipefail` is the bash safety harness — exit on errors (`-e`), error on undefined variables (`-u`), and fail pipelines if any command fails (`-o pipefail`). But `-e` has surprising exceptions: it's silently disabled inside `if` conditions, `&&`/`||` chains, and subshells run as part of a condition. Don't rely on it as your only error handling.

Remember: The EUP mnemonic: Exit on error, Unset is error, Pipes fail properly. `set -euo pipefail` — type it at the top of every production script until it's muscle memory.
What about nohup?¶
You'll see people suggest nohup ./deploy.sh & to keep the script running after you
disconnect. It works, but barely:
# What people do
nohup ./deploy.sh > deploy.log 2>&1 &
# What they forget
# - nohup only blocks SIGHUP, nothing else
# - stdout goes to nohup.out if you don't redirect (fills disk)
# - no restart on failure
# - no resource limits
# - no log rotation
# - no dependency management
For anything you need to survive a terminal disconnect, use systemd-run or a proper
systemd service. nohup is a band-aid from the 1970s that's still in the first-aid kit
for one reason: it's one line and it works in a pinch.
Part 4: Zombies and Orphans¶
While debugging your deploy, you might have noticed something weird in ps output: a
process in state Z — a zombie.
A zombie is a process that has exited but whose parent hasn't called waitpid() to read
the exit code. The process is dead — it uses no CPU, no memory. But it occupies a slot in
the process table (just its PID and exit status — a few bytes).
Parent (running)
├── Child A (running)
├── Child B (zombie) ← exited, but parent hasn't called wait()
└── Child C (running)
Why do zombies exist?¶
Because exit codes matter. When your deploy script finishes, you want to know: did it
succeed (exit 0) or fail (exit non-zero)? The kernel holds onto that information until
someone asks for it via waitpid(). Until then, the process is a zombie — dead but not
yet reaped.
Name Origin: "Zombie process" is a genuine Unix term, not slang. It appears in the original BSD documentation. The analogy is exact: a zombie is dead but still present, lingering until the proper ritual (calling `wait()`) lays it to rest.
When zombies are a problem¶
One zombie is fine. Thousands of zombies are a problem — they exhaust the PID space. The
default maximum PID on Linux is 32,768 (check with cat /proc/sys/kernel/pid_max). If a
buggy server process forks children and never waits for them, zombie PIDs accumulate until
the system can't start new processes.
# Count zombies
ps aux | awk '$8 == "Z"' | wc -l
# Find who's creating them (the parent)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/ {print $2}' | sort | uniq -c | sort -rn
# → The PID with the most zombie children is your culprit
Gotcha: You can't kill a zombie — it's already dead. `kill -9 <zombie_pid>` does nothing because there's no process to kill, just a process table entry. You have two options: (1) kill the parent, which causes the zombies to be re-parented to PID 1, which reaps them automatically, or (2) fix the parent to call `wait()`.
Orphans: the other half of the story¶
An orphan is the opposite scenario: the parent dies while the children are still running.
When that happens, the kernel re-parents all the orphan children to PID 1 (typically
systemd), which automatically calls wait() when they eventually exit.
Before: Parent (PID 500) → Child (PID 501)
Parent dies.
After: systemd (PID 1) → Child (PID 501) ← re-parented
This is usually fine — systemd handles it gracefully. But there's a subtle danger: those
orphan children are still running, still consuming resources, still holding connections.
If your deploy script crashes mid-way, the ssh and curl processes it spawned keep
running on the remote server. They're orphans now, invisible to you, doing who-knows-what.
War Story: A team's deploy script crashed partway through, leaving 50 orphaned worker processes still running on a production server. Each one held a database connection. The pool filled, new requests started timing out, and the monitoring blamed "database slow" for three hours before someone noticed the orphans with `ss -tnp | grep postgres`.
Part 5: systemd — Process Management Done Right¶
Everything above — signals, cleanup, PID tracking, zombie reaping — is work you have to do
manually in a bash script. systemd does all of it for you, and more.
When your deploy script needs to restart a service, it should be restarting a systemd unit,
not calling kill and /usr/bin/myapp &:
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application Server
After=network.target postgresql.service
Requires=postgresql.service
[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port 8080
ExecStop=/bin/kill -TERM $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStopSec=30
# Resource limits
MemoryMax=1G
CPUQuota=200%
# Security
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/opt/myapp/data
[Install]
WantedBy=multi-user.target
This unit file replaces hundreds of lines of bash:
| What bash requires | What systemd does |
|---|---|
| `trap cleanup EXIT` and manual PID tracking | Automatic cgroup tracking — every child process is accounted for |
| `nohup` and output redirection | Journal logging — `journalctl -u myapp` |
| `while true; do ./app; sleep 5; done` for restart | `Restart=on-failure` + `RestartSec=5` |
| Manual kill sequences with timeouts | `TimeoutStopSec=30` then automatic SIGKILL |
| `ulimit` and manual cgroup setup | `MemoryMax=`, `CPUQuota=` — enforced by cgroups |
| File permission checks and sudo | `User=`, `NoNewPrivileges=`, `ProtectSystem=` |
Name Origin: systemd's "d" stands for "daemon." The word "daemon" in computing comes from Maxwell's demon — a thought experiment in thermodynamics where an imaginary being sorts fast and slow molecules. MIT programmers in the 1960s adopted the term for background processes that do work automatically, invisibly, like Maxwell's demon sorting molecules. It has nothing to do with the occult. The BSD mascot (a red daemon with a pitchfork) is a visual pun, not an etymology.
The commands you'll use every day¶
# Start / stop / restart
systemctl start myapp
systemctl stop myapp # Sends SIGTERM, waits TimeoutStopSec, then SIGKILL
systemctl restart myapp # stop + start
# Check status (shows PID, memory, recent log lines)
systemctl status myapp
# View logs
journalctl -u myapp -f # Follow (live tail)
journalctl -u myapp -n 100 # Last 100 lines
journalctl -u myapp --since "10 minutes ago"
# Reload config without restart (sends SIGHUP)
systemctl reload myapp
# See what's broken
systemctl list-units --failed
Gotcha: `systemctl enable myapp` does NOT start the service. It creates a symlink so the service starts on boot. If you want both: `systemctl enable --now myapp`. This confuses everyone at least once.

Gotcha: After editing a unit file, you MUST run `systemctl daemon-reload` before restart. systemd caches unit definitions in memory. Without the reload, your changes are invisible — systemd happily uses the old cached version, and you spend 20 minutes wondering why your config change didn't take effect.
systemd's dependency model¶
The unit file above has both `After=postgresql.service` and `Requires=postgresql.service`. These are two different things, and you almost always need both:

- `After=` controls ordering — "start me after this unit is up"
- `Requires=` controls dependency — "if this unit fails, I fail too"
| Directive | What it does | What it DOESN'T do |
|---|---|---|
| `After=X` | Start after X is ready | Doesn't make X a hard dependency — if X isn't being started, this unit starts anyway |
| `Requires=X` | If X fails, I fail | Doesn't control ordering — both might start simultaneously |
| `Wants=X` | Soft dependency — I'd like X running, but I'll survive without it | Doesn't control ordering either |
| `Requires=X` + `After=X` | Hard dependency with correct ordering | This is what you want 90% of the time |
Trivia: systemd replaced SysV init scripts — the system that used numbered shell scripts (`S01network`, `S02sshd`) to control boot order sequentially. systemd parallelizes boot using a dependency graph, which reduced boot times from 30–60 seconds to under 5 on typical systems. It's also one of the most controversial projects in Linux history — the Debian vote in 2014 nearly split the project, spawned the Devuan fork, and the arguments haven't really stopped since. Lennart Poettering, the primary author, previously created PulseAudio — also controversial. The running joke is that "systemd is an operating system" because it's absorbed init, cron (timers), syslog (journald), DNS (resolved), network config (networkd), login management (logind), and device management (udevd).
Debugging a service that won't start¶
Your deploy does systemctl restart myapp and it fails. Here's the systematic approach:
# 1. Check status — shows exit code and recent log lines
systemctl status myapp
# 2. Full logs for this unit
journalctl -u myapp -n 50
# 3. See the actual unit file (with overrides applied)
systemctl cat myapp
# 4. Check runtime values
systemctl show myapp -p Environment,WorkingDirectory,ExecStart
# 5. Try running the command manually as the service user
sudo -u appuser /opt/myapp/bin/server --port 8080
# → This often reveals missing env vars, wrong paths, or permission errors
# that don't show up in journal output
Gotcha: systemd runs services in a different execution context than your interactive shell. The service doesn't get your `~/.bashrc`, its `$PATH` might be minimal, and environment variables from your session don't exist. If the app works when you run it manually but fails under systemd, the first thing to check is the environment. Add `Environment=` or `EnvironmentFile=` to the unit file.
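In unit-file form, the two options look like this (a sketch: the paths and variable names are illustrative, not from the original unit):

```ini
[Service]
# Inline variables, space-separated KEY=value pairs:
Environment=PORT=8080 LOG_LEVEL=info
# Or a file of KEY=value lines on the server:
EnvironmentFile=/opt/myapp/.env
```

After adding either, remember `systemctl daemon-reload` before restarting.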
Part 6: Containers Change the Rules¶
Everything you've learned so far assumes a traditional Linux system with systemd as PID 1. Containers change some of the rules — and knowing which rules change is the difference between "containers just work" and debugging mysterious hangs.
PID 1 in containers is special (and dangerous)¶
In a container, the process specified in CMD or ENTRYPOINT becomes PID 1. PID 1 has
special behavior in the kernel:
- **Signals are different.** The kernel never applies default signal dispositions to PID 1 — it only delivers signals PID 1 has explicitly registered handlers for. If your app doesn't handle SIGTERM, `docker stop` sends SIGTERM, your app ignores it (because it never registered a handler), Docker waits 10 seconds, then sends SIGKILL. Every. Single. Time.
- **Zombie reaping.** PID 1 is supposed to call `wait()` on orphaned children. If your app is PID 1 and it doesn't do this (most applications don't), zombies accumulate inside the container.
# BAD — your app is PID 1, doesn't handle signals or reap zombies
FROM python:3.11-slim
CMD ["python", "server.py"]
# GOOD — tini is PID 1, handles signals and zombie reaping
FROM python:3.11-slim
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["python", "server.py"]
# ALSO GOOD — Docker's built-in init
# docker run --init myimage
Under the Hood: `tini` is a tiny (~30KB) init process purpose-built for containers. It does exactly two things: forward signals to child processes, and reap zombies. That's the entire source code — a few hundred lines of C. Docker's `--init` flag does the same thing using a bundled copy of tini.
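The signal-forwarding half of tini can be sketched in a few lines of shell. This toy version forwards TERM/INT to its child but does not reap orphans' zombies the way tini does; the short `sleep` stands in for your real server:

```shell
# Toy PID-1: start the workload, forward signals to it, propagate its status.
sleep 2 &                                     # stand-in for your real server
CHILD=$!
trap 'kill -TERM "$CHILD" 2>/dev/null' TERM INT
wait "$CHILD"
STATUS=$?
echo "workload exited with status $STATUS"
```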
Cgroups: why the OOM killer finds your container¶
When you set --memory=1g on a Docker container, you're configuring a cgroup (control
group) limit. The kernel's OOM killer watches cgroup memory usage and kills processes that
exceed their limit.
# See a container's cgroup memory limit
docker inspect --format='{{.HostConfig.Memory}}' mycontainer
# Inside the container, check your own limits
cat /sys/fs/cgroup/memory.max # cgroups v2
cat /sys/fs/cgroup/memory/memory.limit_in_bytes # cgroups v1
# See current usage
cat /sys/fs/cgroup/memory.current
When the OOM killer strikes, the process gets SIGKILL (exit code 137). Your application never sees it coming — no graceful shutdown, no cleanup, no "I'm about to die" log line.
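You can reproduce the telltale exit code locally — `kill -9` by hand rather than the OOM killer, but the arithmetic is identical:

```shell
sleep 300 &
PID=$!
kill -9 "$PID"               # what the OOM killer does to your process
wait "$PID"
STATUS=$?
echo "exit status: $STATUS"  # 128 + 9 = 137
```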
Gotcha: `docker stats` shows memory usage including the page cache, which can make it look like your container is using its entire memory limit when it's actually fine. The kernel will evict page cache under pressure. RSS (resident set size) is what you care about for OOM risk, not total usage.
PID namespaces: the container's alternate reality¶
Inside a container, your process thinks it's PID 1. On the host, it's PID 48372. This is a PID namespace — a kernel feature that gives each container its own independent PID numbering.
# Inside the container
ps aux
# → PID 1 is your app
# On the host
docker top mycontainer
# → PID 48372 is the same process
# The kernel tracks both
cat /proc/48372/status | grep NSpid
# → NSpid: 48372 1
# First number = host PID, second = container PID
This matters for debugging: if you see an OOM kill in dmesg on the host, it shows the
host PID. You need to map it back to the container to figure out what died.
Flashcard Check¶
Cover the answers and test yourself.
Q1: What two system calls does the shell use to run every command?
`fork()` and `exec()`. Fork creates a copy, exec replaces it with the new program. The parent waits with `waitpid()`.
Q2: What's the difference between SIGTERM and SIGKILL?
SIGTERM (15) can be caught and handled — the process can clean up. SIGKILL (9) can't be caught — the kernel terminates the process immediately. Always try SIGTERM first.
Q3: A process is in state D. Can you kill it?
No. `D` = uninterruptible sleep, stuck in kernel I/O. No signal is delivered. Fix the I/O subsystem (NFS, disk) or reboot. This is the only state where `kill -9` fails.
Q4: What is a zombie process and how do you get rid of it?
A process that has exited but whose parent hasn't called `wait()`. It uses only a PID slot. You can't kill it (it's already dead) — kill or fix the parent instead.
Q5: systemctl enable myapp — is the service running now?
No. `enable` only creates a boot-time symlink. Use `enable --now` to also start it immediately.
Q6: Why do you need daemon-reload after editing a unit file?
systemd caches unit definitions in memory. Without `daemon-reload`, it uses the old cached version and your changes have no effect.
Q7: Why does docker stop take 10 seconds for some containers?
Docker sends SIGTERM, but if PID 1 hasn't registered a SIGTERM handler, the signal is ignored. Docker waits 10 seconds then sends SIGKILL. Fix: use `tini` or `--init`.
Q8: Exit code 137 — what happened?
128 + 9 = SIGKILL. Either the OOM killer or someone ran `kill -9`. Exit code 143 = 128 + 15 = SIGTERM.
Exercises¶
Exercise 1: Find the stuck process (investigation)¶
You have a process tree where something is hanging. Given this ps output for every process in the tree:
PID STAT WCHAN COMM
1001 S wait deploy.sh
1002 S poll_schedule_to ssh
1003 D nfs_wait_bit_kil apt-get
1004 S poll_schedule_to curl
1005 S hrtimer_nanosle sleep
Which process is stuck? Can you kill it? What should you do?
Answer
PID 1003 (`apt-get`) is in state `D` (uninterruptible sleep) with wchan `nfs_wait_bit_kil` — it's waiting on an NFS operation. You **cannot** kill it, not even with SIGKILL. The NFS server is likely unreachable. Fix the NFS mount (check network, restart NFS server) and the process will unblock. If you can't fix NFS, the only option is a reboot.

Exercise 2: Write a safe deploy script (bash)¶
Write a bash script that:
1. SSHes to a server and runs systemctl restart myapp
2. Waits up to 60 seconds for the service to be healthy (check curl -f http://server:8080/health)
3. If the health check fails, rolls back by running systemctl restart myapp-previous
4. Cleans up properly on Ctrl+C or any error
Hint 1
Use `set -euo pipefail` and `trap cleanup EXIT` as your foundation.

Hint 2
Use a `for` loop with `sleep` for the health check, not an infinite loop. The exit code of `curl -f` tells you if the endpoint returned 2xx.

Solution
#!/bin/bash
set -euo pipefail

SERVER="target-server"
TIMEOUT=60
HEALTH_URL="http://${SERVER}:8080/health"
ROLLED_BACK=false

cleanup() {
    local exit_code=$?
    if [[ $exit_code -ne 0 ]] && [[ "$ROLLED_BACK" != "true" ]]; then
        echo "ERROR: Deploy failed (exit $exit_code), rolling back..."
        ssh "$SERVER" "systemctl restart myapp-previous" || true
        ROLLED_BACK=true
    fi
    exit "$exit_code"
}
trap cleanup EXIT

echo "Restarting myapp on $SERVER..."
ssh "$SERVER" "systemctl restart myapp"

echo "Waiting up to ${TIMEOUT}s for health check..."
for i in $(seq 1 "$TIMEOUT"); do
    if curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
        echo "Healthy after ${i}s"
        exit 0
    fi
    sleep 1
done

echo "Health check timed out after ${TIMEOUT}s"
exit 1
Exercise 3: Zombie factory (hands-on)¶
Create a zombie process on purpose, verify it exists, then clean it up.
Hint
You need a parent that forks a child but never calls `wait()`. In bash, backgrounding a process and then sleeping forever (without `wait`) does this after the child exits.

Solution
# Terminal 1: Create a zombie
bash -c '
  # Fork a child that exits immediately
  /bin/true &
  CHILD=$!
  echo "Child PID: $CHILD, Parent PID: $$"
  # Sleep without ever calling wait
  sleep 300
' &
PARENT=$!
# Wait a moment for the child to exit
sleep 2
# Verify zombie exists
ps -eo pid,ppid,stat,comm | grep Z
# → You should see the child in Z state with PPID = the bash subshell
# Clean it up by killing the parent
kill $PARENT
# Verify zombie is gone (PID 1 reaped it)
sleep 1
ps -eo pid,ppid,stat,comm | grep Z
# → Should be empty
Exercise 4: Write a systemd unit (design)¶
Your team has a Python web app that:
- Lives at /opt/webapp/app.py
- Needs PostgreSQL running first
- Should restart if it crashes, but stop retrying after 5 failures in 2 minutes
- Needs DATABASE_URL from /opt/webapp/.env
- Should not be able to write anywhere except /opt/webapp/data/
Write the systemd unit file. Don't look at the solution until you've tried.
Hint
You need `Requires=` + `After=` for PostgreSQL, `EnvironmentFile=` for the env vars, `Restart=on-failure` + `StartLimitBurst` + `StartLimitIntervalSec` for retry limits, and `ProtectSystem=strict` + `ReadWritePaths=` for filesystem isolation.

Solution
[Unit]
Description=Web Application
After=network.target postgresql.service
Requires=postgresql.service
StartLimitBurst=5
StartLimitIntervalSec=120

[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/usr/bin/python3 /opt/webapp/app.py
EnvironmentFile=/opt/webapp/.env
Restart=on-failure
RestartSec=5
ProtectSystem=strict
ReadWritePaths=/opt/webapp/data
NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
Exercise 5: The decision (think, don't code)¶
For each scenario, decide: bash with trap and wait, a systemd service, or a
container with tini? Justify your choice.
- A one-shot backup script that runs nightly via cron
- A long-running API server that must survive reboots
- A batch job that processes files from a queue and should retry on failure
- A deploy script that SSHes to 5 servers sequentially
- A microservice that runs alongside 15 others on the same host
Answers
1. **Systemd timer** (not cron). You get logging via `journalctl`, automatic cgroup resource limits, `Persistent=true` to catch up missed runs, and `RandomizedDelaySec` to avoid thundering herd. Plain cron gives you none of this.
2. **Systemd service.** `Restart=on-failure`, `WantedBy=multi-user.target` for boot, resource limits, journal logging. This is exactly what systemd was built for.
3. **Systemd service** with `Restart=on-failure` and `RestartSec=` for backoff. A container works too if you're already containerized, but the retry logic comes from the orchestrator (systemd or Kubernetes), not the container itself.
4. **Bash with `trap` and `set -euo pipefail`.** It's a one-shot script, not a long-running service. A systemd service is overkill. But the script needs proper signal handling and cleanup — this is where `trap cleanup EXIT` earns its keep.
5. **Container with `tini` (or `--init`).** 15 microservices on one host = you need isolation. Containers give you PID namespaces, cgroup limits, filesystem isolation, and independent logging. `tini` handles signals and zombie reaping inside each container.

Cheat Sheet¶
Process Investigation¶
| What you need | Command |
|---|---|
| Process tree | pstree -p <pid> |
| What a process is doing | cat /proc/<pid>/wchan |
| Process state | ps -o pid,stat,comm -p <pid> |
| All processes sorted by CPU | ps aux --sort=-%cpu \| head |
| All processes sorted by memory | ps aux --sort=-rss \| head |
| Find zombies | ps -eo pid,ppid,stat,comm \| awk '$3 ~ /Z/' |
| Find D-state (unkillable) | ps aux \| awk '$8 ~ /^D/' |
| Open files for a process | ls /proc/<pid>/fd \| wc -l |
| Deleted but open files (eating disk) | lsof +L1 |
| Kernel stack of stuck process | cat /proc/<pid>/stack (root) |
| Network connections per process | ss -tnp \| grep "pid=<pid>" |
Signals¶
| Signal | Num | Meaning | Catchable? |
|---|---|---|---|
| SIGHUP | 1 | Hangup / reload config | Yes |
| SIGINT | 2 | Interrupt (Ctrl+C) | Yes |
| SIGKILL | 9 | Force kill | No |
| SIGTERM | 15 | Graceful terminate | Yes |
| SIGSTOP | 19 | Freeze (uncatchable) | No |
| SIGTSTP | 20 | Freeze from the terminal (Ctrl+Z) | Yes |
| SIGCONT | 18 | Resume | Yes |
Exit code = 128 + signal number (137 = SIGKILL, 143 = SIGTERM)
systemd¶
| Task | Command |
|---|---|
| Start/stop/restart | systemctl {start,stop,restart} <unit> |
| Status + recent logs | systemctl status <unit> |
| Follow logs live | journalctl -u <unit> -f |
| Enable at boot + start now | systemctl enable --now <unit> |
| Reload after editing unit file | systemctl daemon-reload |
| See what's broken | systemctl list-units --failed |
| View unit with overrides | systemctl cat <unit> |
| Boot time analysis | systemd-analyze blame |
Bash Script Safety¶
#!/bin/bash
set -euo pipefail # EUP: Exit, Unset, Pipefail
trap cleanup EXIT # Always clean up
children=(); cmd & children+=($!) # Track child PIDs
wait # Let traps fire between waits
Takeaways¶

- **Every command is fork + exec + wait.** Your shell forks a copy of itself, the copy replaces itself with the command, the shell waits. When it doesn't return, something in that chain is stuck.
- **TERM first, KILL last.** SIGTERM lets the process clean up. SIGKILL doesn't. Exit code 137 = SIGKILL, 143 = SIGTERM. Starting with `kill -9` is like unplugging a server to reboot it.
- **D-state is unkillable.** If a process is in uninterruptible sleep, no signal reaches it. Fix the I/O subsystem or reboot. Everything else can be killed.
- **Zombies are harmless until they're not.** One zombie is fine. Thousands exhaust the PID space. Kill the parent, not the zombie.
- **systemd replaces hundreds of lines of bash.** Unit files give you restart logic, resource limits, dependency management, logging, and cgroup tracking — for free.
- **PID 1 in containers needs an init.** Without `tini` or `--init`, your app doesn't handle signals correctly and zombies accumulate. This causes the mysterious 10-second `docker stop` delay.
Related Lessons¶
- What Happens When You Click a Link — end-to-end trace through networking
- Connection Refused — differential diagnosis of a common error across layers