---
title: The Hanging Deploy
tags:
  - lesson
  - processes
  - signals
  - systemd
  - bash-job-control
  - containers
  - cgroups
---

# The Hanging Deploy
Topics: processes, signals, systemd, bash job control, containers, cgroups
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's 2pm on a Thursday. You ran a deploy script and it just... stopped. The terminal is frozen. No output. No error. Ctrl+C does nothing. Your colleague asks "is the deploy done?" and you don't know.
This happens more than anyone admits. And the fix depends on understanding something most people never learn properly: what a process actually is, how processes talk to each other, and what happens when they don't.
By the end of this lesson you'll understand:

- What your shell actually does when you run a command
- Why processes hang, and the systematic way to unstick them
- How signals work (and why Ctrl+C sometimes doesn't)
- What zombies are, why they exist, and why they're usually harmless
- How systemd wraps all of this, and how containers change the rules
We'll start at the bottom — what the kernel actually does when you type a command — and work our way up through signals, job control, systemd, and containers. Each layer builds on the one before it.
Part 1: What Just Happened When You Ran That Script¶
When you typed ./deploy.sh and hit enter, your shell did something beautifully simple
that's been the same since 1969: it called fork(), then exec().
# This is what your shell does internally (simplified):
pid = fork()              # Create an exact copy of the shell process
if pid == 0:              # In the child process:
    exec("./deploy.sh")   #   Replace yourself with the deploy script
else:                     # In the parent (your shell):
    waitpid(pid)          #   Sit here until the child exits
That's it. That's the whole model. Every command you've ever run in a terminal works this way.
Under the Hood:
`fork()` doesn't actually copy all the memory — it uses copy-on-write (COW). The parent and child share the same physical memory pages until one of them writes to a page, at which point the kernel copies just that page. This makes `fork()` fast even for processes using gigabytes of RAM. Redis exploits this for background saves — it forks, the child writes the snapshot to disk, and because most pages are read-only, almost no memory is copied.

Name Origin: `fork()` was adapted from Multics for the original Unix at Bell Labs. The Unix twist: the child is an exact duplicate of the parent, distinguished only by the return value. In the child, `fork()` returns 0. In the parent, it returns the child's PID. This elegant split means the child can set up redirections, change directories, or drop privileges between `fork()` and `exec()` — which is why shell I/O redirection works at all.
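You can watch `exec()` replace a process image using the shell's own `exec` builtin. A quick sketch (the echo commands are just stand-ins for real programs):

```shell
# exec replaces the current process image; code after it never runs.
OUT=$(bash -c 'echo "child PID $$ about to exec"; exec echo "same PID, new program"; echo "never reached"')
echo "$OUT"
```

The final `echo "never reached"` never prints, because by then the bash process has been replaced wholesale by `echo`.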
Why this matters for your stuck deploy¶
Your shell is sitting in waitpid() — it's blocked, waiting for the child process (your
deploy script) to exit. The script hasn't exited. That's why your terminal is frozen.
But here's the thing: deploy.sh probably isn't one process. It's a tree of processes. The
script calls ssh, which calls apt, which calls dpkg, which forks workers. Any one of
those descendants could be the one that's stuck.
Let's find out which one. Open another terminal and look at the process tree hanging off your shell — `pstree -p <shell-pid>` shows it at a glance.
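A runnable sketch of the idea (the throwaway `sleep` stands in for the deploy's children; in the real incident you'd point this at your terminal shell's PID instead of `$$`):

```shell
# List the children of a given shell (here: the current shell, $$).
# pstree -p <shell-pid> gives the same picture in tree form.
sleep 30 &                     # stand-in for deploy.sh's child processes
ps -o pid,ppid,stat,comm --ppid $$
kill "$!"                      # clean up the stand-in
```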
Now you can see the whole family. The stuck process is probably apt-get or dpkg —
waiting for a lock, waiting for a prompt, or waiting for network I/O.
# What is the stuck process doing right now?
cat /proc/14601/wchan
# → "do_wait" or "poll_schedule_timeout" or something else revealing
# What state is it in?
ps -o pid,stat,wchan,comm -p 14601
# → 14601 S poll_schedule_timeout dpkg
The stat column tells you the process state:
| State | Meaning | Can you kill it? |
|---|---|---|
| `R` | Running or runnable | Yes |
| `S` | Sleeping (waiting for something — I/O, timer, signal) | Yes |
| `D` | Uninterruptible sleep (waiting for disk/NFS I/O) | No — not even SIGKILL |
| `Z` | Zombie (exited but parent hasn't read the exit code) | Already dead — kill the parent instead |
| `T` | Stopped (SIGSTOP, or Ctrl+Z's SIGTSTP) | Yes, resume with SIGCONT |
Gotcha: If you see `D` state, you're in trouble. The process is stuck in a kernel I/O operation — usually NFS, a flaky disk, or a FUSE mount. No signal can reach it because the kernel won't deliver signals during uninterruptible I/O. You have to fix the I/O subsystem (restart NFS, fix the disk) or reboot. This is the only process state where `kill -9` genuinely does nothing.
Part 2: Talking to Processes — Signals¶
Your deploy is stuck. Time to communicate with it. The oldest Unix mechanism for this is the signal: a small numbered message that the kernel delivers to a process.
When you hit Ctrl+C, you're sending signal 2 (SIGINT). When you run kill <pid>, you're
sending signal 15 (SIGTERM). These aren't magic — they're just integers with conventions
attached.
The ones you'll actually use:
| Signal | Number | Keyboard | What it means |
|---|---|---|---|
| SIGHUP | 1 | — | "Hang up" — originally meant the modem disconnected. Now means "reload your config" for most daemons |
| SIGINT | 2 | Ctrl+C | "Interrupt" — polite request to stop |
| SIGKILL | 9 | — | "Die now" — kernel terminates the process, no handler possible |
| SIGTERM | 15 | — | "Terminate" — polite request to exit cleanly. This is the default kill signal |
| SIGSTOP | 19 | — | "Freeze" — suspend unconditionally; cannot be caught or ignored |
| SIGTSTP | 20 | Ctrl+Z | "Terminal stop" — like SIGSTOP but catchable; this is what Ctrl+Z actually sends |
| SIGCONT | 18 | — | "Continue" — resume a stopped process |
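The name↔number mappings are queryable from the shell, so you never have to memorize them:

```shell
kill -l            # list every signal name with its number
kill -l 15         # number → name: prints TERM
kill -l TERM       # name → number: prints 15
```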
Name Origin: SIGHUP stands for "Signal Hang Up" — from the days of telephone modems. When the physical phone line disconnected, the terminal driver sent SIGHUP to all processes on that terminal. Today, `nohup` literally means "immune to hangup" — it blocks SIGHUP so your process survives when you close the terminal. Many daemons reinterpret SIGHUP as "please reload your configuration file," which is why sending SIGHUP to the nginx master process reloads nginx without dropping connections.
Why Ctrl+C didn't work¶
Here's the thing most people miss: Ctrl+C doesn't send SIGINT to the process you're looking at. It sends SIGINT to the entire foreground process group.
# See what process group your deploy script is in
ps -o pid,pgid,comm -p 14523
# → 14523 14523 deploy.sh
# See all processes in that group
ps -o pid,pgid,comm -g 14523
But there's a problem: if deploy.sh launched ssh, and ssh launched something on a
remote server, the remote process is NOT in your local process group. Ctrl+C kills the local
ssh client, but the remote apt-get keeps running — headless, parentless, doing whatever
it was doing.
This is why Ctrl+C "didn't work" — the actual stuck process was remote, and killing the SSH client just disconnected you from it.
Mental Model: Think of signals like phone calls. SIGTERM is a call saying "please hang up." The process can answer and handle it gracefully — flush buffers, close connections, write a final log line. SIGKILL is the phone company cutting the line — no negotiation, no cleanup, no last words. Only two signals can't be caught or ignored: SIGKILL (9) and SIGSTOP (19). Everything else is negotiable.
The correct kill sequence¶
Never start with kill -9. Always:
# Step 1: Ask politely (SIGTERM, the default)
kill $PID
# Step 2: Wait for it to clean up (10-30 seconds depending on the service)
sleep 10
# Step 3: Check if it's still alive (kill -0 sends no signal, just checks)
kill -0 $PID 2>/dev/null && echo "still alive" || echo "dead"
# Step 4: Only if still alive, force it
kill -0 $PID 2>/dev/null && kill -9 $PID
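The same four steps, wrapped as a reusable function (a sketch: the function name and default timeout are mine, not a standard utility):

```shell
# TERM first, wait up to $2 seconds, KILL only as a last resort.
graceful_kill() {
    local pid=$1 timeout=${2:-10}
    kill "$pid" 2>/dev/null || return 0        # already gone
    for _ in $(seq "$timeout"); do
        kill -0 "$pid" 2>/dev/null || return 0 # it exited cleanly
        sleep 1
    done
    echo "PID $pid ignored SIGTERM, escalating to SIGKILL" >&2
    kill -9 "$pid" 2>/dev/null
}
```

Usage: `graceful_kill "$PID" 30`. One caveat: `kill -0` succeeds on zombies too (the process-table entry still exists), so when the target is your own child, prefer `wait` to confirm it's really gone.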
Remember: "TERM asks, KILL takes." SIGTERM (15) = polite, SIGKILL (9) = forced. Exit code = 128 + signal number, so SIGTERM death = exit 143, SIGKILL death = exit 137. When you see exit code 137 in your CI logs, that's the OOM killer or a `kill -9`.

War Story: A team had a deploy script that called `kill -9` as the very first step when restarting their Java service. The JVM never got a chance to flush write-ahead logs or close database connections cleanly. For months they had intermittent data corruption — transactions would half-complete, with money debited but not credited. The fix was changing `kill -9` to `kill -TERM`, adding a 30-second timeout, and only escalating to SIGKILL if the graceful shutdown failed.
Part 3: Bash Job Control — Background, Foreground, and Traps¶
Back to your stuck deploy. You've identified the stuck process and killed it. But now you want to prevent this from happening again. Let's look at how bash manages processes and how to write scripts that don't get stuck.
Running things in the background¶
# Run a command in the background
long-running-task &
# $! is the PID of the last backgrounded process
echo "Started task with PID $!"
# List background jobs
jobs
# → [1]+ Running long-running-task &
# Bring it back to the foreground
fg %1
# Or suspend whatever's in the foreground
# Ctrl+Z
# And resume it in the background
bg %1
Under the Hood: When you press Ctrl+Z, the terminal driver sends SIGTSTP (not SIGSTOP) to the foreground process group. SIGTSTP is catchable, but its default action is the same as SIGSTOP's: the process stops. When you type `bg`, the shell sends SIGCONT, and the process resumes running but is no longer in the foreground group. This is how job control has worked since BSD Unix added it in the early 1980s.
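You can drive the same stop/continue cycle by hand with `kill`, which is handy when the stuck process isn't attached to your terminal (the `sleep` here is a throwaway stand-in):

```shell
sleep 300 &
PID=$!
kill -STOP "$PID"; sleep 0.2
ps -o stat= -p "$PID"     # state starts with T: stopped
kill -CONT "$PID"; sleep 0.2
ps -o stat= -p "$PID"     # back to S: sleeping
kill "$PID"               # clean up the stand-in
```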
The trap that saves your deploy¶
The real fix is writing deploy scripts that clean up after themselves. The trap command
registers a function to run when a signal arrives:
#!/bin/bash
set -euo pipefail
# Track child processes we need to clean up
CHILDREN=()
cleanup() {
    echo "Cleaning up..."
    for pid in "${CHILDREN[@]}"; do
        kill "$pid" 2>/dev/null || true
    done
}

# Register cleanup for normal exit (preserving the exit code) AND signals
trap cleanup EXIT
trap 'cleanup; trap - EXIT; exit 1' SIGTERM SIGINT
# Now launch child processes and track them
ssh target-server "apt-get update" &
CHILDREN+=($!)
ssh target-server "systemctl restart app" &
CHILDREN+=($!)
# Wait for all children — THIS IS CRITICAL
# Without 'wait', the script exits immediately after backgrounding
wait
echo "Deploy complete"
Gotcha: If you run a child process in the foreground (without `&`), trap handlers don't fire until that child exits. That means if `ssh` hangs for 20 minutes, your trap handler sits dormant for 20 minutes. The pattern above — background the children with `&`, track their PIDs, and use `wait` — lets trap handlers fire while the script is waiting.

Gotcha: `set -euo pipefail` is the bash safety harness — exit on errors (`-e`), error on undefined variables (`-u`), and fail pipelines if any command fails (`-o pipefail`). But `-e` has surprising exceptions: it's silently disabled inside `if` conditions, `&&`/`||` chains, and subshells run as part of a condition. Don't rely on it as your only error handling.

Remember: The EUP mnemonic: Exit on error, Unset is error, Pipes fail properly. `set -euo pipefail` — type it at the top of every production script until it's muscle memory.
What about nohup?¶
You'll see people suggest nohup ./deploy.sh & to keep the script running after you
disconnect. It works, but barely:
# What people do
nohup ./deploy.sh > deploy.log 2>&1 &
# What they forget
# - nohup only blocks SIGHUP, nothing else
# - stdout goes to nohup.out if you don't redirect (fills disk)
# - no restart on failure
# - no resource limits
# - no log rotation
# - no dependency management
For anything you need to survive a terminal disconnect, use systemd-run or a proper
systemd service. nohup is a band-aid from the 1970s that's still in the first-aid kit
for one reason: it's one line and it works in a pinch.
Part 4: Zombies and Orphans¶
While debugging your deploy, you might have noticed something weird in ps output: a
process in state Z — a zombie.
A zombie is a process that has exited but whose parent hasn't called waitpid() to read
the exit code. The process is dead — it uses no CPU, no memory. But it occupies a slot in
the process table (just its PID and exit status — a few bytes).
Parent (running)
├── Child A (running)
├── Child B (zombie) ← exited, but parent hasn't called wait()
└── Child C (running)
Why do zombies exist?¶
Because exit codes matter. When your deploy script finishes, you want to know: did it
succeed (exit 0) or fail (exit non-zero)? The kernel holds onto that information until
someone asks for it via waitpid(). Until then, the process is a zombie — dead but not
yet reaped.
Name Origin: "Zombie process" is a genuine Unix term, not slang. It appears in the original BSD documentation. The analogy is exact: a zombie is dead but still present, lingering until the proper ritual (calling `wait()`) lays it to rest.
When zombies are a problem¶
One zombie is fine. Thousands of zombies are a problem — they exhaust the PID space. The
default maximum PID on Linux is 32,768 (check with cat /proc/sys/kernel/pid_max). If a
buggy server process forks children and never waits for them, zombie PIDs accumulate until
the system can't start new processes.
# Count zombies
ps aux | awk '$8 == "Z"' | wc -l
# Find who's creating them (the parent)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/ {print $2}' | sort | uniq -c | sort -rn
# → The PID with the most zombie children is your culprit
Gotcha: You can't kill a zombie — it's already dead. `kill -9 <zombie_pid>` does nothing because there's no process to kill, just a process table entry. You have two options: (1) kill the parent, which causes the zombies to be re-parented to PID 1, which reaps them automatically, or (2) fix the parent to call `wait()`.
Orphans: the other half of the story¶
An orphan is the opposite scenario: the parent dies while the children are still running.
When that happens, the kernel re-parents all the orphan children to PID 1 (typically
systemd), which automatically calls wait() when they eventually exit.
Before: Parent (PID 500) → Child (PID 501)
Parent dies.
After: systemd (PID 1) → Child (PID 501) ← re-parented
This is usually fine — systemd handles it gracefully. But there's a subtle danger: those
orphan children are still running, still consuming resources, still holding connections.
If your deploy script crashes mid-way, the ssh and curl processes it spawned keep
running on the remote server. They're orphans now, invisible to you, doing who-knows-what.
War Story: A team's deploy script crashed partway through, leaving 50 orphaned worker processes still running on a production server. Each one held a database connection. The pool filled, new requests started timing out, and the monitoring blamed "database slow" for three hours before someone noticed the orphans with `ss -tnp | grep postgres`.
Part 5: systemd — Process Management Done Right¶
Everything above — signals, cleanup, PID tracking, zombie reaping — is work you have to do
manually in a bash script. systemd does all of it for you, and more.
When your deploy script needs to restart a service, it should be restarting a systemd unit,
not calling kill and /usr/bin/myapp &:
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application Server
After=network.target postgresql.service
Requires=postgresql.service
[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port 8080
ExecStop=/bin/kill -TERM $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStopSec=30
# Resource limits
MemoryMax=1G
CPUQuota=200%
# Security
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/opt/myapp/data
[Install]
WantedBy=multi-user.target
This unit file replaces hundreds of lines of bash:
| What bash requires | What systemd does |
|---|---|
| `trap cleanup EXIT` and manual PID tracking | Automatic cgroup tracking — every child process is accounted for |
| `nohup` and output redirection | Journal logging — `journalctl -u myapp` |
| `while true; do ./app; sleep 5; done` for restart | `Restart=on-failure` + `RestartSec=5` |
| Manual kill sequences with timeouts | `TimeoutStopSec=30` then automatic SIGKILL |
| `ulimit` and manual cgroup setup | `MemoryMax=`, `CPUQuota=` — enforced by cgroups |
| File permission checks and sudo | `User=`, `NoNewPrivileges=`, `ProtectSystem=` |
Name Origin: systemd's "d" stands for "daemon." The word "daemon" in computing comes from Maxwell's demon — a thought experiment in thermodynamics where an imaginary being sorts fast and slow molecules. MIT programmers in the 1960s adopted the term for background processes that do work automatically, invisibly, like Maxwell's demon sorting molecules. It has nothing to do with the occult. The BSD mascot (a red daemon with a pitchfork) is a visual pun, not an etymology.
The commands you'll use every day¶
# Start / stop / restart
systemctl start myapp
systemctl stop myapp # Sends SIGTERM, waits TimeoutStopSec, then SIGKILL
systemctl restart myapp # stop + start
# Check status (shows PID, memory, recent log lines)
systemctl status myapp
# View logs
journalctl -u myapp -f # Follow (live tail)
journalctl -u myapp -n 100 # Last 100 lines
journalctl -u myapp --since "10 minutes ago"
# Reload config without restart (sends SIGHUP)
systemctl reload myapp
# See what's broken
systemctl list-units --failed
Gotcha: `systemctl enable myapp` does NOT start the service. It creates a symlink so the service starts on boot. If you want both: `systemctl enable --now myapp`. This confuses everyone at least once.

Gotcha: After editing a unit file, you MUST run `systemctl daemon-reload` before restart. systemd caches unit definitions in memory. Without the reload, your changes are invisible — systemd happily uses the old cached version, and you spend 20 minutes wondering why your config change didn't take effect.
systemd's dependency model¶
The unit file above has both `After=postgresql.service` and `Requires=postgresql.service`. These are two different things, and you almost always need both:

- `After=` controls ordering — "start me after this unit is up"
- `Requires=` controls dependency — "if this unit fails, I fail too"
| Directive | What it does | What it DOESN'T do |
|---|---|---|
| `After=X` | Start after X is ready | Doesn't make X a hard dependency — if X isn't being started, this unit starts anyway |
| `Requires=X` | If X fails, I fail | Doesn't control ordering — both might start simultaneously |
| `Wants=X` | Soft dependency — I'd like X running, but I'll survive without it | Doesn't control ordering either |
| `Requires=X` + `After=X` | Hard dependency with correct ordering | This is what you want 90% of the time |
Trivia: systemd replaced SysV init scripts — the system that used numbered shell scripts (`S01network`, `S02sshd`) to control boot order sequentially. systemd parallelizes boot using a dependency graph, which reduced boot times from 30–60 seconds to under 5 on typical systems. It's also one of the most controversial projects in Linux history — the Debian vote in 2014 nearly split the project, spawned the Devuan fork, and the arguments haven't really stopped since. Lennart Poettering, the primary author, previously created PulseAudio — also controversial. The running joke is that "systemd is an operating system" because it's absorbed init, cron (timers), syslog (journald), DNS (resolved), network config (networkd), login management (logind), and device management (udevd).
Debugging a service that won't start¶
Your deploy does systemctl restart myapp and it fails. Here's the systematic approach:
# 1. Check status — shows exit code and recent log lines
systemctl status myapp
# 2. Full logs for this unit
journalctl -u myapp -n 50
# 3. See the actual unit file (with overrides applied)
systemctl cat myapp
# 4. Check runtime values
systemctl show myapp -p Environment,WorkingDirectory,ExecStart
# 5. Try running the command manually as the service user
sudo -u appuser /opt/myapp/bin/server --port 8080
# → This often reveals missing env vars, wrong paths, or permission errors
# that don't show up in journal output
Gotcha: systemd runs services in a different execution context than your interactive shell. The service doesn't get your `~/.bashrc`, its `$PATH` might be minimal, and environment variables from your session don't exist. If the app works when you run it manually but fails under systemd, the first thing to check is the environment. Add `Environment=` or `EnvironmentFile=` to the unit file.
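In unit-file form, the two options look like this (a sketch: the paths and variable names are illustrative, not from the original unit):

```ini
[Service]
# Inline variables, space-separated KEY=value pairs:
Environment=PORT=8080 LOG_LEVEL=info
# Or a file of KEY=value lines on the server:
EnvironmentFile=/opt/myapp/.env
```

After adding either, remember `systemctl daemon-reload` before restarting.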
Part 6: Containers Change the Rules¶
Everything you've learned so far assumes a traditional Linux system with systemd as PID 1. Containers change some of the rules — and knowing which rules change is the difference between "containers just work" and debugging mysterious hangs.
PID 1 in containers is special (and dangerous)¶
In a container, the process specified in CMD or ENTRYPOINT becomes PID 1. PID 1 has
special behavior in the kernel:
- **Signals are different.** The kernel never applies default signal dispositions to PID 1 — it only delivers signals PID 1 has explicitly registered handlers for. If your app doesn't handle SIGTERM, `docker stop` sends SIGTERM, your app ignores it (because it never registered a handler), Docker waits 10 seconds, then sends SIGKILL. Every. Single. Time.
- **Zombie reaping.** PID 1 is supposed to call `wait()` on orphaned children. If your app is PID 1 and it doesn't do this (most applications don't), zombies accumulate inside the container.
# BAD — your app is PID 1, doesn't handle signals or reap zombies
FROM python:3.11-slim
CMD ["python", "server.py"]
# GOOD — tini is PID 1, handles signals and zombie reaping
FROM python:3.11-slim
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["python", "server.py"]
# ALSO GOOD — Docker's built-in init
# docker run --init myimage
Under the Hood: `tini` is a tiny (~30KB) init process purpose-built for containers. It does exactly two things: forward signals to child processes, and reap zombies. That's the entire source code — a few hundred lines of C. Docker's `--init` flag does the same thing using a bundled copy of tini.
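The signal-forwarding half of tini can be sketched in a few lines of shell. This toy version forwards TERM/INT to its child but does not reap orphans' zombies the way tini does; the short `sleep` stands in for your real server:

```shell
# Toy PID-1: start the workload, forward signals to it, propagate its status.
sleep 2 &                                     # stand-in for your real server
CHILD=$!
trap 'kill -TERM "$CHILD" 2>/dev/null' TERM INT
wait "$CHILD"
STATUS=$?
echo "workload exited with status $STATUS"
```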
Cgroups: why the OOM killer finds your container¶
When you set --memory=1g on a Docker container, you're configuring a cgroup (control
group) limit. The kernel's OOM killer watches cgroup memory usage and kills processes that
exceed their limit.
# See a container's cgroup memory limit
docker inspect --format='{{.HostConfig.Memory}}' mycontainer
# Inside the container, check your own limits
cat /sys/fs/cgroup/memory.max # cgroups v2
cat /sys/fs/cgroup/memory/memory.limit_in_bytes # cgroups v1
# See current usage
cat /sys/fs/cgroup/memory.current
When the OOM killer strikes, the process gets SIGKILL (exit code 137). Your application never sees it coming — no graceful shutdown, no cleanup, no "I'm about to die" log line.
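You can reproduce the telltale exit code locally — `kill -9` by hand rather than the OOM killer, but the arithmetic is identical:

```shell
sleep 300 &
PID=$!
kill -9 "$PID"               # what the OOM killer does to your process
wait "$PID"
STATUS=$?
echo "exit status: $STATUS"  # 128 + 9 = 137
```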
Gotcha: `docker stats` shows memory usage including the page cache, which can make it look like your container is using its entire memory limit when it's actually fine. The kernel will evict page cache under pressure. RSS (resident set size) is what you care about for OOM risk, not total usage.
PID namespaces: the container's alternate reality¶
Inside a container, your process thinks it's PID 1. On the host, it's PID 48372. This is a PID namespace — a kernel feature that gives each container its own independent PID numbering.
# Inside the container
ps aux
# → PID 1 is your app
# On the host
docker top mycontainer
# → PID 48372 is the same process
# The kernel tracks both
cat /proc/48372/status | grep NSpid
# → NSpid: 48372 1
# First number = host PID, second = container PID
This matters for debugging: if you see an OOM kill in dmesg on the host, it shows the
host PID. You need to map it back to the container to figure out what died.
Flashcard Check¶
Cover the answers and test yourself.
Q1: What two system calls does the shell use to run every command?
`fork()` and `exec()`. Fork creates a copy, exec replaces it with the new program. The parent waits with `waitpid()`.
Q2: What's the difference between SIGTERM and SIGKILL?
SIGTERM (15) can be caught and handled — the process can clean up. SIGKILL (9) can't be caught — the kernel terminates the process immediately. Always try SIGTERM first.
Q3: A process is in state D. Can you kill it?
No. `D` = uninterruptible sleep, stuck in kernel I/O. No signal is delivered. Fix the I/O subsystem (NFS, disk) or reboot. This is the only state where `kill -9` fails.
Q4: What is a zombie process and how do you get rid of it?
A process that has exited but whose parent hasn't called `wait()`. It uses only a PID slot. You can't kill it (it's already dead) — kill or fix the parent instead.
Q5: systemctl enable myapp — is the service running now?
No. `enable` only creates a boot-time symlink. Use `enable --now` to also start it immediately.
Q6: Why do you need daemon-reload after editing a unit file?
systemd caches unit definitions in memory. Without `daemon-reload`, it uses the old cached version and your changes have no effect.
Q7: Why does docker stop take 10 seconds for some containers?
Docker sends SIGTERM, but if PID 1 hasn't registered a SIGTERM handler, the signal is ignored. Docker waits 10 seconds then sends SIGKILL. Fix: use `tini` or `--init`.
Q8: Exit code 137 — what happened?
128 + 9 = SIGKILL. Either the OOM killer or someone ran `kill -9`. Exit code 143 = 128 + 15 = SIGTERM.
Exercises¶
Exercise 1: Find the stuck process (investigation)¶
You have a process tree where something is hanging. Given this ps output for every process in the tree:
PID STAT WCHAN COMM
1001 S wait deploy.sh
1002 S poll_schedule_to ssh
1003 D nfs_wait_bit_kil apt-get
1004 S poll_schedule_to curl
1005 S hrtimer_nanosle sleep
Which process is stuck? Can you kill it? What should you do?
Answer
PID 1003 (`apt-get`) is in state `D` (uninterruptible sleep) with wchan `nfs_wait_bit_kil` — it's waiting on an NFS operation. You **cannot** kill it, not even with SIGKILL. The NFS server is likely unreachable. Fix the NFS mount (check network, restart NFS server) and the process will unblock. If you can't fix NFS, the only option is a reboot.

Exercise 2: Write a safe deploy script (bash)¶
Write a bash script that:
1. SSHes to a server and runs systemctl restart myapp
2. Waits up to 60 seconds for the service to be healthy (check curl -f http://server:8080/health)
3. If the health check fails, rolls back by running systemctl restart myapp-previous
4. Cleans up properly on Ctrl+C or any error
Hint 1
Use `set -euo pipefail` and `trap cleanup EXIT` as your foundation.

Hint 2
Use a `for` loop with `sleep` for the health check, not an infinite loop. The exit code of `curl -f` tells you if the endpoint returned 2xx.

Solution
#!/bin/bash
set -euo pipefail

SERVER="target-server"
TIMEOUT=60
HEALTH_URL="http://${SERVER}:8080/health"
ROLLED_BACK=false

cleanup() {
    local exit_code=$?
    if [[ $exit_code -ne 0 ]] && [[ "$ROLLED_BACK" != "true" ]]; then
        echo "ERROR: Deploy failed (exit $exit_code), rolling back..."
        ssh "$SERVER" "systemctl restart myapp-previous" || true
        ROLLED_BACK=true
    fi
    exit "$exit_code"
}
trap cleanup EXIT

echo "Restarting myapp on $SERVER..."
ssh "$SERVER" "systemctl restart myapp"

echo "Waiting up to ${TIMEOUT}s for health check..."
for i in $(seq 1 "$TIMEOUT"); do
    if curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
        echo "Healthy after ${i}s"
        exit 0
    fi
    sleep 1
done

echo "Health check timed out after ${TIMEOUT}s"
exit 1
Exercise 3: Zombie factory (hands-on)¶
Create a zombie process on purpose, verify it exists, then clean it up.
Hint
You need a parent that forks a child but never calls `wait()`. In bash, backgrounding a process and then sleeping forever (without `wait`) does this after the child exits.

Solution
# Terminal 1: Create a zombie
bash -c '
  # Fork a child that exits immediately
  /bin/true &
  CHILD=$!
  echo "Child PID: $CHILD, Parent PID: $$"
  # Sleep without ever calling wait
  sleep 300
' &
PARENT=$!
# Wait a moment for the child to exit
sleep 2
# Verify zombie exists
ps -eo pid,ppid,stat,comm | grep Z
# → You should see the child in Z state with PPID = the bash subshell
# Clean it up by killing the parent
kill $PARENT
# Verify zombie is gone (PID 1 reaped it)
sleep 1
ps -eo pid,ppid,stat,comm | grep Z
# → Should be empty
Exercise 4: Write a systemd unit (design)¶
Your team has a Python web app that:
- Lives at /opt/webapp/app.py
- Needs PostgreSQL running first
- Should restart if it crashes, but stop retrying after 5 failures in 2 minutes
- Needs DATABASE_URL from /opt/webapp/.env
- Should not be able to write anywhere except /opt/webapp/data/
Write the systemd unit file. Don't look at the solution until you've tried.
Hint
You need `Requires=` + `After=` for PostgreSQL, `EnvironmentFile=` for the env vars, `Restart=on-failure` + `StartLimitBurst` + `StartLimitIntervalSec` for retry limits, and `ProtectSystem=strict` + `ReadWritePaths=` for filesystem isolation.

Solution
[Unit]
Description=Web Application
After=network.target postgresql.service
Requires=postgresql.service
StartLimitBurst=5
StartLimitIntervalSec=120

[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/usr/bin/python3 /opt/webapp/app.py
EnvironmentFile=/opt/webapp/.env
Restart=on-failure
RestartSec=5
ProtectSystem=strict
ReadWritePaths=/opt/webapp/data
NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
Exercise 5: The decision (think, don't code)¶
For each scenario, decide: bash with trap and wait, a systemd service, or a
container with tini? Justify your choice.
- A one-shot backup script that runs nightly via cron
- A long-running API server that must survive reboots
- A batch job that processes files from a queue and should retry on failure
- A deploy script that SSHes to 5 servers sequentially
- A microservice that runs alongside 15 others on the same host
Answers
1. **Systemd timer** (not cron). You get logging via `journalctl`, automatic cgroup resource limits, `Persistent=true` to catch up missed runs, and `RandomizedDelaySec` to avoid thundering herd. Plain cron gives you none of this.
2. **Systemd service.** `Restart=on-failure`, `WantedBy=multi-user.target` for boot, resource limits, journal logging. This is exactly what systemd was built for.
3. **Systemd service** with `Restart=on-failure` and `RestartSec=` for backoff. A container works too if you're already containerized, but the retry logic comes from the orchestrator (systemd or Kubernetes), not the container itself.
4. **Bash with `trap` and `set -euo pipefail`.** It's a one-shot script, not a long-running service. A systemd service is overkill. But the script needs proper signal handling and cleanup — this is where `trap cleanup EXIT` earns its keep.
5. **Container with `tini` (or `--init`).** 15 microservices on one host = you need isolation. Containers give you PID namespaces, cgroup limits, filesystem isolation, and independent logging. `tini` handles signals and zombie reaping inside each container.

Cheat Sheet¶
Process Investigation¶
| What you need | Command |
|---|---|
| Process tree | pstree -p <pid> |
| What a process is doing | cat /proc/<pid>/wchan |
| Process state | ps -o pid,stat,comm -p <pid> |
| All processes sorted by CPU | ps aux --sort=-%cpu \| head |
| All processes sorted by memory | ps aux --sort=-rss \| head |
| Find zombies | ps -eo pid,ppid,stat,comm \| awk '$3 ~ /Z/' |
| Find D-state (unkillable) | ps aux \| awk '$8 ~ /^D/' |
| Open files for a process | ls /proc/<pid>/fd \| wc -l |
| Deleted but open files (eating disk) | lsof +L1 |
| Kernel stack of stuck process | cat /proc/<pid>/stack (root) |
| Network connections per process | ss -tnp \| grep "pid=<pid>" |
Signals¶
| Signal | Num | Meaning | Catchable? |
|---|---|---|---|
| SIGHUP | 1 | Hangup / reload config | Yes |
| SIGINT | 2 | Interrupt (Ctrl+C) | Yes |
| SIGKILL | 9 | Force kill | No |
| SIGTERM | 15 | Graceful terminate | Yes |
| SIGSTOP | 19 | Freeze (uncatchable) | No |
| SIGTSTP | 20 | Freeze from the terminal (Ctrl+Z) | Yes |
| SIGCONT | 18 | Resume | Yes |
Exit code = 128 + signal number (137 = SIGKILL, 143 = SIGTERM)
systemd¶
| Task | Command |
|---|---|
| Start/stop/restart | systemctl {start,stop,restart} <unit> |
| Status + recent logs | systemctl status <unit> |
| Follow logs live | journalctl -u <unit> -f |
| Enable at boot + start now | systemctl enable --now <unit> |
| Reload after editing unit file | systemctl daemon-reload |
| See what's broken | systemctl list-units --failed |
| View unit with overrides | systemctl cat <unit> |
| Boot time analysis | systemd-analyze blame |
Bash Script Safety¶
#!/bin/bash
set -euo pipefail # EUP: Exit, Unset, Pipefail
trap cleanup EXIT # Always clean up
children=(); cmd & children+=($!) # Track child PIDs
wait # Let traps fire between waits
Takeaways¶

- **Every command is fork + exec + wait.** Your shell forks a copy of itself, the copy replaces itself with the command, the shell waits. When it doesn't return, something in that chain is stuck.
- **TERM first, KILL last.** SIGTERM lets the process clean up. SIGKILL doesn't. Exit code 137 = SIGKILL, 143 = SIGTERM. Starting with `kill -9` is like unplugging a server to reboot it.
- **D-state is unkillable.** If a process is in uninterruptible sleep, no signal reaches it. Fix the I/O subsystem or reboot. Everything else can be killed.
- **Zombies are harmless until they're not.** One zombie is fine. Thousands exhaust the PID space. Kill the parent, not the zombie.
- **systemd replaces hundreds of lines of bash.** Unit files give you restart logic, resource limits, dependency management, logging, and cgroup tracking — for free.
- **PID 1 in containers needs an init.** Without `tini` or `--init`, your app doesn't handle signals correctly and zombies accumulate. This causes the mysterious 10-second `docker stop` delay.
Related Lessons¶
- What Happens When You Click a Link — end-to-end trace through networking
- Connection Refused — differential diagnosis of a common error across layers