Thinking Out Loud: Process Management¶
A senior SRE's internal monologue while working through a real process management task. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation¶
A team reports that deploys to a production server are failing because the application (a Node.js service) "won't stop." The deploy script sends SIGTERM and waits 30 seconds, but the process doesn't exit. The script then sends SIGKILL, which works but causes dropped connections and incomplete transactions. I need to figure out why the graceful shutdown isn't working.
The Monologue¶
Process ignoring SIGTERM. There are a few common reasons: the process is handling SIGTERM and not actually exiting, it's stuck in an uninterruptible I/O state, or the signal is being delivered to the wrong process (e.g., a shell wrapper is catching it).
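The first of those failure modes is easy to reproduce locally with a toy stand-in process (hypothetical, unrelated to the real service): a handler that catches SIGTERM, runs, and never exits.

```shell
# Toy stand-in: catches SIGTERM, runs its handler, but never exits.
bash -c 'trap "echo handler-ran" TERM; while true; do sleep 0.2; done' &
pid=$!
sleep 0.3
kill -TERM "$pid"                                    # delivered and caught
sleep 1
kill -0 "$pid" 2>/dev/null && echo "still-running"   # handler ran, process lives on
kill -KILL "$pid" 2>/dev/null                        # only SIGKILL ends it
```

This is exactly the symptom in the ticket: the signal is delivered and handled, but nothing in the handler terminates the process.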
Let me check how the service is started.
ExecStart=/bin/bash -c 'source /opt/app/.env && exec node /opt/app/server.js'. Interesting — they're using a bash wrapper. Let me check: when systemd sends SIGTERM, it sends it to the main PID. If the main PID is bash (the wrapper) and not node, then bash receives the signal, not node. And bash with a running child will... well, it depends.
Actually, they used exec which replaces the bash process with node. So the PID should be the node process. Let me verify.
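As a quick aside, exec's replace-in-place behavior can be demonstrated with a toy wrapper (nothing here touches the real unit):

```shell
# The exec'd program keeps the wrapper shell's PID, so a signal aimed at
# the wrapper's PID lands on the real process.
bash -c 'wrapper=$BASHPID; exec bash -c "[ \$BASHPID -eq $wrapper ] && echo same-pid"'
# prints: same-pid
```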
systemctl show app-server -p MainPID
ps -p $(systemctl show app-server -p MainPID --value) -o pid,comm,args
Main PID is a node process. So the signal IS going to node. Then why isn't it shutting down?
Let me check if the process is handling SIGTERM.
SigCgt: 0000000180004002. Let me decode that bitmask. Signal 15 (SIGTERM) is bit 14 (0-indexed)... 2^14 = 16384 = 0x4000. And 0x4002 includes that. Yes, the process IS catching SIGTERM. So the app has a signal handler, but it's not exiting.
Mental Model: Signal Handler Debugging¶
A process catching a signal doesn't mean it handles it correctly. Check the SigCgt field in /proc/<pid>/status to confirm which signals are caught. Then check whether the handler actually does what you think — a common bug is registering a SIGTERM handler that starts cleanup but never calls process.exit().
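That decode can be scripted instead of done in your head (the mask value is the one read from /proc/<pid>/status above; signal N maps to bit N-1):

```shell
# Decode a SigCgt bitmask: signal N is caught if bit N-1 is set.
mask=$((0x0000000180004002))
echo $(( (mask >> (15 - 1)) & 1 ))   # SIGTERM (15): prints 1, it is caught
echo $(( (mask >> (9 - 1)) & 1 ))    # SIGKILL (9): prints 0, it can never be caught
```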
Let me look at the application code to see what the signal handler does.
Line 142: process.on('SIGTERM', () => { logger.info('Shutting down...'); server.close(); }). They call server.close() which stops accepting new connections. But server.close() waits for existing connections to finish. If there are long-lived connections (WebSockets, keep-alive HTTP), this can wait forever.
Let me check if there are long-lived connections keeping the server open.
ss -tnp | grep $(systemctl show app-server -p MainPID --value) | wc -l
ss -tnp | grep $(systemctl show app-server -p MainPID --value) | head -10
247 established connections. Most of them are in ESTABLISHED state with various client IPs. These are keep-alive HTTP connections from the load balancer health checks and long-polling clients. server.close() is waiting for all 247 connections to close, and they won't close until the clients disconnect or the keep-alive timeout expires.
Mental Model: Graceful Shutdown = Close + Timeout + Force¶
A proper graceful shutdown has three phases: (1) stop accepting new connections, (2) wait for in-flight requests to complete with a TIMEOUT, (3) force-close any remaining connections and exit. Most implementations only do step 1 and wait forever. You need the timeout to prevent exactly this situation.
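The same three phases can be sketched as a generic shell helper (hypothetical, for illustration; for a systemd-managed unit you would lean on systemctl stop and TimeoutStopSec instead):

```shell
# Three-phase stop: request (SIGTERM), bounded wait, force (SIGKILL).
stop_with_timeout() {
  pid=$1; timeout_s=$2
  kill -TERM "$pid" 2>/dev/null || return 0   # phase 1: request shutdown
  i=0
  while [ "$i" -lt "$timeout_s" ]; do         # phase 2: wait, but with a bound
    kill -0 "$pid" 2>/dev/null || return 0    # exited gracefully
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null               # phase 3: force
  return 1                                    # signal that force was needed
}
```

Calling stop_with_timeout "$PID" 90 mirrors what systemd does with TimeoutStopSec=90s.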
The fix is to add a shutdown timeout to the application. But I can't modify the code right now during a deploy window. Let me fix the deploy script first, then work with the dev team on the code.
For the deploy script, the SIGTERM + wait 30s + SIGKILL approach is actually correct in shape, but 30 seconds may not be enough if the connections' keep-alive timeout is longer than that. Let me check.
server.keepAliveTimeout = 65000 — 65 seconds. So the connections won't close for at least 65 seconds after the last request. And the deploy script gives up after 30 seconds. The process is trying to shut down gracefully, but the timeout in the deploy script is shorter than the connection timeout.
Quick fix: bump the deploy script timeout to 90 seconds.
Better fix: add a forced-close timeout in the application's SIGTERM handler.
Let me do both. First, the deploy script.
The script sends SIGTERM, sleeps 30, then sends SIGKILL. Let me also check the systemd TimeoutStopSec.
TimeoutStopSec=90s. Systemd will wait 90 seconds after SIGTERM before sending SIGKILL. But the deploy script sends its own SIGKILL at 30 seconds, bypassing systemd's graceful timeout. The deploy script should use systemctl stop instead of sending signals directly.
kill -TERM $PID && sleep 30 && kill -9 $PID. Yeah, they're killing the process directly instead of using systemd. This means systemd doesn't know the service was stopped intentionally, and it might try to restart it (if Restart=always). Plus they're bypassing systemd's own timeout handling.
The proper fix for the deploy script is to replace the manual kill sequence with systemctl stop app-server.
Systemd will send SIGTERM, wait TimeoutStopSec (90s), then SIGKILL. This is exactly the behavior they want, and it's already configured.
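A sketch of the deploy script's stop step rewritten to defer to systemd (unit name taken from this investigation; guarded so it is a no-op on hosts where the unit is not active):

```shell
# Let systemd own the stop sequence: SIGTERM, wait TimeoutStopSec (90s here),
# then SIGKILL. No manual kill, so restart tracking stays consistent.
unit="app-server"
if command -v systemctl >/dev/null 2>&1 && systemctl is-active --quiet "$unit" 2>/dev/null; then
  systemctl stop "$unit"
fi
echo "stop step finished for $unit"
```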
But I also want to fix the application to shut down faster. Let me propose a code change:
// Add timeout to graceful shutdown
process.on('SIGTERM', () => {
  logger.info('Shutting down...');
  server.close(() => {
    process.exit(0);
  });
  // Force exit after 30 seconds if connections don't close
  setTimeout(() => {
    logger.warn('Forcing shutdown after timeout');
    process.exit(1);
  }, 30000).unref();
});
That gives existing requests 30 seconds to complete, then force-exits. Combined with systemd's 90-second timeout as a backstop, this is robust.
Let me also lower the keep-alive timeout to something more reasonable.
For now, let me fix the deploy by switching to systemctl stop and verify it works.
Stopped in... 68 seconds. That's the keep-alive timeout draining. With the code fix, this would be 30 seconds. Without the code fix, 68 seconds is acceptable and the deploy will work.
What Made This Senior-Level¶
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Add more SIGKILLs or shorter sleep times | Investigate WHY the process isn't exiting on SIGTERM | SIGKILL is always a symptom, never a fix — it means the graceful path is broken |
| Not check if the signal is actually being caught | Read /proc/<pid>/status SigCgt bitmask to verify signal handling | This immediately narrows the problem from "signal not received" to "handler not exiting" |
| Not notice the deploy script bypasses systemd | Recognize that manual kill commands bypass systemd's stop logic | Using systemctl stop gives you systemd's built-in timeout, restart tracking, and proper lifecycle management |
| Only fix the deploy script timeout | Fix both the deploy script AND propose the application-level shutdown timeout | Defense in depth: the app should handle its own shutdown, with systemd as a backstop |
Key Heuristics Used¶
- Graceful Shutdown = Stop + Timeout + Force: A shutdown handler that waits indefinitely for connections to drain is not graceful — it's a hang. Always add a timeout.
- Use systemctl stop, Not kill: When a service is managed by systemd, use systemd to stop it. Manual signals bypass lifecycle management, restart tracking, and configured timeouts.
- Check the Signal Path: Verify the signal is caught (SigCgt), delivered to the right PID (check for shell wrappers), and that the handler actually calls exit.
Cross-References¶
- Primer — Linux signals, process lifecycle, and how signal delivery works
- Street Ops — Process inspection commands and signal debugging techniques
- Footguns — Shell wrappers absorbing signals, server.close() waiting forever, and manual kill bypassing systemd