# Thinking Out Loud: Linux Ops — systemd
A senior SRE's internal monologue while working through a real systemd task. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
A critical background worker service keeps dying and restarting every 3-4 minutes. systemd restarts it each time, but the restarts are causing lost work (each restart drops the in-progress queue items). I need to figure out why it's dying and fix it — without just cranking up the restart limit.
## The Monologue
Service keeps dying and restarting. Let me check the current state and recent restart pattern.
Active (running) since 2 minutes ago. Main PID: 14832. "Start request repeated too quickly" — wait, no, that's not showing. But the restart count... let me check the journal for the failure pattern.
```shell
journalctl -u worker-processor --since "1 hour ago" --no-pager | grep -E "Started|Stopped|Main process exited"
```
Six restarts in the last hour. Main process exited with code=killed, signal=KILL. SIGKILL, not SIGTERM. The process isn't crashing on its own — something is forcibly killing it. That's usually the OOM killer or systemd itself (via a resource limit).
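Quick sanity check on what code=killed, signal=KILL actually means. The shell encodes death-by-signal as 128 + signal number, while systemd sees the raw signal and logs it symbolically. A self-contained demo, nothing specific to this service:

```shell
# A SIGKILL'd child is reported by the shell as exit status 128 + 9 = 137;
# systemd sees the signal directly and logs it as code=killed, signal=KILL.
sleep 30 &
pid=$!
kill -KILL "$pid"
wait "$pid"
echo "exit status: $?"   # 137
```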
"worker-processor.service: A process of this unit has been killed by the OOM killer." There it is. The OOM killer is targeting this service. But let me check — is it the system-wide OOM killer or the cgroup-scoped one?
MemoryMax=512M. There's a cgroup memory limit of 512MB. When the process exceeds it, the cgroup OOM killer fires and kills it. This is different from the system-wide OOM killer — this is scoped specifically to this service.
### Mental Model: Cgroup OOM vs System OOM
systemd services can have cgroup memory limits (MemoryMax). When a process exceeds its cgroup limit, the cgroup OOM killer fires immediately — it doesn't wait for system-wide memory pressure. This is often surprising because the system has plenty of free RAM, but the service still gets OOM-killed. Check MemoryMax before investigating system memory.
Is 512MB enough for this service? Let me check what it actually uses.
MemoryCurrent=487M. It's at 487MB out of 512MB right now, and it just restarted 2 minutes ago. It's consuming memory fast. Let me watch it climb.
490... 495... 501... it's climbing about 3MB every 5 seconds right now, so it'll hit 512MB in under 20 seconds. The climb is steeper right after a restart — it was already at 487MB two minutes in — which is how the whole cycle works out to the 3-4 minute restart pattern.
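The back-of-the-envelope math on that, as a quick sketch:

```shell
# How long until the cgroup limit, at the growth rate just observed?
current_mb=501
limit_mb=512
growth_mb=3      # MB gained per interval
interval_s=5
echo "~$(( (limit_mb - current_mb) * interval_s / growth_mb ))s until MemoryMax"   # ~18s
```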
This is a memory leak. The process isn't freeing memory as it processes queue items. But the question is: should I increase the limit or fix the leak?
For now, I need to stop the restart-kill-restart cycle because each restart drops queue items. Let me increase the limit temporarily while I investigate the leak.
I'll add a drop-in override:
```shell
sudo mkdir -p /etc/systemd/system/worker-processor.service.d/
cat <<'EOF' | sudo tee /etc/systemd/system/worker-processor.service.d/memory-override.conf
[Service]
MemoryMax=2G
MemoryHigh=1536M
EOF
sudo systemctl daemon-reload
sudo systemctl restart worker-processor
```
I set MemoryMax to 2GB (hard kill limit) and MemoryHigh to 1.5GB (soft throttling limit). When the process hits 1.5GB, the kernel will slow down its memory allocations rather than killing it. The 2GB hard limit is a safety net.
### Mental Model: MemoryHigh vs MemoryMax
MemoryHigh is a soft limit — the kernel throttles the process (slows it down) when it exceeds this, but doesn't kill it. MemoryMax is a hard kill limit. Using both gives you a graduated response: the process slows down at MemoryHigh, giving you time to notice, and only gets killed at MemoryMax if it completely runs away. Always set MemoryHigh below MemoryMax.
Now let me figure out the actual leak. The service is a Python worker. Let me check what it does.
ExecStart=/opt/worker/venv/bin/python3 /opt/worker/processor.py. Let me look at the process and its memory layout.
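One way to pull that summary is the kernel's per-process rollup (assumes /proc/&lt;pid&gt;/smaps_rollup, kernel 4.14+). I'm demoing against the current shell's own PID so the snippet runs anywhere; on the real host you'd resolve the service's MainPID as in the comment:

```shell
# Summarize resident and private-dirty memory for a process.
# Stand-in: this shell's own PID; for the real service you'd use:
#   pid=$(systemctl show -p MainPID --value worker-processor)
pid=$$
grep -E '^(Rss|Private_Dirty):' "/proc/${pid}/smaps_rollup"
```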
Rss: 520MB, Private_Dirty: 495MB. Almost all the memory is private dirty pages — heap allocations. This is consistent with a Python application accumulating objects in memory.
Let me check if there's a simple cause — like a growing list or dict that's never cleaned up.
```shell
sudo gdb -batch -ex 'call PyRun_SimpleString("import gc; gc.collect(); print(len(gc.get_objects()))")' -p $(systemctl show -p MainPID worker-processor --value) 2>/dev/null
```
Actually, that's too invasive for a production process. Let me look at the code instead.
There it is — line 23: processed_items = [] at module level, and line 45: processed_items.append(item.id). They're appending every processed item ID to a module-level list and never clearing it. Classic Python memory leak — the list grows forever.
The fix is simple: either clear the list periodically, use a fixed-size deque, or just remove the tracking if it's not needed. But I can't modify the code in production right now. Let me add a systemd restart schedule as a pragmatic workaround.
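For the PR, the bounded-deque version is what I'd sketch (processed_items is the variable from processor.py; the maxlen of 10,000 is an illustrative guess, not a measured requirement):

```shell
# Sketch of the code fix: collections.deque with maxlen bounds the memory.
# Once full, each append evicts the oldest ID instead of growing forever.
python3 - <<'EOF'
from collections import deque

processed_items = deque(maxlen=10_000)   # was: processed_items = []
for item_id in range(50_000):            # simulate 50k processed items
    processed_items.append(item_id)      # was: grows without bound

print(len(processed_items))   # 10000 (capped)
print(processed_items[0])     # 40000 (oldest retained ID)
EOF
```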
```shell
cat <<'EOF' | sudo tee /etc/systemd/system/worker-processor.service.d/restart-policy.conf
[Service]
RuntimeMaxSec=3600
Restart=always
RestartSec=5
EOF
sudo systemctl daemon-reload
```
RuntimeMaxSec=3600 tells systemd to terminate the service after an hour of runtime — with SIGTERM, so it can shut down cleanly — and Restart=always brings it back up 5 seconds later. That keeps the memory from growing unbounded. Combined with the 2GB hard limit, we have two safety nets. But the real fix is the code change — I'll file a PR.
Let me verify the service is running healthily with the new limits.
```shell
systemctl status worker-processor
systemctl show worker-processor -p MemoryCurrent,MemoryMax,MemoryHigh,RuntimeMaxUSec
```
Running, 180MB current usage (just restarted), limits in place. Good. Let me also check the queue to make sure no items were lost during the last restart.
```shell
sudo -u worker /opt/worker/venv/bin/python3 -c "from processor import queue; print(f'Queue depth: {queue.qsize()}')"
```
Queue is processing. We're stable for now. I'll send the dev team the fix for the growing list and we can remove the RuntimeMaxSec once the leak is fixed.
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Increase MemoryMax and call it done | Increase MemoryMax as a stopgap, then investigate the memory leak | Increasing the limit without understanding the leak just buys time — the leak will eventually hit any limit |
| Not know about MemoryHigh | Use MemoryHigh for soft throttling and MemoryMax as a hard kill limit | Graduated response gives warning before catastrophic kill |
| Not think about RuntimeMaxSec | Add periodic restart as a second safety net for the memory leak | Proactive restarts on a known-leaky service are better than OOM kills |
| Just look at the OOM kill and assume the system is out of memory | Check cgroup limits first, then system memory | Cgroup OOM is scoped — the system can have plenty of RAM while the service is over its cgroup limit |
## Key Heuristics Used
- SIGKILL on a Service = Check Cgroup Limits First: If a systemd service is being SIGKILL'd, check MemoryMax and other cgroup resource limits before investigating system-wide issues.
- MemoryHigh + MemoryMax = Graduated Response: Use both for a soft throttle (MemoryHigh) and hard kill (MemoryMax). Never use just MemoryMax alone.
- RuntimeMaxSec for Leaky Services: If a service leaks memory, add periodic restarts via RuntimeMaxSec as a safety net while the code fix is developed.
## Cross-References
- Primer — systemd service configuration, cgroup resource controls, and unit file anatomy
- Street Ops — systemd debugging commands, journal analysis, and resource limit configuration
- Footguns — Cgroup OOM killing with free system RAM and increasing limits without fixing leaks