# Thinking Out Loud: Linux Ops — systemd
A senior SRE's internal monologue while working through a real systemd task. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
A critical background worker service keeps dying and restarting every 3-4 minutes. systemd restarts it each time, but the restarts are causing lost work (each restart drops the in-progress queue items). I need to figure out why it's dying and fix it — without just cranking up the restart limit.
## The Monologue
Service keeps dying and restarting. Let me check the current state and recent restart pattern.
Active (running) since 2 minutes ago. Main PID: 14832. "Start request repeated too quickly" — wait, no, that's not showing. But the restart count... let me check the journal for the failure pattern.
```shell
journalctl -u worker-processor --since "1 hour ago" --no-pager | grep -E "Started|Stopped|Main process exited"
```
Six restarts in the last hour. Main process exited with code=killed, signal=KILL. SIGKILL, not SIGTERM. The process isn't crashing on its own — something is forcibly killing it. That's usually the OOM killer or systemd itself (via a resource limit).
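Quick sanity check on what code=killed, signal=KILL actually means. The shell encodes death-by-signal as 128 + signal number, while systemd sees the raw signal and logs it symbolically. A self-contained demo, nothing specific to this service:

```shell
# A SIGKILL'd child is reported by the shell as exit status 128 + 9 = 137;
# systemd sees the signal directly and logs it as code=killed, signal=KILL.
sleep 30 &
pid=$!
kill -KILL "$pid"
wait "$pid"
echo "exit status: $?"   # 137
```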
"worker-processor.service: A process of this unit has been killed by the OOM killer." There it is. The OOM killer is targeting this service. But let me check — is it the system-wide OOM killer or the cgroup-scoped one?
MemoryMax=512M. There's a cgroup memory limit of 512MB. When the process exceeds it, the cgroup OOM killer fires and kills it. This is different from the system-wide OOM killer — this is scoped specifically to this service.
### Mental Model: Cgroup OOM vs System OOM
systemd services can have cgroup memory limits (MemoryMax). When a process exceeds its cgroup limit, the cgroup OOM killer fires immediately — it doesn't wait for system-wide memory pressure. This is often surprising because the system has plenty of free RAM, but the service still gets OOM-killed. Check MemoryMax before investigating system memory.
Is 512MB enough for this service? Let me check what it actually uses.
MemoryCurrent=487M. It's at 487MB out of 512MB right now, and it just restarted 2 minutes ago. It's consuming memory fast. Let me watch it climb.
490... 495... 501... it's climbing about 3MB every 5 seconds right now, so it'll hit 512MB in under 20 seconds. The climb is steeper right after a restart — it was already at 487MB two minutes in — which is how the whole cycle works out to the 3-4 minute restart pattern.
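The back-of-the-envelope math on that, as a quick sketch:

```shell
# How long until the cgroup limit, at the growth rate just observed?
current_mb=501
limit_mb=512
growth_mb=3      # MB gained per interval
interval_s=5
echo "~$(( (limit_mb - current_mb) * interval_s / growth_mb ))s until MemoryMax"   # ~18s
```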
This is a memory leak. The process isn't freeing memory as it processes queue items. But the question is: should I increase the limit or fix the leak?
For now, I need to stop the restart-kill-restart cycle because each restart drops queue items. Let me increase the limit temporarily while I investigate the leak.
I'll add a drop-in override:
```shell
sudo mkdir -p /etc/systemd/system/worker-processor.service.d/
cat <<'EOF' | sudo tee /etc/systemd/system/worker-processor.service.d/memory-override.conf
[Service]
MemoryMax=2G
MemoryHigh=1536M
EOF
sudo systemctl daemon-reload
sudo systemctl restart worker-processor
```
I set MemoryMax to 2GB (hard kill limit) and MemoryHigh to 1.5GB (soft throttling limit). When the process hits 1.5GB, the kernel will slow down its memory allocations rather than killing it. The 2GB hard limit is a safety net.
### Mental Model: MemoryHigh vs MemoryMax
MemoryHigh is a soft limit — the kernel throttles the process (slows it down) when it exceeds this, but doesn't kill it. MemoryMax is a hard kill limit. Using both gives you a graduated response: the process slows down at MemoryHigh, giving you time to notice, and only gets killed at MemoryMax if it completely runs away. Always set MemoryHigh below MemoryMax.
Now let me figure out the actual leak. The service is a Python worker. Let me check what it does.
ExecStart=/opt/worker/venv/bin/python3 /opt/worker/processor.py. Let me look at the process and its memory layout.
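One way to pull that summary is the kernel's per-process rollup (assumes /proc/&lt;pid&gt;/smaps_rollup, kernel 4.14+). I'm demoing against the current shell's own PID so the snippet runs anywhere; on the real host you'd resolve the service's MainPID as in the comment:

```shell
# Summarize resident and private-dirty memory for a process.
# Stand-in: this shell's own PID; for the real service you'd use:
#   pid=$(systemctl show -p MainPID --value worker-processor)
pid=$$
grep -E '^(Rss|Private_Dirty):' "/proc/${pid}/smaps_rollup"
```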
Rss: 520MB, Private_Dirty: 495MB. Almost all the memory is private dirty pages — heap allocations. This is consistent with a Python application accumulating objects in memory.
Let me check if there's a simple cause — like a growing list or dict that's never cleaned up.
```shell
sudo gdb -batch -ex 'call PyRun_SimpleString("import gc; gc.collect(); print(len(gc.get_objects()))")' -p $(systemctl show -p MainPID worker-processor --value) 2>/dev/null
```
Actually, that's too invasive for a production process. Let me look at the code instead.
There it is — line 23: processed_items = [] at module level, and line 45: processed_items.append(item.id). They're appending every processed item ID to a module-level list and never clearing it. Classic Python memory leak — the list grows forever.
The fix is simple: either clear the list periodically, use a fixed-size deque, or just remove the tracking if it's not needed. But I can't modify the code in production right now. Let me add a systemd restart schedule as a pragmatic workaround.
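For the PR, the bounded-deque version is what I'd sketch (processed_items is the variable from processor.py; the maxlen of 10,000 is an illustrative guess, not a measured requirement):

```shell
# Sketch of the code fix: collections.deque with maxlen bounds the memory.
# Once full, each append evicts the oldest ID instead of growing forever.
python3 - <<'EOF'
from collections import deque

processed_items = deque(maxlen=10_000)   # was: processed_items = []
for item_id in range(50_000):            # simulate 50k processed items
    processed_items.append(item_id)      # was: grows without bound

print(len(processed_items))   # 10000 (capped)
print(processed_items[0])     # 40000 (oldest retained ID)
EOF
```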
```shell
cat <<'EOF' | sudo tee /etc/systemd/system/worker-processor.service.d/restart-policy.conf
[Service]
RuntimeMaxSec=3600
Restart=always
RestartSec=5
EOF
sudo systemctl daemon-reload
```
RuntimeMaxSec=3600 tells systemd to terminate the service after an hour of runtime — with SIGTERM, so it can shut down cleanly — and Restart=always brings it back up 5 seconds later. That keeps the memory from growing unbounded. Combined with the 2GB hard limit, we have two safety nets. But the real fix is the code change — I'll file a PR.
Let me verify the service is running healthily with the new limits.
```shell
systemctl status worker-processor
systemctl show worker-processor -p MemoryCurrent,MemoryMax,MemoryHigh,RuntimeMaxUSec
```
Running, 180MB current usage (just restarted), limits in place. Good. Let me also check the queue to make sure no items were lost during the last restart.
```shell
sudo -u worker /opt/worker/venv/bin/python3 -c "from processor import queue; print(f'Queue depth: {queue.qsize()}')"
```
Queue is processing. We're stable for now. I'll send the dev team the fix for the growing list and we can remove the RuntimeMaxSec once the leak is fixed.
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Increase MemoryMax and call it done | Increase MemoryMax as a stopgap, then investigate the memory leak | Increasing the limit without understanding the leak just buys time — the leak will eventually hit any limit |
| Not know about MemoryHigh | Use MemoryHigh for soft throttling and MemoryMax as a hard kill limit | Graduated response gives warning before catastrophic kill |
| Not think about RuntimeMaxSec | Add periodic restart as a second safety net for the memory leak | Proactive restarts on a known-leaky service are better than OOM kills |
| Just look at the OOM kill and assume the system is out of memory | Check cgroup limits first, then system memory | Cgroup OOM is scoped — the system can have plenty of RAM while the service is over its cgroup limit |
## Key Heuristics Used
- SIGKILL on a Service = Check Cgroup Limits First: If a systemd service is being SIGKILL'd, check MemoryMax and other cgroup resource limits before investigating system-wide issues.
- MemoryHigh + MemoryMax = Graduated Response: Use both for a soft throttle (MemoryHigh) and hard kill (MemoryMax). Never use just MemoryMax alone.
- RuntimeMaxSec for Leaky Services: If a service leaks memory, add periodic restarts via RuntimeMaxSec as a safety net while the code fix is developed.
## Cross-References
- Primer — systemd service configuration, cgroup resource controls, and unit file anatomy
- Street Ops — systemd debugging commands, journal analysis, and resource limit configuration
- Footguns — Cgroup OOM killing with free system RAM and increasing limits without fixing leaks