systemd: The Init System You Can't Avoid
- lesson
- systemd
- unit-files
- journald
- timers
- cgroups
- socket-activation
- security-hardening

---

# systemd — The Init System You Can't Avoid

- Topics: systemd, unit files, journald, timers, cgroups, socket activation, security hardening
- Level: L1–L2 (Foundations → Operations)
- Time: 60–90 minutes
- Prerequisites: None (everything is explained from scratch)
## The Mission
It's 7:14 AM. You're not even at your desk yet when PagerDuty fires: payment-processor
is crash-looping. The service restarts, runs for 45 seconds, dies, restarts, runs for 45
seconds, dies. Payments are failing. The on-call engineer from the night shift left a note:
"restarted it three times, seemed fine each time, went back to sleep."
Your job: figure out why it's crash-looping, stop the bleeding, and harden the service so this doesn't happen at 7 AM again. Along the way, you're going to learn systemd deeper than most engineers ever go — unit types, dependency ordering, journald forensics, timer units, socket activation, resource controls, and security sandboxing.
## Part 1: The First 60 Seconds — Reading the Wreckage
Two commands before anything else: `systemctl status payment-processor`, then `journalctl -u payment-processor`. Status first:
● payment-processor.service - Payment Processing Worker
Loaded: loaded (/etc/systemd/system/payment-processor.service; enabled)
Active: activating (auto-restart) (Result: exit-code)
Process: 28491 ExecStart=/opt/payments/bin/processor --config /etc/payments/app.conf (code=exited, status=1/FAILURE)
Main PID: 28491 (code=exited, status=1/FAILURE)
CPU: 892ms
Three things jump out:
| What you see | What it means |
|---|---|
| `activating (auto-restart)` | systemd is in the delay between crash and next restart |
| `code=exited, status=1/FAILURE` | Process exited with code 1: not killed, it chose to exit |
| `CPU: 892ms` | Under a second of CPU time: the process spent its short life mostly waiting, not working |
Now the logs:
A repeating pattern every ~50 seconds:
07:13:22 payment-processor[28344]: Starting payment processor v2.4.1
07:13:22 payment-processor[28344]: Connecting to database at db-primary.internal:5432
07:13:22 payment-processor[28344]: Connected. Loading payment queue...
07:14:07 payment-processor[28344]: FATAL: database connection lost: SSL handshake timeout
07:14:07 systemd[1]: payment-processor.service: Main process exited, code=exited, status=1/FAILURE
07:14:07 systemd[1]: payment-processor.service: Failed with result 'exit-code'.
07:14:12 systemd[1]: payment-processor.service: Scheduled restart job, restart counter is at 14.
The service connects to the database, runs for 45 seconds, then the connection drops with an SSL timeout. systemd restarts it 5 seconds later, and the cycle repeats.
Mental Model: When debugging a restart loop, your first question is always: is the service crashing (exit code 1), being killed (signal 9/SIGKILL), or timing out? Exit code 1 = the application decided to die (check app logs). SIGKILL = something external killed it (check the OOM killer and `MemoryMax`). Timeout = the process isn't stopping cleanly (check `TimeoutStopSec`).
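The triage in this mental model can be scripted. The following is a minimal sketch, not an official tool: `classify_result` is a hypothetical helper that maps the value of a unit's `Result` property (as printed by `systemctl show`) to a first thing to check. The suggested next steps are this lesson's rules of thumb.

```shell
# classify_result: hypothetical helper mapping a systemd Result= value
# to a first debugging step. Pure string logic, so it runs anywhere.
classify_result() {
  case "$1" in
    success)          echo "clean exit: nothing failed" ;;
    exit-code)        echo "app exited non-zero: read the application logs" ;;
    signal|core-dump) echo "killed by a signal: find out who sent it" ;;
    oom-kill)         echo "cgroup OOM kill: check MemoryMax and memory usage" ;;
    timeout)          echo "timed out: check TimeoutStartSec and TimeoutStopSec" ;;
    start-limit-hit)  echo "restart rate limit hit: fix the cause, then reset-failed" ;;
    *)                echo "unrecognized result: $1" ;;
  esac
}

# The crash loop in this lesson reports Result: exit-code
classify_result exit-code
```

On a live host the input would come from systemd itself, for example `classify_result "$(systemctl show payment-processor -p Result --value)"`.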
The current unit file:
# /etc/systemd/system/payment-processor.service
[Unit]
Description=Payment Processing Worker
After=network.target
[Service]
Type=simple
ExecStart=/opt/payments/bin/processor --config /etc/payments/app.conf
Restart=always
RestartSec=5
User=payments
Group=payments
[Install]
WantedBy=multi-user.target
This unit file has problems. We'll fix them all by the end.
## Part 2: Stop the Bleeding
The database team confirms: they're rotating SSL certificates on db-primary. Connections
with the old cert are getting killed. The new cert will be ready in 20 minutes.
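Stop the service with `sudo systemctl stop payment-processor` until the new certificate is in place. The stop sticks even though the service has `Restart=always`: `systemctl stop` is an explicit admin action, and systemd distinguishes "the process crashed" (triggers restart) from "an admin said stop" (obeys).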
Gotcha: There is a case where stopping a service doesn't stick: socket activation. If `payment-processor.socket` exists, any incoming connection re-triggers the service. Always check: `systemctl list-units 'payment-processor.*'`
## Part 3: Unit Types — Everything Is a Unit

| Unit type | Suffix | What it does | Example |
|---|---|---|---|
| service | `.service` | A process or group of processes | `nginx.service` |
| socket | `.socket` | An IPC or network socket | `cups.socket` |
| timer | `.timer` | Triggers a service on a schedule | `logrotate.timer` |
| mount | `.mount` | A filesystem mount point | `var-log.mount` |
| target | `.target` | A group of units (like a runlevel) | `multi-user.target` |
| slice | `.slice` | A cgroup resource boundary | `user.slice` |
For daily ops you'll use services (90%), timers (replacing cron), and occasionally sockets.
Where unit files live matters:
/etc/systemd/system/ → Admin overrides (highest priority)
/run/systemd/system/ → Runtime/transient units (ephemeral)
/usr/lib/systemd/system/ → Vendor defaults (lowest priority)
Remember: Priority mnemonic: ERC, for Etc, Run, usr/lib (Core). Never edit files in `/usr/lib/`, because package updates overwrite them. Use `systemctl edit <unit>`.
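`systemctl edit <unit>` does not touch the vendor file; it writes a drop-in named `override.conf` under `/etc/systemd/system/<unit>.d/`. As a rough sketch of what such a drop-in looks like, the snippet below stages one in a scratch directory so it can be reviewed first. The scratch directory and the `RestartSec=10` value are illustrative assumptions.

```shell
# Stage a drop-in like the one `systemctl edit payment-processor` would write.
# Assumption: staged in a scratch directory; the real location would be
# /etc/systemd/system/payment-processor.service.d/override.conf
stage_dir="$(mktemp -d)"
dropin_dir="$stage_dir/payment-processor.service.d"
mkdir -p "$dropin_dir"

cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# Only keys listed here are overridden; the rest comes from the main unit file.
RestartSec=10
EOF

cat "$dropin_dir/override.conf"
```

Once a drop-in is installed, `sudo systemctl daemon-reload` makes systemd pick it up, and `systemctl cat payment-processor` shows the main file together with its overrides.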
## Part 4: Rewriting the Unit File
Here's the hardened version:
# /etc/systemd/system/payment-processor.service
[Unit]
Description=Payment Processing Worker
# The postgresql.service entries assume PostgreSQL runs on this host;
# omit them when the database is remote.
After=network-online.target postgresql.service
Wants=network-online.target
Requires=postgresql.service
# Start rate limiting is a [Unit] setting: give up after 5 starts in 300 s
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=notify
ExecStartPre=/opt/payments/bin/processor --validate-config /etc/payments/app.conf
ExecStart=/opt/payments/bin/processor --config /etc/payments/app.conf
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10
User=payments
Group=payments
WorkingDirectory=/opt/payments
EnvironmentFile=/etc/payments/env
# Resource controls
MemoryMax=1G
MemoryHigh=768M
CPUQuota=200%
TasksMax=256
LimitNOFILE=65536
# Security hardening
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
NoNewPrivileges=true
ReadWritePaths=/var/lib/payments /var/log/payments
ProtectKernelTunables=true
ProtectKernelModules=true
RestrictSUIDSGID=true
[Install]
WantedBy=multi-user.target
Let's break this into groups.
### Dependencies
After=network-online.target postgresql.service
Wants=network-online.target
Requires=postgresql.service
The original had After=network.target. That's a trap.
| Target | What it means |
|---|---|
| `network.target` | Network interfaces are configured |
| `network-online.target` | Network is actually reachable |
Gotcha: `After=network.target` is the #1 cause of "works on restart, fails on boot." The service starts before the network is up. Use `network-online.target` with `Wants=network-online.target` (it isn't pulled in by default).

Under the Hood: `After=` controls ordering. `Requires=` controls dependency. They're orthogonal. `Requires=` without `After=` starts both simultaneously, so your app races against its database. `After=` without `Requires=` waits for it if it's starting, but doesn't pull it in. You almost always want both together.
### Service type and startup
Type=notify means the service explicitly tells systemd when it's ready by calling
sd_notify("READY=1"). With Type=simple (the original), systemd considers it "started"
instantly after fork() — health checks and dependent services don't wait for real readiness.
ExecStartPre= validates the config before starting. If invalid, you get a clear error
instead of a crash 10 seconds later.
| `Type=` value | When systemd considers it "started" | Best for |
|---|---|---|
| `simple` | Immediately after fork() | Scripts, most binaries |
| `exec` | After the binary successfully exec()'s | Catching missing-binary errors |
| `notify` | When the service calls sd_notify() | Apps with startup initialization |
| `forking` | When the parent process exits | Legacy daemons that double-fork |
| `oneshot` | When the process exits | Scripts that run and finish |
War Story: A team set `Type=simple` for a Java service that took 30 seconds to initialize its connection pool. The load balancer started sending traffic immediately. Every deploy caused 30 seconds of 503 errors. Switching to `Type=notify` fixed it: the service received no traffic until the connection pool was warm.
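For a service written as a shell script, readiness can be signalled with the `systemd-notify` command, which sends the same `READY=1` message as `sd_notify()`. A minimal sketch, assuming a hypothetical `notify_ready` wrapper; outside systemd (no `NOTIFY_SOCKET` in the environment) it only prints what it would have done.

```shell
# notify_ready: hypothetical wrapper that tells systemd the service is ready.
# systemd sets NOTIFY_SOCKET for Type=notify services; without it we are not
# running under systemd, so just report that and carry on.
notify_ready() {
  if [ -n "${NOTIFY_SOCKET:-}" ]; then
    systemd-notify --ready --status="$1"
  else
    echo "READY=1 not sent (NOTIFY_SOCKET unset): $1"
  fi
}

# Call this only after slow initialization (connection pools, caches) is done.
notify_ready "connection pool warm"
```

Script-based services often also need `NotifyAccess=all` in the unit, because `systemd-notify` runs as a child of the main process.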
### Restart policy — the 10-second default that bites everyone

| Policy | Restarts on... | Doesn't restart on... |
|---|---|---|
| `always` | Everything: clean exit, error, signal | Explicit `systemctl stop` |
| `on-failure` | Non-zero exit, signal death, timeout | Clean exit (code 0), `systemctl stop` |
| `on-abnormal` | Signal, timeout, watchdog | Any exit code (even non-zero) |
War Story: The default `StartLimitIntervalSec` is 10 seconds and `StartLimitBurst` is 5. With `RestartSec=0`, a service that crashes on startup hits this limit in under a second. The service enters "failed" state and you get: `Start request repeated too quickly.` `systemctl start` refuses. The immediate remedy is `systemctl reset-failed <unit>`, then fix the real problem. The lasting fix is setting `RestartSec=5` or higher so crash loops never trigger the rate limit.
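Whether a crash loop trips the rate limit is simple arithmetic. As a rough model (an approximation of systemd's accounting, not its exact implementation): if each cycle takes `runtime + RestartSec` seconds, start number `StartLimitBurst + 1` happens `StartLimitBurst` cycles after the first, and it is refused if that still falls inside `StartLimitIntervalSec`. The helper name `hits_start_limit` is hypothetical.

```shell
# hits_start_limit RUNTIME RESTART_SEC INTERVAL BURST  (all in whole seconds)
# Rough model: start number BURST+1 happens BURST cycles after the first;
# if that still falls inside the interval, systemd refuses it.
hits_start_limit() {
  period=$(( $1 + $2 ))
  if [ $(( $4 * period )) -le "$3" ]; then echo yes; else echo no; fi
}

hits_start_limit 0 0 10 5      # instant crash, RestartSec=0, default limits
hits_start_limit 0 5 10 5      # instant crash, RestartSec=5, default limits
hits_start_limit 45 10 300 5   # 45 s crash cycle, RestartSec=10, 300 s / 5 starts
```

The three calls print `yes`, `no`, `yes`. Under this model, the last case means a persistent 45-second crash loop with a 300-second interval and a burst of 5 ends in "failed" after five attempts instead of restarting forever.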
### Flashcard check — dependencies and lifecycle

| Question | Cover the answer, then check |
|---|---|
| What's the difference between `network.target` and `network-online.target`? | `network.target` = interfaces configured. `network-online.target` = network reachable. Use the latter for outbound connections. |
| Why pair `Requires=` with `After=`? | `Requires=` = must be running. `After=` = start after. Without `After=`, both start simultaneously. |
| What happens when `StartLimitBurst` is exceeded? | The unit enters "failed" state. Clear with `systemctl reset-failed`. |
| `Type=simple` vs `Type=notify`? | `simple`: started on fork(). `notify`: started when service calls `sd_notify("READY=1")`. |
## Part 5: Resource Controls — cgroups You Didn't Know You Were Using
Every systemd service runs inside a cgroup (control group). You can see the hierarchy with `systemd-cgls`.
Under the Hood: cgroups v2 uses a single unified hierarchy. systemd was the driving force behind cgroups v2, and Lennart Poettering was one of the strongest advocates. Each service's cgroup lives at `/sys/fs/cgroup/system.slice/<service>.service/`. Read raw values directly: `cat /sys/fs/cgroup/system.slice/payment-processor.service/memory.current`
| Directive | What it does | On exceed |
|---|---|---|
| `MemoryMax=1G` | Hard memory ceiling | cgroup OOM killer fires (SIGKILL) |
| `MemoryHigh=768M` | Soft memory throttle | Kernel slows allocations |
| `CPUQuota=200%` | CPU limit (200% = 2 cores) | Throttled, not killed |
| `TasksMax=256` | Max threads/processes | Fork fails with EAGAIN |
| `LimitNOFILE=65536` | Max open file descriptors | open() fails with EMFILE |
The killer combination is MemoryHigh + MemoryMax. Think of it as a warning track and a
wall. At 768M the kernel throttles — the process slows but lives. At 1G the OOM killer fires.
This gives you a window to notice before the process dies.
# Current memory usage
systemctl show payment-processor -p MemoryCurrent
# Live cgroup resource monitor
systemd-cgtop
Gotcha: The system can have 32 GB free and your service still gets OOM-killed. `MemoryMax` is cgroup-scoped; it doesn't care about system-wide memory. Always check `MemoryMax` before investigating system memory pressure.
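The warning-track-and-wall idea can be checked straight from the cgroup files. A minimal sketch, assuming a cgroups v2 host: `cgroup_mem_state` is a hypothetical helper that reads `memory.current`, `memory.high`, and `memory.max` from a cgroup directory passed as its argument and reports which zone the service is in.

```shell
# cgroup_mem_state DIR: report where a cgroup sits relative to its limits.
# DIR is a cgroup v2 directory, e.g.
#   /sys/fs/cgroup/system.slice/payment-processor.service
# The files hold byte counts, or the literal word "max" when no limit is set.
cgroup_mem_state() {
  dir="$1"
  current="$(cat "$dir/memory.current")"
  high="$(cat "$dir/memory.high")"
  max="$(cat "$dir/memory.max")"
  mib=$(( current / 1048576 ))
  if [ "$max" != "max" ] && [ "$current" -ge "$max" ]; then
    echo "$mib MiB: at MemoryMax, OOM kill imminent"
  elif [ "$high" != "max" ] && [ "$current" -ge "$high" ]; then
    echo "$mib MiB: above MemoryHigh, kernel is throttling"
  else
    echo "$mib MiB: within limits"
  fi
}
```

Taking the directory as a parameter keeps the helper testable against sample files and reusable for any unit.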
## Part 6: Security Hardening — Free Protection

| Directive | What it does |
|---|---|
| `ProtectSystem=strict` | Mounts filesystem read-only except `ReadWritePaths=` |
| `ProtectHome=true` | /home, /root inaccessible |
| `PrivateTmp=true` | Isolated /tmp per service |
| `NoNewPrivileges=true` | No privilege escalation (no setuid, no capabilities) |
| `ProtectKernelTunables=true` | /proc/sys/, /sys/ read-only |
| `ProtectKernelModules=true` | Block loading kernel modules |
| `RestrictSUIDSGID=true` | Prevent creating setuid/setgid files |
If the payment processor gets compromised, the attacker can't write to the filesystem (except two directories), can't read home directories, can't escalate privileges, and can't load kernel modules. All for free.
Gotcha: `ProtectSystem=strict` with no `ReadWritePaths=` means the service can't write anywhere. The error in the journal is often just "Permission denied", which looks like a user/group problem. Exit code `226/NAMESPACE` in `systemctl status` is the telltale sign that sandboxing directives failed.
## Part 7: journald Deep Dive
The journal saved us this morning. Let's go deeper.
### Structured fields — the killer feature
journald stores entries as structured data, not text. Every entry has machine-readable fields:
{
"_PID": "29104",
"_UID": "997",
"_COMM": "processor",
"_SYSTEMD_UNIT": "payment-processor.service",
"MESSAGE": "Connected. Loading payment queue...",
"PRIORITY": "6"
}
These fields are searchable:
# Every log line from PID 29104
journalctl _PID=29104
# Every log line from UID 997 across all services
journalctl _UID=997
Trivia: journald's binary log format was one of the most controversial systemd decisions. Critics: you can't `cat` and `grep` your logs. Supporters: structured binary enables indexed searching, integrity verification, and fields that text can't represent. The debate helped spawn Devuan (Debian without systemd, 2014).
### Patterns you'll actually use
journalctl -u payment-processor -f # Follow live
journalctl -u payment-processor -p err --since "1h ago" # Recent errors
journalctl -b -1 # Previous boot logs
journalctl -k # Kernel messages (like dmesg)
journalctl --disk-usage # Journal disk space
journalctl --vacuum-time=7d # Prune old entries
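The `-p err` filter depends on each entry's `PRIORITY` field. For a service that simply writes to stdout or stderr, systemd by default reads an optional `<N>` prefix at the start of a line as the syslog priority (3 = err, 4 = warning, 6 = info). A minimal sketch with the hypothetical helper names `log_info` and `log_err`:

```shell
# Hypothetical logging helpers for a shell-based service.
# journald strips the <N> prefix and stores N as the PRIORITY field.
log_info() { printf '<6>%s\n' "$*"; }
log_err()  { printf '<3>%s\n' "$*" >&2; }

log_info "Connected. Loading payment queue..."
log_err  "database connection lost: SSL handshake timeout"
```

With prefixes like these, `journalctl -u <unit> -p err` returns only the second line, even though both went through plain `printf`.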
### Persistent vs volatile storage
With the default `Storage=auto`, journald writes to `/run/log/journal/` (tmpfs, gone on reboot) unless `/var/log/journal/` exists. For persistence, create `/var/log/journal/` or set `Storage=persistent` in `/etc/systemd/journald.conf`.
Gotcha: Without size limits, persistent storage can eat your `/var` partition. Always set explicit limits such as `SystemMaxUse=` in `/etc/systemd/journald.conf`.
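A sketch of such limits, staged in a scratch directory for review before being copied to `/etc/systemd/journald.conf.d/`. `SystemMaxUse=`, `SystemKeepFree=`, and `MaxRetentionSec=` are standard `journald.conf` options; the values and the file name `50-size-limits.conf` are illustrative assumptions to adjust for your disk.

```shell
# Stage a journald drop-in with explicit size limits.
# Assumption: values are examples, not recommendations for every host.
stage_dir="$(mktemp -d)"
conf="$stage_dir/50-size-limits.conf"

cat > "$conf" <<'EOF'
[Journal]
Storage=persistent
# Cap total journal size on disk
SystemMaxUse=500M
# Always leave this much free on the filesystem
SystemKeepFree=1G
# Drop entries older than this
MaxRetentionSec=1month
EOF

cat "$conf"
```

After the file is installed, `sudo systemctl restart systemd-journald` applies it and `journalctl --disk-usage` confirms the effect.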
## Part 8: Socket Activation — Why It's Elegant
Traditional startup: systemd starts a service, the service opens a socket, clients connect.
Socket activation: systemd opens the socket first, queues connections, starts the service
on first connection, and passes the open fd via $LISTEN_FDS.
# /etc/systemd/system/myapi.socket
[Unit]
Description=My API Socket
[Socket]
ListenStream=8080
Accept=no
[Install]
WantedBy=sockets.target
# /etc/systemd/system/myapi.service
[Unit]
Description=My API Server
Requires=myapi.socket
[Service]
Type=notify
ExecStart=/opt/myapi/bin/server
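On the service side, the hand-off is just environment variables: systemd sets `LISTEN_PID` to the service's PID and `LISTEN_FDS` to the number of sockets passed, and the sockets start at file descriptor 3. A minimal sketch of that check in shell, with `listen_fds` as a hypothetical helper name (real servers usually call `sd_listen_fds()` or a language equivalent):

```shell
# listen_fds: print the file descriptor numbers inherited via socket activation.
# Returns non-zero when nothing was passed to this process.
listen_fds() {
  # LISTEN_PID must match our own PID, otherwise the fds were meant for someone else
  [ "${LISTEN_PID:-}" = "$$" ] || return 1
  count="${LISTEN_FDS:-0}"
  fd=3                                   # passed descriptors start at 3
  fds=""
  while [ "$fd" -lt $(( 3 + count )) ]; do
    fds="$fds${fds:+ }$fd"
    fd=$(( fd + 1 ))
  done
  [ -n "$fds" ] && echo "$fds"
}

listen_fds || echo "not socket-activated"
```

Run by hand, the script prints `not socket-activated`; started through a `.socket` unit with one `ListenStream=`, the same call would report descriptor 3.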
Why this is elegant:
- Zero-downtime restart. systemd holds the socket during service restart. No dropped connections.
- On-demand startup. Rarely-used services start only when someone connects. Faster boot, less memory.
- Implicit dependency resolution. Service A connects to B's socket before B is running. The connection queues. B starts, inherits the socket, completes the connection.
Trivia: Socket activation was inspired by Apple's launchd (2005, macOS). The idea of passing open file descriptors from a supervisor to a service dates back to inetd (1985), the original Unix "internet super-server." systemd's version is inetd's idea scaled to manage an entire operating system.
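## Part 9: Timers — Replacing Cron
You notice a crontab entry on this server that runs the stale-payment cleanup every six hours and writes its output to a log file. It has no overlap prevention, no resource limits, and no missed-run recovery, and its logs bypass the journal. Let's replace it.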
# /etc/systemd/system/payment-cleanup.service
[Unit]
Description=Clean up stale payment records
After=postgresql.service
[Service]
Type=oneshot
ExecStart=/opt/payments/bin/cleanup-stale --older-than 24h
User=payments
Group=payments
MemoryMax=512M
# /etc/systemd/system/payment-cleanup.timer
[Unit]
Description=Run payment cleanup every 6 hours
[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable --now payment-cleanup.timer
systemd-analyze calendar "*-*-* 00/6:00:00" # Validate the schedule
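In `OnCalendar=`, `00/6` in the hour field means "starting at 00, then every 6 hours." To make that concrete without a systemd host, here is a small sketch that expands a `start/step` hour expression the same way. `expand_hours` is a hypothetical helper; `systemd-analyze calendar` remains the authoritative check.

```shell
# expand_hours START STEP: list the hours (00-23) matched by START/STEP.
# Pass plain decimal numbers (0, not 00) and a step of at least 1.
expand_hours() {
  [ "$2" -ge 1 ] || return 1
  hour="$1"
  hours=""
  while [ "$hour" -le 23 ]; do
    hours="$hours${hours:+ }$(printf '%02d' "$hour")"
    hour=$(( hour + $2 ))
  done
  echo "$hours"
}

expand_hours 0 6   # the hour field 00/6 from payment-cleanup.timer
```

The call prints `00 06 12 18`, which are the four daily runs the timer schedules (before `RandomizedDelaySec` adds its jitter).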
| Feature | cron | systemd timer |
|---|---|---|
| Logging | Redirect to file or email | Automatic via journald |
| Missed runs | Lost forever | Persistent=true runs on next boot |
| Overlap prevention | Requires `flock` wrapper | Automatic (oneshot type) |
| Resource limits | None | MemoryMax, CPUQuota, etc. |
| Fleet load spread | Not possible | RandomizedDelaySec adds jitter |
Remember: Timer advantages — PLRR: Persistent (missed runs recovered), Logging (journald), Randomized delay (no thundering herd), Resource limits.
## Part 10: Transient Units and Boot Analysis
The database SSL rotation is done. Before restarting the service, test connectivity with a transient unit — a one-off command with full cgroup isolation:
sudo systemd-run \
--unit=db-connectivity-test \
--property=MemoryMax=256M \
--property=User=payments \
/opt/payments/bin/processor --test-db-connection
The transient unit disappears when the process exits. Logs stay in the journal.
While you're on this server, check boot time:
systemd-analyze # Total boot time
systemd-analyze blame | head -5 # Slowest units
systemd-analyze critical-chain payment-processor.service # Dependency chain
payment-processor.service +301ms
└─postgresql.service @4.112s +412ms
└─network-online.target @4.001s
└─NetworkManager-wait-online.service @1.667s +2.334s
The 2.3-second NetworkManager-wait-online is the real bottleneck. Now you know where to
look if boot time becomes a problem.
## Part 11: Restart and Verify
systemctl status payment-processor # Running?
journalctl -u payment-processor -f # Watch startup
systemctl show payment-processor -p NRestarts # Should be 0
systemctl show payment-processor -p MemoryMax,MemoryHigh # Limits applied?
systemd-analyze security payment-processor # Security score
Payments are flowing. The restart loop is gone. The service is hardened.
## Flashcard Check — Part 2

| Question | Cover the answer, then check |
|---|---|
| What does `ProtectSystem=strict` do? | Mounts filesystem read-only except `ReadWritePaths=`. Exit code `226/NAMESPACE` = sandboxing failure. |
| `MemoryHigh` vs `MemoryMax`? | `MemoryHigh` = soft throttle. `MemoryMax` = hard kill. Use both for graduated response. |
| What does `Persistent=true` do in a timer? | Runs the job on next boot if it was missed. Cron can't do this. |
| How do you create a one-off supervised command? | `systemd-run --property=MemoryMax=256M ./cmd` creates a transient unit. |
| `mask` vs `disable`? | `disable` removes boot symlink. `mask` symlinks to `/dev/null`, which blocks starting by any means. |
## Exercises

### Exercise 1: Read a crash loop (2 minutes)
Pick a failed unit. Run systemctl status <unit> and journalctl -u <unit> -n 30. Can you
identify the exit code and root cause?
What to look for
Common exit codes: `203/EXEC` (binary not found), `217/USER` (user doesn't exist), `226/NAMESPACE` (sandboxing failed), `1/FAILURE` (generic app error; check app logs).

### Exercise 2: Write a timer (10 minutes)
Replace this crontab with a systemd timer:
Requirements: Type=oneshot, persistent, 60-second random delay, 128M memory limit.
Solution
# /etc/systemd/system/disk-check.service
[Unit]
Description=Check disk space
[Service]
Type=oneshot
ExecStart=/usr/local/bin/check-disk-space.sh
MemoryMax=128M
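The solution above shows only the service half. A matching timer could look like the sketch below, staged in a scratch directory. `Persistent=true` and `RandomizedDelaySec=60` come from the exercise requirements; `OnCalendar=hourly` is a placeholder assumption, since the crontab schedule is not shown here, and should be replaced with the real one.

```shell
# Stage disk-check.timer for review; install it next to disk-check.service
# in /etc/systemd/system/ once the schedule is filled in.
stage_dir="$(mktemp -d)"
timer="$stage_dir/disk-check.timer"

cat > "$timer" <<'EOF'
[Unit]
Description=Run disk space check on a schedule

[Timer]
# Placeholder (assumption): substitute the schedule from the crontab entry
OnCalendar=hourly
Persistent=true
RandomizedDelaySec=60

[Install]
WantedBy=timers.target
EOF

cat "$timer"
```

Because the timer and service share the base name `disk-check`, the timer activates `disk-check.service` without an explicit `Unit=` line.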
### Exercise 3: Security audit (15 minutes)
Pick three services and check their security scores:
for svc in sshd nginx postgresql; do
echo "=== $svc ==="
systemd-analyze security "$svc" 2>/dev/null | tail -1
done
Which scores worst? Write a drop-in override adding ProtectSystem=strict,
NoNewPrivileges=true, and PrivateTmp=true. Restart. Does it still work? If not, what
ReadWritePaths= does it need?
## Cheat Sheet

### Service lifecycle

| Command | Effect |
|---|---|
| `systemctl start/stop/restart <unit>` | Control running state |
| `systemctl reload <unit>` | Run the unit's `ExecReload=` (typically SIGHUP: re-read config, no downtime) |
| `systemctl enable --now <unit>` | Start now + start on boot |
| `systemctl mask <unit>` | Prevent starting by any means |
| `systemctl daemon-reload` | Re-read unit files from disk |
| `systemctl reset-failed <unit>` | Clear "failed" state |
### Diagnostics

| Command | Shows |
|---|---|
| `systemctl status <unit>` | State, PID, memory, recent logs |
| `systemctl cat <unit>` | Effective unit file with overrides |
| `systemctl list-units --failed` | All failed units |
| `journalctl -u <unit> -f` | Live logs |
| `journalctl -u <unit> -p err --since "1h ago"` | Recent errors |
| `journalctl -b -1` | Previous boot |
| `systemd-analyze security <unit>` | Security score |
| `systemd-analyze blame` | Slowest boot units |
| `systemd-cgtop` | Live cgroup resource monitor |
### Resource directives

| Directive | On exceed |
|---|---|
| `MemoryMax=` | SIGKILL (OOM) |
| `MemoryHigh=` | Kernel throttles |
| `CPUQuota=` | Throttled |
| `TasksMax=` | Fork fails |
| `RuntimeMaxSec=` | Service terminated and marked failed (restarts only if `Restart=` allows) |
### Security directives

| Directive | Effect |
|---|---|
| `ProtectSystem=strict` | Filesystem read-only except `ReadWritePaths=` |
| `ProtectHome=true` | /home, /root inaccessible |
| `PrivateTmp=true` | Isolated /tmp |
| `NoNewPrivileges=true` | No privilege escalation |
## Takeaways

- `systemctl status` first, `journalctl` second. The exit code narrows the problem space before you start reading logs.
- `Requires=` needs `After=`. Dependency without ordering means simultaneous startup. Always pair them.
- `MemoryHigh` + `MemoryMax` = graduated response. Soft throttle before hard kill. Never use `MemoryMax` alone.
- Timers over cron, always. Persistent missed-run recovery, journal logging, resource limits, random delay. No good reason for new cron jobs.
- Security hardening is free. `ProtectSystem=strict`, `NoNewPrivileges=true`, `PrivateTmp=true` on every service. Fix the `ReadWritePaths=` errors that follow.
- `daemon-reload` after every unit file change. This will bite you exactly once.
## Related Lessons

- The Hanging Deploy: processes, signals, systemd stop behavior, and `TimeoutStopSec`
- From Init Scripts to systemd: SysV init to Upstart to systemd, and why the controversy matters
- The Disk That Filled Up: journald storage limits, log rotation, the `/var` disaster
- Out of Memory: cgroup OOM vs system OOM, the OOM killer's scoring algorithm