systemd Street Ops
Debugging Service Failures: The Flowchart
Service won't start
|
+-> systemctl status <unit>
| Look at: Active line, loaded line, last 10 log lines
|
+-> journalctl -u <unit> -n 50 --no-pager
| Look at: Error messages, exit codes, permission denied
|
+-> systemctl cat <unit>
| Look at: ExecStart path exists? User correct? WorkingDirectory exists?
|
+-> Is the binary executable? Does the user have permissions?
|
+-> Check dependencies: systemctl list-dependencies <unit>
|
+-> Try running ExecStart manually as the service user:
sudo -u <serviceuser> /path/to/binary --flags
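The read-only steps of the flowchart can be bundled into one helper. This is a rough sketch, not a standard tool: the function name triage_unit is hypothetical, and it only chains the systemctl and journalctl invocations already listed above.

```shell
# triage_unit is a hypothetical helper name. It runs the read-only checks
# from the flowchart, in order, for a single unit.
triage_unit() {
    local unit="$1"
    if [ -z "$unit" ]; then
        echo "usage: triage_unit <unit>" >&2
        return 2
    fi
    echo "== status =="
    systemctl --no-pager status "$unit"
    echo "== recent logs =="
    journalctl -u "$unit" -n 50 --no-pager
    echo "== unit file and drop-ins =="
    systemctl cat "$unit"
    echo "== dependencies =="
    systemctl list-dependencies "$unit"
}
```

Call it as triage_unit nginx.service. The manual ExecStart run as the service user is left out on purpose, since it can have side effects.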
Reading journalctl Effectively
Time-bounded queries save your life:
journalctl -u nginx --since "10 minutes ago"
journalctl -u nginx --since "2024-01-15 14:00" --until "2024-01-15 14:30"
Follow mode for live debugging:
Boot-specific logs:
journalctl -b # current boot
journalctl -b -1 # previous boot
journalctl --list-boots # see all stored boots
Priority filtering:
JSON output for parsing:
Disk usage:
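The commands for the follow, priority, JSON, and disk-usage items are not shown above. As a hedged sketch using standard journalctl options (the wrapper names are made up for illustration, and the choice of err as the priority is an assumption):

```shell
# Hypothetical wrapper names; each wraps one standard journalctl option.

# Follow mode: stream new entries for a unit as they arrive (Ctrl-C to stop).
jfollow() { journalctl -u "$1" -f; }

# Priority filtering: only entries at priority "err" or more severe.
jerrors() { journalctl -u "$1" -p err --no-pager; }

# JSON output: one JSON object per entry, suitable for piping into a parser.
jjson() { journalctl -u "$1" -o json --no-pager; }

# Disk usage: how much space the stored journal files occupy.
jdisk() { journalctl --disk-usage; }
```

For example, jerrors nginx prints only the error-level entries for the nginx unit.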
Common Pitfalls
1. "I edited the unit file but nothing changed"
You forgot systemctl daemon-reload. systemd caches unit files in memory.
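A minimal sketch of the fix, with hypothetical helper names. NeedDaemonReload is a unit property that systemctl show can print; it reads yes when the file on disk is newer than what systemd has loaded.

```shell
# reload_and_restart is a hypothetical helper name.
# Re-read unit files from disk, then restart so the new settings take effect.
reload_and_restart() {
    systemctl daemon-reload && systemctl restart "$1"
}

# Ask systemd whether the on-disk unit differs from the loaded one.
needs_reload() {
    systemctl show -p NeedDaemonReload --value "$1"
}
```

Typical use: needs_reload myapp.service, and if it prints yes, reload_and_restart myapp.service.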
2. "Service starts then immediately dies"
- Check Type= in the unit file. If Type=forking, the process must actually fork. If it doesn't fork, use Type=simple or Type=exec.
- Check if the process needs a config file or environment variable that's missing.
- ExecStart= should use absolute paths. systemd does not use your shell's PATH, and a relative path such as ./bin/app is rejected when the unit is loaded.
3. "Service works manually but not under systemd"
- Environment is different. systemd provides a minimal environment. Add Environment= or EnvironmentFile= directives.
- Working directory is / by default. Set WorkingDirectory=.
- User is root by default unless User= is set. But maybe you tested as a different user.
- SELinux/AppArmor context may differ.
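One way to approximate the first two differences from an interactive shell is to run the command from / with an emptied environment. This is only a sketch: run_like_unit is a hypothetical helper, the PATH value is an illustrative assumption, and it does not reproduce the service user, cgroups, or SELinux/AppArmor context.

```shell
# run_like_unit is a hypothetical helper. It runs a command from / with a
# cleared environment. The PATH below is an illustrative assumption and may
# not match the default your systemd build uses.
run_like_unit() {
    ( cd / && env -i PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin "$@" )
}
```

If run_like_unit /path/to/binary --flags fails where a normal invocation works, a missing variable or a relative path is the likely culprit.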
4. "Service keeps restarting in a loop"
Check Restart= and RestartSec= directives. Look at systemctl show <unit> -p StartLimitBurst -p StartLimitIntervalUSec (show reports the interval under its USec property name). systemd rate-limits restarts. Once you hit the limit, the unit enters "failed" state.
Reset with: systemctl reset-failed <unit>
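A short sketch of both steps, using hypothetical helper names. The properties queried are standard unit properties; NRestarts counts automatic restarts and may be absent on older systemd versions.

```shell
# show_restart_policy is a hypothetical helper name.
# Print the restart settings, the rate limit, and the restart counter.
show_restart_policy() {
    systemctl show "$1" -p Restart -p RestartUSec -p StartLimitBurst -p StartLimitIntervalUSec -p NRestarts
}

# Clear the failed state (this also resets the start rate-limit counter),
# then try starting the unit again.
retry_unit() {
    systemctl reset-failed "$1" && systemctl start "$1"
}
```

Fix the underlying failure before calling retry_unit, otherwise the unit will simply hit the limit again.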
5. "Timer fires at wrong time"
- OnCalendar= uses a specific format. Validate with: systemd-analyze calendar "Mon *-*-* 02:00:00"
- Timers using OnBootSec= or OnUnitActiveSec= are monotonic, not wall-clock.
- Persistent=true applies to OnCalendar= timers. It runs a missed job as soon as the timer is next activated, typically at the next boot after the system was off.
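The two checks worth running first can be sketched as follows. The helper names are hypothetical; systemd-analyze calendar and systemctl list-timers are the standard commands.

```shell
# check_calendar and next_timers are hypothetical helper names.

# Parse an OnCalendar= expression; systemd-analyze prints its normalized
# form and the next time it would elapse.
check_calendar() { systemd-analyze calendar "$1"; }

# List all timers, including inactive ones, with last and next trigger times.
next_timers() { systemctl list-timers --all --no-pager; }
```

For example: check_calendar "Mon *-*-* 02:00:00".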
6. "Can't see logs for a service"
- Service might write to a file, not stdout/stderr. Check ExecStart= for log-file redirects.
- StandardOutput=journal and StandardError=journal are defaults, but may be overridden.
- Journal storage might be volatile (/run/log/journal/) instead of persistent (/var/log/journal/). Check /etc/systemd/journald.conf for Storage=.
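A rough check for the storage question, assuming the default Storage=auto, where journald writes persistently only if /var/log/journal exists. The helper name and its optional root-prefix argument are hypothetical.

```shell
# journal_location is a hypothetical helper. It reports where the journal
# lives by checking which directory exists. The optional argument is a root
# prefix, useful when inspecting a mounted image. With an explicit
# Storage=volatile setting this check can be misleading.
journal_location() {
    local root="${1:-}"
    if [ -d "$root/var/log/journal" ]; then
        echo "persistent: $root/var/log/journal"
    elif [ -d "$root/run/log/journal" ]; then
        echo "volatile: $root/run/log/journal"
    else
        echo "no journal directory found"
    fi
}
```

Under Storage=auto, creating /var/log/journal as root and restarting systemd-journald switches the journal to persistent storage.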
Writing Override Files
Add environment variable without touching the unit:
systemctl edit myapp
# Opens editor. Add:
[Service]
Environment="DATABASE_URL=postgres://localhost/mydb"
Change resource limits:
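The override body for this item is not shown. As a minimal sketch, assuming the same myapp unit and purely illustrative limit values: MemoryMax=, CPUQuota=, and LimitNOFILE= are standard directives, but the numbers are placeholders. The hypothetical helper below writes the drop-in file directly, which is equivalent to typing the same lines into systemctl edit.

```shell
# write_limits_dropin is a hypothetical helper. It writes the kind of
# drop-in that "systemctl edit" creates. The limit values are placeholders,
# not recommendations. The second argument overrides the target directory.
write_limits_dropin() {
    local unit="$1" base="${2:-/etc/systemd/system}"
    mkdir -p "$base/$unit.d"
    cat > "$base/$unit.d/limits.conf" <<'EOF'
[Service]
MemoryMax=512M
CPUQuota=50%
LimitNOFILE=65536
EOF
}
```

After write_limits_dropin myapp.service (as root), run systemctl daemon-reload and restart the unit.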
Change restart behavior:
systemctl edit myapp
# Start-limit settings belong in [Unit], not [Service]
[Unit]
StartLimitBurst=5
StartLimitIntervalSec=60
[Service]
Restart=on-failure
RestartSec=5
Important: Drop-in overrides for [Service] section are additive for some directives but replacing for others. Environment= is additive. ExecStart= must be cleared first with an empty ExecStart= line, then set to the new value.
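The clear-then-set pattern for ExecStart= can be sketched like this. The helper name is hypothetical and the command line is whatever you pass in.

```shell
# write_execstart_dropin is a hypothetical helper. The empty ExecStart= line
# clears the value inherited from the main unit file; the second line then
# sets the replacement command.
write_execstart_dropin() {
    local unit="$1" cmd="$2" base="${3:-/etc/systemd/system}"
    mkdir -p "$base/$unit.d"
    printf '[Service]\nExecStart=\nExecStart=%s\n' "$cmd" > "$base/$unit.d/override.conf"
}
```

Without the empty line, systemd would see two ExecStart= entries, which is only valid for Type=oneshot.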
Boot Order Debugging
systemd-analyze # total boot time
systemd-analyze blame # time per unit
systemd-analyze critical-chain # dependency chain with timing
systemd-analyze critical-chain nginx.service # chain for specific unit
systemd-analyze plot > boot.svg # visual boot chart
Slice and Scope Management (cgroups)
View the cgroup tree:
See resource usage per slice:
Custom slice for resource isolation:
Then in your service:
Transient cgroup control (no unit file needed):
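The commands for the five items above are not shown. A hedged sketch using standard tools follows; the helper names, the batch.slice name, and every limit value are assumptions made for illustration.

```shell
# Hypothetical helper names; the tools and options themselves are standard.

# View the cgroup tree.
cg_tree() { systemd-cgls --no-pager; }

# See live resource usage per cgroup (interactive, similar to top).
cg_top() { systemd-cgtop; }

# Custom slice for resource isolation. batch.slice is a made-up name and
# the limits are placeholders.
write_batch_slice() {
    local base="${1:-/etc/systemd/system}"
    cat > "$base/batch.slice" <<'EOF'
[Slice]
MemoryMax=2G
CPUWeight=50
EOF
}

# In the service, join the slice by adding Slice=batch.slice under [Service].

# Transient cgroup control: run a command in a scope with a memory cap,
# without writing any unit file.
run_capped() { systemd-run --scope -p MemoryMax=1G "$@"; }
```

For example, run_capped ./heavy-job.sh runs a (hypothetical) script under a 1G memory ceiling.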
Decision Tree: Which Type= Do I Need?
Does the process fork into the background?
  Yes -> Type=forking (set PIDFile= if possible)
  No -> Does it signal readiness via sd_notify()?
    Yes -> Type=notify
    No -> Does it set up sockets then exit?
      Yes -> Type=oneshot (with RemainAfterExit=yes if needed)
      No -> Type=exec (or Type=simple)
Use Type=exec over Type=simple when possible -- exec waits for the binary to actually execute (catches missing binary errors), while simple considers the unit started as soon as fork() returns.
Failure Modes You Must Recognize
| Symptom | Likely Cause |
|---|---|
| (code=exited, status=203/EXEC) | Binary not found or not executable |
| (code=exited, status=217/USER) | User specified in User= doesn't exist |
| (code=exited, status=226/NAMESPACE) | PrivateTmp=, ProtectSystem=, or other sandboxing failed |
| (code=killed, signal=KILL) | OOM killer or MemoryMax= limit hit |
| (code=killed, signal=ABRT) | Process aborted itself (crash) |
| Start request repeated too quickly | Hit StartLimitBurst -- too many restarts |
| Unit entered failed state | Check journalctl -u <unit> for root cause |
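The numeric rows of the table can be turned into a quick lookup. This is an illustrative sketch with a hypothetical function name; it covers only the three status codes listed in the table.

```shell
# explain_status is a hypothetical helper that maps the status= number shown
# by systemctl status to the likely cause from the table.
explain_status() {
    case "$1" in
        203) echo "EXEC: binary not found or not executable" ;;
        217) echo "USER: user specified in User= doesn't exist" ;;
        226) echo "NAMESPACE: PrivateTmp=, ProtectSystem=, or other sandboxing failed" ;;
        *)   echo "not in the table; check journalctl -u <unit>" ;;
    esac
}
```

One way to feed it is explain_status "$(systemctl show -p ExecMainStatus --value myapp)", where myapp is a placeholder unit name.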
Heuristics
- Always check systemctl status before journalctl. The status output gives you the exit code, which narrows the problem space immediately.
- When debugging timer issues, check both the timer unit and the service unit. The timer triggers the service -- if the service fails, the timer still shows as active.
- systemctl mask is stronger than disable. Mask symlinks the unit to /dev/null. Use it to prevent a unit from ever starting, even as a dependency.
- systemctl list-units --failed is your "what's broken right now" command.
- Watch out for socket activation. If a .socket unit exists, the service may start on first connection, not at boot. systemctl stop myapp might not stop the socket -- the service will restart on next connection.
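For the socket-activation case, a small sketch with hypothetical helper names. It assumes the socket and service share a base name, which is the usual convention but not guaranteed.

```shell
# stop_with_socket is a hypothetical helper. It stops the socket together
# with the service so a new connection cannot re-activate it.
# Assumes <name>.socket and <name>.service share the same base name.
stop_with_socket() {
    systemctl stop "$1.socket" "$1.service"
}

# List units currently in the failed state.
failed_units() { systemctl list-units --failed --no-pager; }
```

For example, stop_with_socket myapp stops both myapp.socket and myapp.service.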
Power One-Liners
See reboot history with timestamps
[!TIP] When to use: "When did this box last reboot?" — for correlating with incidents.
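The one-liner itself is missing here. A plausible candidate, offered as an assumption rather than the original command, is journalctl --list-boots, which prints one line per stored boot with the timestamps of its first and last journal entries. It only shows earlier boots when the journal is persistent.

```shell
# reboot_history is a hypothetical wrapper name around a standard option.
# Each output line is one boot: offset, boot ID, first and last entry time.
reboot_history() { journalctl --list-boots --no-pager; }
```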
Quick Reference
- Cheatsheet: Systemd
- Deep Dive: Systemd Architecture
- Deep Dive: Systemd Service Design Debugging And Hardening
- Deep Dive: Systemd Timers Journald Cgroups And Resource Control
- Deep Dive: Systemd Units Dependencies And Ordering