systemd Street Ops

Debugging Service Failures: The Flowchart

Service won't start
  |
  +-> systemctl status <unit>
  |     Look at: the Active: line, the Loaded: line, the last 10 log lines
  |
  +-> journalctl -u <unit> -n 50 --no-pager
  |     Look at: Error messages, exit codes, permission denied
  |
  +-> systemctl cat <unit>
  |     Look at: ExecStart path exists? User correct? WorkingDirectory exists?
  |
  +-> Is the binary executable? Does the user have permissions?
  |
  +-> Check dependencies: systemctl list-dependencies <unit>
  |
  +-> Try running ExecStart manually as the service user:
        sudo -u <serviceuser> /path/to/binary --flags
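
The read-only steps of the flowchart can be run in one pass. This is a rough sketch, not a standard tool: the function name triage_unit is hypothetical, and it only prints what each step shows so you can read the output top to bottom.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the read-only flowchart steps for one unit.
triage_unit() {
    local unit="$1"
    if [ -z "$unit" ]; then
        echo "usage: triage_unit <unit>" >&2
        return 2
    fi
    echo "== status =="
    systemctl status "$unit" --no-pager
    echo "== last 50 log lines =="
    journalctl -u "$unit" -n 50 --no-pager
    echo "== unit file and drop-ins =="
    systemctl cat "$unit" --no-pager
    echo "== dependencies =="
    systemctl list-dependencies "$unit" --no-pager
    return 0
}
```

Call it as triage_unit nginx.service. The manual ExecStart run is left out on purpose, since it actually starts the program.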

Reading journalctl Effectively

Time-bounded queries save your life:

journalctl -u nginx --since "10 minutes ago"
journalctl -u nginx --since "2024-01-15 14:00" --until "2024-01-15 14:30"

Follow mode for live debugging:

journalctl -u myapp -f

Boot-specific logs:

journalctl -b        # current boot
journalctl -b -1     # previous boot
journalctl --list-boots  # see all stored boots

Priority filtering:

journalctl -p err     # errors and above only
journalctl -p warning -u nginx
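
The priority names are the standard syslog levels, where a lower number means more severe, so "and above" means "this number or lower". A small lookup makes that concrete; the function name priority_number is made up for illustration.

```shell
# Map a journalctl priority name to its syslog number (0 = most severe).
priority_number() {
    case "$1" in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# "-p err" is shorthand for the range emerg..err, i.e. levels 0 through 3.
# An explicit range selects a band instead, e.g. only warning and notice:
#   journalctl -u nginx -p warning..notice
```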

JSON output for parsing:

journalctl -u myapp -o json-pretty -n 5

Disk usage:

journalctl --disk-usage
journalctl --vacuum-size=500M
journalctl --vacuum-time=7d

Common Pitfalls

1. "I edited the unit file but nothing changed"

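You forgot systemctl daemon-reload. systemd caches unit files in memory, so edits on disk are ignored until you reload, and a running service keeps its old settings until it is restarted.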
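
systemd can tell you when its loaded copy is stale: the NeedDaemonReload property reads "yes" if the file on disk changed after it was loaded. The sketch below assumes a systemd new enough to support systemctl show --value; both function names are hypothetical.

```shell
# Hypothetical helper: succeed if systemd reports that the on-disk unit
# file differs from the copy it has loaded.
needs_daemon_reload() {
    local unit="$1"
    [ "$(systemctl show "$unit" -p NeedDaemonReload --value)" = "yes" ]
}

# Usage sketch: reload only when needed, then restart so the change applies.
apply_unit_change() {
    local unit="$1"
    if needs_daemon_reload "$unit"; then
        systemctl daemon-reload
    fi
    systemctl restart "$unit"
}
```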

2. "Service starts then immediately dies"

  • Check Type= in the unit file. If Type=forking, the process must actually fork. If it doesn't fork, use Type=simple or Type=exec.
  • Check if the process needs a config file or environment variable that's missing.
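  • ExecStart= needs an absolute path (recent systemd versions also accept a bare executable name and resolve it against a fixed search path). A relative path such as ./myapp is rejected when the unit is loaded, and the error shows up in systemctl status and the journal, not in your program's output.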

3. "Service works manually but not under systemd"

  • Environment is different. systemd provides a minimal environment. Add Environment= or EnvironmentFile= directives.
  • Working directory is / by default. Set WorkingDirectory=.
  • System services run as root unless User= is set. If you tested manually as a different user, permissions and home directory were different too.
  • SELinux/AppArmor context may differ.
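
To approximate the first two differences from a shell, run the command from / with a scrubbed environment. This is only a rough sketch: the name run_like_systemd is hypothetical, the PATH is an assumption based on systemd's documented default for system services, and nothing here reproduces sandboxing, cgroups, or SELinux/AppArmor contexts.

```shell
# Hypothetical helper: run a command from / with a scrubbed environment,
# roughly approximating what a system service sees.
# Assumption: the PATH below matches your systemd build; check your
# distribution before relying on it.
run_like_systemd() {
    (
        cd / || exit 1
        env -i \
            PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin" \
            LANG="${LANG:-C}" \
            "$@"
    )
}
```

Run it as run_like_systemd /path/to/binary --flags from a shell that is already the service user. If the program only fails this way, a missing variable or a relative path is the likely culprit.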

4. "Service keeps restarting in a loop"

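Check the Restart= and RestartSec= directives, then inspect the rate limit with systemctl show <unit> -p StartLimitBurst -p StartLimitIntervalUSec (the property is reported in its USec form even though the directive is StartLimitIntervalSec=). systemd rate-limits starts. Once the limit is hit, it stops restarting and the unit enters the "failed" state.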

Reset with: systemctl reset-failed <unit>
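
Whether a crash loop ends in "failed" or spins forever comes down to simple arithmetic. The sketch below assumes the process dies the instant it starts, so starts are exactly RestartSec apart. Real services take time to crash, so treat the result as an estimate. The function name trips_start_limit is hypothetical.

```shell
# Hypothetical helper: will an instantly-crashing service trip the limit?
# Arguments: RestartSec, StartLimitBurst, StartLimitIntervalSec (whole seconds).
# Assumption: the process dies immediately, so starts are RestartSec apart.
trips_start_limit() {
    local restart_sec="$1" burst="$2" interval="$3"
    # The (burst+1)-th start comes burst * restart_sec after the first one.
    # If that is still inside the interval, systemd refuses it.
    [ $(( burst * restart_sec )) -le "$interval" ]
}

trips_start_limit 5 5 60  && echo "RestartSec=5: unit ends up failed"
trips_start_limit 15 5 60 || echo "RestartSec=15: unit restarts forever"
```

With a burst of 5 in 60 seconds, RestartSec=5 puts the sixth start at 25 seconds and trips the limit, while RestartSec=15 puts it at 75 seconds and never does.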

5. "Timer fires at wrong time"

  • OnCalendar= uses a specific format. Validate with: systemd-analyze calendar "Mon *-*-* 02:00:00"
  • Timers using OnBootSec= or OnUnitActiveSec= are monotonic, not wall-clock.
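  • Persistent=true makes the timer catch up on a run it missed while the system was off: the service is triggered as soon as the timer is activated again, typically at next boot. It only has an effect on OnCalendar= timers.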

6. "Can't see logs for a service"

  • Service might write to a file, not stdout/stderr. Check ExecStart= for log-file redirects.
  • By default stdout goes to the journal (StandardOutput=journal) and stderr follows it, but a unit or a system-wide default may override either.
  • Journal storage might be volatile (/run/log/journal/) instead of persistent (/var/log/journal/). Check /etc/systemd/journald.conf for Storage=.
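
A quick way to read the configured storage mode is sketched below. It assumes only the main config file matters and ignores any drop-ins under journald.conf.d/, so treat the answer as a first hint. The function name journald_storage is hypothetical.

```shell
# Hypothetical helper: print the Storage= value from one config file.
# Assumption: only this file is read; drop-ins under journald.conf.d/ are ignored.
journald_storage() {
    local conf="${1:-/etc/systemd/journald.conf}"
    local value=""
    if [ -r "$conf" ]; then
        # Take the last uncommented Storage= line, ignoring surrounding spaces.
        value="$(sed -n 's/^[[:space:]]*Storage[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$conf" | tail -n 1)"
    fi
    # journald falls back to "auto" when nothing is configured.
    echo "${value:-auto}"
}
```

With "auto", logs persist only if /var/log/journal/ exists. Otherwise they live in /run/log/journal/ and are lost at reboot.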

Writing Override Files

Add environment variable without touching the unit:

systemctl edit myapp
# Opens editor. Add:
[Service]
Environment="DATABASE_URL=postgres://localhost/mydb"

Change resource limits:

systemctl edit myapp
[Service]
LimitNOFILE=65536
MemoryMax=2G
CPUQuota=150%

Change restart behavior:

systemctl edit myapp
# StartLimit* settings belong in [Unit], not [Service]
[Unit]
StartLimitBurst=5
StartLimitIntervalSec=60

[Service]
Restart=on-failure
RestartSec=5

Important: Drop-in overrides for the [Service] section are additive for some directives and replacing for others. Environment= is additive. ExecStart= must first be cleared with an empty ExecStart= line and then set to the new value:

[Service]
ExecStart=
ExecStart=/usr/bin/myapp --new-flags
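
For scripted changes where an interactive editor is not an option, the same drop-in can be written directly. systemctl edit stores its result in /etc/systemd/system/<unit>.d/override.conf, and the sketch below writes that file by hand. The function name and the optional root argument (handy for dry runs) are hypothetical.

```shell
# Hypothetical helper: write a drop-in that replaces ExecStart= for a unit.
# Arguments: full unit name, new command line, optional root dir (for dry runs).
write_execstart_override() {
    local unit="$1" new_cmd="$2" root="${3:-}"
    local dir="${root}/etc/systemd/system/${unit}.d"
    mkdir -p "$dir" || return 1
    # The empty ExecStart= clears the inherited value before setting the new one.
    printf '[Service]\nExecStart=\nExecStart=%s\n' "$new_cmd" > "${dir}/override.conf"
    echo "${dir}/override.conf"
}
```

Pass the full unit name including its suffix, for example write_execstart_override myapp.service "/usr/bin/myapp --new-flags", then run systemctl daemon-reload and restart the unit.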

Boot Order Debugging

systemd-analyze                          # total boot time
systemd-analyze blame                    # time per unit
systemd-analyze critical-chain           # dependency chain with timing
systemd-analyze critical-chain nginx.service  # chain for specific unit
systemd-analyze plot > boot.svg          # visual boot chart

Slice and Scope Management (cgroups)

View the cgroup tree:

systemd-cgls
systemctl status            # shows the tree too

See resource usage per slice:

systemd-cgtop

Custom slice for resource isolation:

# /etc/systemd/system/heavywork.slice
[Slice]
MemoryMax=4G
CPUQuota=200%

Then in your service:

[Service]
Slice=heavywork.slice

Transient cgroup control (no unit file needed):

systemd-run --scope --slice=heavywork.slice -p MemoryMax=1G ./myscript.sh

Decision Tree: Which Type= Do I Need?

Does the process fork into the background?
  Yes -> Type=forking (set PIDFile= if possible)
  No  -> Does it signal readiness via sd_notify()?
           Yes -> Type=notify
           No  -> Does it set up sockets then exit?
                    Yes -> Type=oneshot (with RemainAfterExit=yes if needed)
                    No  -> Type=exec (or Type=simple)

Use Type=exec over Type=simple when possible -- exec waits for the binary to actually execute (catches missing binary errors), while simple considers the unit started as soon as fork() returns.
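
The tree reads naturally as a chain of checks. The sketch below encodes it as-is and prefers exec over simple in the last branch, following the advice above. The function name choose_type and its yes/no arguments are made up for illustration.

```shell
# Hypothetical helper mirroring the decision tree. Each argument is "yes" or "no":
#   $1 forks into the background
#   $2 signals readiness via sd_notify()
#   $3 exits after doing its work
choose_type() {
    local forks="$1" notifies="$2" exits="$3"
    if [ "$forks" = "yes" ]; then
        echo "forking"
    elif [ "$notifies" = "yes" ]; then
        echo "notify"
    elif [ "$exits" = "yes" ]; then
        echo "oneshot"
    else
        echo "exec"
    fi
}
```

For a foreground daemon with no readiness protocol, choose_type no no no prints exec.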

Failure Modes You Must Recognize

Symptom                              Likely Cause
(code=exited, status=203/EXEC)       Binary not found or not executable
(code=exited, status=217/USER)       User specified in User= doesn't exist
(code=exited, status=226/NAMESPACE)  PrivateTmp=, ProtectSystem=, or other sandboxing failed
(code=killed, signal=KILL)           OOM killer or MemoryMax= limit hit
(code=killed, signal=ABRT)           Process aborted itself (crash)
Start request repeated too quickly   Hit StartLimitBurst -- too many restarts
Unit entered failed state            Check journalctl -u <unit> for root cause
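
The exit-status and signal rows can be turned into a lookup for use in scripts or alerts. The sketch below takes the token after status= or signal= as you would copy it from systemctl status. The function name explain_failure is hypothetical, and anything outside the table falls through to the journal.

```shell
# Hypothetical helper: map the "status=" or "signal=" token from
# `systemctl status` to the likely cause listed in the table.
explain_failure() {
    case "$1" in
        203|203/EXEC)      echo "Binary not found or not executable" ;;
        217|217/USER)      echo "User specified in User= doesn't exist" ;;
        226|226/NAMESPACE) echo "PrivateTmp=, ProtectSystem=, or other sandboxing failed" ;;
        KILL)              echo "OOM killer or MemoryMax= limit hit" ;;
        ABRT)              echo "Process aborted itself (crash)" ;;
        *)                 echo "Not in the table; check journalctl -u <unit>" ;;
    esac
}
```

For exit codes, the number can also be read with systemctl show <unit> -p ExecMainStatus --value and passed straight in.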

Heuristics

  1. Always check systemctl status before journalctl. The status output gives you the exit code, which narrows the problem space immediately.
  2. When debugging timer issues, check both the timer unit and the service unit. The timer triggers the service -- if the service fails, the timer still shows as active.
  3. systemctl mask is stronger than disable. Mask symlinks the unit to /dev/null. Use it to prevent a unit from ever starting, even as a dependency.
  4. systemctl list-units --failed is your "what's broken right now" command.
  5. Watch out for socket activation. If a .socket unit exists, the service may start on first connection, not at boot. systemctl stop myapp might not stop the socket -- the service will restart on next connection.
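
For the socket-activation case in item 5, stopping both units together avoids the surprise restart. The sketch assumes the socket unit shares the service's name, which is the default pairing but not guaranteed. The function name stop_service_and_socket is hypothetical.

```shell
# Hypothetical helper: stop a service together with its same-named
# socket unit, if systemd knows one.
# Assumption: the socket is called <name>.socket; a differently named
# socket unit will not be found by this check.
stop_service_and_socket() {
    local name="${1%.service}"
    if systemctl cat "${name}.socket" >/dev/null 2>&1; then
        # Stop the socket too so a new connection cannot start the service again.
        systemctl stop "${name}.socket" "${name}.service"
    else
        systemctl stop "${name}.service"
    fi
}
```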

Power One-Liners

See reboot history with timestamps

journalctl --list-boots

[!TIP] When to use: "When did this box last reboot?" — for correlating with incidents.


Quick Reference