
systemd Service Design, Debugging, and Hardening

Scope

This document focuses on .service units as they are used in real servers:

  • service types
  • lifecycle state machine
  • restart behavior
  • notifications
  • watchdogs
  • timeouts
  • cleanup semantics
  • sandboxing/hardening
  • debugging broken units
  • writing units that do not behave like cursed shell wrappers

Reference anchors:

  • https://www.freedesktop.org/software/systemd/man/systemd.service.html
  • https://www.freedesktop.org/software/systemd/man/systemd.unit.html
  • https://www.freedesktop.org/software/systemd/man/systemd-analyze.html
  • https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html


Big Picture

A service unit is a contract between PID 1 and a workload.

The contract answers:

  • how to start it
  • when it is considered ready
  • how to stop it
  • how to restart it
  • what to do if it crashes
  • what resources and privileges it gets
  • how its logs are captured
  • how its children are tracked

If you cannot answer those cleanly, your unit file is probably bad.


Pick the Right Type=

This is where a lot of people step on landmines.

Type=simple

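The default when ExecStart= is set and neither Type= nor BusName= is given. systemd considers the service started as soon as the main process has been forked, even before the binary has been executed successfully.

Use for:

  • foreground daemons
  • most modern services
  • things that do not need a readiness protocol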

Type=exec

Like simple, but the start is considered successful only after execve() succeeds, so a missing binary or an invalid User= fails the start job instead of surfacing later. Often a better default than people realize.

Type=forking

For classic daemons that fork into the background. Legacy-heavy. Often used because of habit, not because it is correct.
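
If you are stuck with a daemon that insists on forking, a minimal sketch looks like the following. The binary name, its flags, and the paths are hypothetical; the point is that PIDFile= tells systemd which process is the main one once the original parent exits.

```ini
# legacyd.service (hypothetical legacy daemon; name, flags, and paths are assumptions)
[Unit]
Description=Legacy forking daemon

[Service]
Type=forking
# Creates /run/legacyd at start and removes it when the service stops.
RuntimeDirectory=legacyd
# Lets systemd identify the main process after the parent has exited.
PIDFile=/run/legacyd/legacyd.pid
ExecStart=/usr/sbin/legacyd --daemonize --pidfile /run/legacyd/legacyd.pid
```

If the daemon has a foreground flag, dropping Type=forking and PIDFile= in favor of Type=exec or Type=notify is usually the cleaner choice.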

Type=oneshot

Run a task and exit. Often paired with RemainAfterExit=yes when the effect, not the process, is the "state."
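
A sketch of the oneshot pattern, assuming two hypothetical helper scripts that apply and revert some setting. With RemainAfterExit=yes the unit stays active after ExecStart= exits, and ExecStop= runs when the unit is stopped.

```ini
# example-tuning.service (hypothetical name and script paths)
[Unit]
Description=Apply example tuning once at boot

[Service]
Type=oneshot
# The applied setting, not a running process, is the "state" of this unit.
RemainAfterExit=yes
ExecStart=/usr/local/sbin/example-apply-tuning
# Optional: runs on "systemctl stop" to undo the effect.
ExecStop=/usr/local/sbin/example-revert-tuning

[Install]
WantedBy=multi-user.target
```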

Type=notify

Service explicitly tells systemd when it is ready by sending READY=1 via sd_notify(). Best for services that need a real readiness point.

Type=dbus

Service readiness is tied to D-Bus name acquisition: the unit is considered started once the name given in BusName= appears on the bus.
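
A minimal sketch, assuming a hypothetical daemon that claims the bus name org.example.Daemon. Type=dbus requires BusName= to be set.

```ini
# example-daemon.service (hypothetical binary and bus name)
[Service]
Type=dbus
# Startup completes when this name is acquired on the bus.
BusName=org.example.Daemon
ExecStart=/usr/local/bin/example-daemon
```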

Rule of thumb: if the daemon can stay in foreground, do that. Do not daemonize twice like a clown car.


Readiness Is Not Process Existence

A spawned process is not the same thing as a ready service.

Examples:

  • database process exists but is still replaying logs
  • web server process exists but has not bound its sockets
  • worker exists but has not loaded its config

If readiness matters, use:

  • Type=notify
  • socket activation
  • or another explicit mechanism

Avoid pretending "pid exists" means "service healthy."
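
As a rough illustration of socket activation, the sketch below pairs a .socket unit with a service of the same name. The unit names and port are hypothetical, and it assumes the daemon can accept an already-bound listening socket passed by systemd (the sd_listen_fds() protocol) instead of binding its own.

```ini
# --- example-api.socket (hypothetical) ---
[Unit]
Description=Example API listening socket

[Socket]
# systemd binds the port itself, so clients can connect before the daemon is up.
ListenStream=8080

[Install]
WantedBy=sockets.target

# --- example-api.service (hypothetical; started on the first connection) ---
[Unit]
Description=Example API

[Service]
ExecStart=/usr/local/bin/example-api --config /etc/example-api/config.yaml
```

With this layout you enable the .socket unit rather than the service, and connections queue in the kernel while the daemon starts.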


Exec Directives

Common ones:

  • ExecStart=
  • ExecStartPre=
  • ExecStartPost=
  • ExecReload=
  • ExecStop=
  • ExecStopPost=

Guidelines:

  1. keep them short
  2. avoid giant inline shell pipelines
  3. prefer dedicated scripts if logic is nontrivial
  4. know that a failing ExecStartPre= command aborts startup before ExecStart= runs, unless the command is prefixed with -
  5. understand that start/stop are part of one service state machine, not random shell hooks
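
The guidelines above might look like this in practice. This is a sketch: the --check-config flag, the lock path, and the cleanup helper are hypothetical, not options of any real program.

```ini
# Fragment of a hypothetical example-api.service
[Service]
# A leading "-" means a non-zero exit from this command is ignored.
ExecStartPre=-/usr/bin/rm -f /run/example-api/stale.lock
# Without the "-", a failing pre-check aborts startup before ExecStart= runs.
ExecStartPre=/usr/local/bin/example-api --check-config /etc/example-api/config.yaml
ExecStart=/usr/local/bin/example-api --config /etc/example-api/config.yaml
# $MAINPID is set by systemd to the main process it tracks for this unit.
ExecReload=/bin/kill -HUP $MAINPID
# Runs once the service has stopped, including after a failed start.
ExecStopPost=/usr/local/bin/example-api-cleanup
```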

Restart Policy

Important knobs:

Restart=on-failure
RestartSec=5s
StartLimitIntervalSec=...
StartLimitBurst=...

Common values:

  • no
  • on-success
  • on-failure
  • on-abnormal
  • always

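Think through semantics carefully. Note that StartLimitIntervalSec= and StartLimitBurst= belong in the [Unit] section, while Restart= and RestartSec= go in [Service].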

Example:

  • Restart=always for a batch job is probably wrong
  • Restart=on-failure for a daemon is usually sane
  • combine with sane limits so crash loops do not melt the node
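
A sketch of a bounded restart policy. The numbers and the binary path are illustrative assumptions, not recommendations: with a 5 s delay between attempts, a persistently crashing service reaches five starts well inside the 60 s window, after which systemd stops retrying and the unit is left in a failed state.

```ini
# Hypothetical example-daemon.service
[Unit]
Description=Example daemon with bounded restarts
# Start-rate limiting lives in [Unit]: at most 5 starts per 60 s window.
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/example-daemon
# Restart on non-zero exit, fatal signal, timeout, or watchdog expiry.
Restart=on-failure
# Wait between attempts so a crash loop does not spin the CPU.
RestartSec=5s
```

Once the limit is hit, systemctl reset-failed clears the counter so the unit can be started again.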


Timeouts and Stop Semantics

Relevant knobs:

  • TimeoutStartSec=
  • TimeoutStopSec=
  • KillMode=
  • KillSignal=
  • SendSIGKILL=
  • FinalKillSignal=

Because systemd tracks the whole cgroup, stop behavior is far better than old PID-file theater.

KillMode=control-group (the default) is usually what you want: stop the service, not one random ancestor process while children survive like cockroaches.
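
How these knobs fit together, as a sketch with illustrative values and a hypothetical binary:

```ini
# Fragment of a hypothetical example-daemon.service
[Service]
ExecStart=/usr/local/bin/example-daemon
# Fail the start job if startup has not completed within 30 s.
TimeoutStartSec=30s
# On stop: send KillSignal=, wait up to TimeoutStopSec=, then escalate.
KillSignal=SIGTERM
TimeoutStopSec=20s
# Escalation sends SIGKILL to whatever is still running (yes is the default).
SendSIGKILL=yes
# Signal every process in the unit's cgroup, not just the main PID.
KillMode=control-group
```

Set TimeoutStopSec= to cover the longest graceful shutdown you actually expect; anything still running after that is killed.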


MainPID and Tracking

With cgroup-aware supervision, systemd can track the whole unit even if the original process exits and workers remain.

Still, MainPID matters for status and signal routing. Bad daemonization models and stale pidfiles are a recurring source of confusion.

This is another reason foreground-first designs are superior.


Environment and Credentials

Useful directives:

  • Environment=
  • EnvironmentFile=
  • WorkingDirectory=
  • User=
  • Group=
  • SupplementaryGroups=

Use these deliberately. Do not assume shell login environment semantics. Service environments are not your interactive shell.
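
A sketch of an explicit environment, assuming a hypothetical api user, an APP_ENV variable, an optional env file, and a supplementary group named ssl-cert (group names vary by distribution):

```ini
# Fragment of a hypothetical example-api.service
[Service]
User=api
Group=api
# Extra group membership, for example to read a shared TLS key.
SupplementaryGroups=ssl-cert
WorkingDirectory=/srv/api
# Set explicitly; nothing from ~/.profile or /etc/profile is sourced.
Environment=APP_ENV=production
# KEY=value lines; the leading "-" makes a missing file non-fatal.
EnvironmentFile=-/etc/example-api/env
ExecStart=/usr/local/bin/example-api --config /etc/example-api/config.yaml
```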


Sandboxing / Hardening

This is one of systemd's best server-side features.

Examples:

  • NoNewPrivileges=yes
  • PrivateTmp=yes
  • ProtectSystem=strict
  • ProtectHome=yes
  • PrivateDevices=yes
  • ProtectKernelTunables=yes
  • ProtectControlGroups=yes
  • MemoryDenyWriteExecute=yes
  • CapabilityBoundingSet=
  • AmbientCapabilities=
  • SystemCallFilter=
  • RestrictAddressFamilies=

Real point: you can shrink blast radius even if the daemon is compromised.

Use systemd-analyze security as a starting lens, not holy scripture.
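
One way to apply these is a drop-in next to the main unit. The sketch below assumes a hypothetical example-api.service that needs only TCP and UNIX sockets, no capabilities, and one writable state directory. Treat every value as a starting point to test, not a known-good profile for your daemon.

```ini
# /etc/systemd/system/example-api.service.d/hardening.conf (hypothetical drop-in)
[Service]
NoNewPrivileges=yes
PrivateTmp=yes
PrivateDevices=yes
ProtectSystem=strict
ProtectHome=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
# strict makes the file system read-only; this adds a writable /var/lib/example-api.
StateDirectory=example-api
# An empty assignment drops every capability from the bounding set.
CapabilityBoundingSet=
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# Allow-list of system calls typical for ordinary services.
SystemCallFilter=@system-service
# Breaks JIT-based runtimes; drop this line if the daemon uses one.
MemoryDenyWriteExecute=yes
```

A daemon that binds a port below 1024 as a non-root user would need CAP_NET_BIND_SERVICE in both CapabilityBoundingSet= and AmbientCapabilities= instead of the empty set. Re-run systemd-analyze security after each change and confirm that the service still starts.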


Watchdog

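A Type=notify service can periodically ping systemd's watchdog logic by sending WATCHDOG=1 via sd_notify() when WatchdogSec= is set. If the pings stop, PID 1 treats the service as failed and, given a suitable Restart= setting (for example on-failure, on-watchdog, or always), restarts it.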

That protects against some "hung but not dead" failures, which plain restart-on-exit cannot detect.
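
The unit side of a watchdog setup, as a sketch. The 30 s interval is illustrative, and it assumes the hypothetical example-api binary already sends READY=1 and periodic WATCHDOG=1 messages via sd_notify().

```ini
# Fragment of a hypothetical example-api.service
[Service]
Type=notify
ExecStart=/usr/local/bin/example-api --config /etc/example-api/config.yaml
# The service must send WATCHDOG=1 more often than this interval.
WatchdogSec=30s
# on-failure covers watchdog timeouts as well as crashes.
Restart=on-failure
# Only accept notification messages from the main process.
NotifyAccess=main
```

A common convention is for the daemon to ping at roughly half of WatchdogSec=; systemd passes the interval to the process in the WATCHDOG_USEC environment variable.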


Logging Model

stdout/stderr can go straight to journald. That means:

  • fewer pidfile-era logging hacks
  • structured metadata
  • per-unit logs
  • boot-scoped investigation

Typical commands:

journalctl -u myapp.service
journalctl -b -u myapp.service
journalctl -xeu myapp.service

Debugging a Failed Service

Use this workflow:

  1. systemctl status name.service
  2. journalctl -xeu name.service
  3. systemctl cat name.service
  4. systemctl show name.service
  5. inspect exit code, signal, timeout, readiness, permissions, paths, env
  6. if needed, run the underlying command manually as the target user/environment
  7. inspect cgroup/resource issues

Common actual causes:

  • wrong path
  • wrong user
  • missing runtime directory
  • daemon forks unexpectedly
  • service never signals readiness
  • SELinux/AppArmor policy issue
  • dependency graph wrong
  • start limit hit


Bad Patterns to Avoid

Giant bash -c in ExecStart=

You lose transparency, quoting gets cursed, and failure modes become muddy.

Type=forking by reflex

Legacy habit, not a design principle.

Backgrounding inside the service command

Do not fight the supervisor.

Using unit files as config-management junk drawers

Keep app config in app config where possible.

No hardening at all

A lot of services can safely lose privileges and capabilities.


Example of a Solid Modern Pattern

[Unit]
Description=Example API
After=network.target

[Service]
Type=notify
User=api
Group=api
WorkingDirectory=/srv/api
ExecStart=/usr/local/bin/example-api --config /etc/example-api/config.yaml
Restart=on-failure
RestartSec=3s
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes

[Install]
WantedBy=multi-user.target

This is usually better than daemonizing, pidfiles, wrapper scripts, and prayer.


Interview-Level Things to Explain

You should be able to explain:

  • why Type=notify exists
  • why Type=forking is legacy-heavy
  • how Restart= interacts with crash loops
  • how journald helps debugging
  • why cgroups make service supervision stronger
  • what hardening directives buy you
  • how to debug "service says active but app still broken"

Fast Mental Model

A good systemd service unit describes a workload's lifecycle, readiness, privileges, limits, and failure policy in a way PID 1 can supervise cleanly.
