
Linux Ops: systemd - Primer

Why This Matters

systemd is PID 1 on most major Linux distributions, including Debian, Ubuntu, Fedora, RHEL, SUSE, and Arch. It controls which services start, in what order, how they restart on failure, and how their resources are limited.

Fun fact: systemd was created by Lennart Poettering and Kay Sievers at Red Hat and first released in 2010. It replaced SysV-style init (a design dating back to UNIX System V in 1983) and Upstart (2006). The name is intentionally lowercase: "system daemon." It was controversial because it replaced simple shell scripts with a complex binary system, but it won out because of parallel boot, dependency resolution, and cgroup integration. By 2015, most major distros had adopted it as their default init.

Most engineers know systemctl start and stop. But production incidents demand more: reading structured logs, understanding dependency chains that cause cascade failures, writing custom unit files, and setting resource limits to prevent runaway processes.

Core Concepts

1. Unit Files

Everything in systemd is a unit. Main types:

Type	Purpose	Example
service	Daemons and processes	nginx.service
timer	Scheduled execution	backup.timer
socket	Socket activation	cups.socket
target	Grouping/ordering	multi-user.target

Unit file locations (highest priority first):

/etc/systemd/system/       # Admin overrides
/run/systemd/system/       # Runtime (transient)
/usr/lib/systemd/system/   # Vendor defaults

Never edit vendor files. Use systemctl edit <unit> for drop-in overrides.

Remember: Unit file location priority: "ERC" — /Etc (admin overrides) > /Run (runtime transient) > /usr/lib (Core vendor). When you systemctl edit nginx, it creates a drop-in at /etc/systemd/system/nginx.service.d/override.conf. This survives package upgrades because vendor updates only touch /usr/lib/. If you edit vendor files directly, your changes are overwritten on the next package update.
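
To make the drop-in layout concrete, here is a minimal sketch of what systemctl edit effectively produces on disk. The write_dropin function and its ROOT argument are hypothetical names invented for this example (not part of systemd); ROOT exists only so you can rehearse in a scratch directory before touching /etc.

```shell
# write_dropin is a hypothetical helper, not a systemd tool.
# Usage: write_dropin ROOT UNIT < fragment
# Writes stdin to ROOT/etc/systemd/system/UNIT.d/override.conf and prints
# the resulting path. ROOT is "" on a real host.
write_dropin() (
  root=$1
  unit=$2
  dir="$root/etc/systemd/system/$unit.d"
  mkdir -p "$dir" || exit 1
  cat > "$dir/override.conf" || exit 1
  printf '%s\n' "$dir/override.conf"
)
```

On a real host you would run it as root with an empty ROOT, then run systemctl daemon-reload and systemctl cat nginx.service to confirm the drop-in is merged with the vendor unit.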

2. Essential systemctl Commands

systemctl start|stop|restart|reload nginx
systemctl enable --now nginx   # Start + boot persist
systemctl status nginx         # State + recent logs
systemctl is-active nginx      # Quick health check
systemctl list-units --failed  # All failed units
systemctl list-timers          # Active timers
systemctl daemon-reload        # After editing units
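
The is-active check is easy to script. The following is a small sketch, assuming you pass the unit names you care about as arguments; inactive_units is a made-up function name, and only systemctl is-active --quiet comes from systemd itself.

```shell
# inactive_units is a hypothetical helper.
# Prints every unit from its arguments that is not active.
# Exit status is 0 when all units are active, 1 otherwise.
inactive_units() (
  rc=0
  for unit in "$@"; do
    # is-active --quiet prints nothing and reports state via exit status
    if ! systemctl is-active --quiet "$unit"; then
      printf '%s\n' "$unit"
      rc=1
    fi
  done
  exit "$rc"
)
```

A call such as inactive_units nginx.service postgresql.service fits in a cron job, a deploy script, or a monitoring probe, since the exit status alone says whether anything is down.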

3. journalctl Log Queries

journalctl -u nginx -f          # Follow logs
journalctl -u nginx -n 100      # Last 100 lines
journalctl -u nginx -b          # Current boot only
journalctl -u nginx -p err      # Error+ priority
journalctl -u nginx \
  --since "2024-01-15 10:00" \
  --until "2024-01-15 11:00"
journalctl -k                   # Kernel messages
journalctl -o json --no-pager   # JSON for scripting
journalctl --disk-usage         # Log disk usage
journalctl --vacuum-time=7d     # Prune old logs
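
The "Error+" shorthand for -p err means "this level and everything more severe". As an illustration only, this sketch lists which syslog levels a given -p value covers; included_priorities is a hypothetical function that does not call journalctl at all.

```shell
# included_priorities is a hypothetical helper for illustration only.
# Prints the syslog levels that "journalctl -p LEVEL" includes:
# LEVEL itself plus every more severe (lower-numbered) level.
included_priorities() (
  n=0
  out=""
  for name in emerg alert crit err warning notice info debug; do
    out="$out$n=$name "
    if [ "$name" = "$1" ]; then
      printf '%s\n' "${out% }"
      exit 0
    fi
    n=$((n + 1))
  done
  echo "unknown level: $1" >&2
  exit 1
)
```

So -p err covers levels 0 through 3, while -p warning would add level 4 on top.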

4. Service Dependencies

[Unit]
After=network-online.target postgresql.service
Requires=postgresql.service
Wants=redis.service

Directive	Meaning
After=	Start ordering (not dependency)
Requires=	Hard dep: if it fails, we fail
Wants=	Soft dep: if it fails, we continue
BindsTo=	Like Requires + stop when it stops

After= is ordering only. Requires= is dependency only. You almost always need both together.

Gotcha: Requires=postgresql.service without After=postgresql.service starts both units simultaneously. Your app may start before PostgreSQL is ready, crash, and enter a restart loop. Always pair Requires= with After= for services that have startup-order dependencies. The Wants= + After= combo is preferred for soft dependencies where the dependency failing should not take down your service.

systemd-analyze critical-chain nginx.service   # Which units delayed nginx at boot
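
One way to avoid forgetting the After= half of the pair is to generate both lines together. This is a sketch under that assumption; dep_fragment and its hard/soft keywords are invented for this example.

```shell
# dep_fragment is a hypothetical helper. KIND is "hard" or "soft".
# It always emits After= next to the dependency line, so the ordering
# half of the pair cannot be left out by accident.
dep_fragment() (
  kind=$1
  dep=$2
  case $kind in
    hard) directive=Requires ;;
    soft) directive=Wants ;;
    *) echo "kind must be hard or soft" >&2; exit 1 ;;
  esac
  printf '[Unit]\n%s=%s\nAfter=%s\n' "$directive" "$dep" "$dep"
)
```

Redirect the output into a drop-in such as /etc/systemd/system/myapp.service.d/deps.conf, run systemctl daemon-reload, and check the result with systemctl list-dependencies myapp.service.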

5. Restart Policies

# Start-rate limits belong in [Unit], not [Service]
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5

Restart=	When it restarts
no	Never (default)
always	Regardless of exit code
on-failure	Non-zero exit, signal, timeout, or watchdog
on-abnormal	Signal, timeout, or watchdog

StartLimitBurst= and StartLimitIntervalSec= cap crash loops: with the values above, after 5 starts within 300 seconds systemd stops retrying and leaves the unit in the failed state.
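
Whether the limit ever trips depends on how far apart the restarts are. The arithmetic can be sketched as below; it is a simplified model, not systemd's exact accounting. It assumes starts are evenly spaced GAP seconds apart (RestartSec= plus however long the process survives), and limit_trips is a hypothetical name.

```shell
# limit_trips is a hypothetical helper implementing a simplified model:
# starts happen every GAP seconds, and the limit trips when BURST starts
# fit inside INTERVAL seconds.
# Usage: limit_trips GAP BURST INTERVAL
limit_trips() {
  if [ $(( $1 * $2 )) -le "$3" ]; then
    echo "gives up"
  else
    echo "loops forever"
  fi
}
```

With an instant crash, RestartSec=5, StartLimitBurst=5, and StartLimitIntervalSec=300, limit_trips 5 5 300 reports that systemd gives up. Against the default limit of 5 starts in 10 seconds, limit_trips 5 5 10 reports an endless loop, because five starts take 25 seconds and never fit in the window.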

6. Resource Limits via cgroups

Under the hood: systemd uses Linux cgroups (control groups) v2 to enforce resource limits. Each service runs in its own cgroup at /sys/fs/cgroup/system.slice/<service>.service/. MemoryMax sets a hard limit — the OOM killer fires when exceeded. MemoryHigh is a soft limit — the kernel aggressively reclaims memory but does not kill the process. Use MemoryHigh as an early warning and MemoryMax as the kill fence.

[Service]
CPUQuota=50%
MemoryMax=512M
MemoryHigh=384M
TasksMax=512
# LimitNOFILE= is a per-process rlimit, not a cgroup control
LimitNOFILE=65535

systemctl show nginx -p MemoryCurrent
systemd-cgtop  # Live cgroup monitor
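
The same numbers can be read straight from the cgroup v2 files. This is a sketch that assumes the unit lives under system.slice; mem_usage and its CGROOT argument are hypothetical, while memory.current and memory.max are the kernel's cgroup v2 file names.

```shell
# mem_usage is a hypothetical helper. CGROOT is normally /sys/fs/cgroup.
# Usage: mem_usage CGROOT UNIT
mem_usage() (
  dir="$1/system.slice/$2"
  current=$(cat "$dir/memory.current") || exit 1
  max=$(cat "$dir/memory.max") || exit 1
  # The kernel writes the literal word "max" when no hard limit is set
  if [ "$max" = "max" ]; then
    printf '%s bytes used, no MemoryMax set\n' "$current"
  else
    printf '%s of %s bytes (%s%%)\n' "$current" "$max" $(( current * 100 / max ))
  fi
)
```

For example, mem_usage /sys/fs/cgroup nginx.service shows how close nginx is to its kill fence.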

7. Timer Units vs Cron

Service unit (the job):

# /etc/systemd/system/backup.service
[Unit]
Description=Database Backup
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

Timer unit (the schedule):

# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup daily at 2am
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target

Advantages over cron: journalctl logging, Persistent=true runs missed jobs, randomized delay prevents thundering herd, resource limits apply.

Remember: Timer vs cron advantages mnemonic: "PLRR": Persistent (runs missed jobs), Logging (journalctl built-in), Randomized delay (no thundering herd), Resource limits (cgroup controls). Plain cron offers none of these out of the box. For new scheduled jobs on systemd hosts, prefer timers.
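
When migrating existing crontab entries, the schedule syntax is the main thing that changes. Here is a minimal sketch of the mapping for the simplest case only, a daily job written as "MIN HOUR * * *"; cron_to_oncalendar is a hypothetical function and deliberately rejects anything more complex.

```shell
# cron_to_oncalendar is a hypothetical helper. It handles only the form
# "MIN HOUR * * *" with one- or two-digit numbers and rejects the rest.
cron_to_oncalendar() (
  set -f                      # stop the shell from expanding the * fields
  set -- $1                   # split the cron line into its fields
  if [ $# -ne 5 ] || [ "$3 $4 $5" != "* * *" ]; then
    echo "unsupported cron expression" >&2; exit 1
  fi
  for field in "$1" "$2"; do
    case $field in
      [0-9]|[0-9][0-9]) ;;
      *) echo "unsupported cron expression" >&2; exit 1 ;;
    esac
  done
  min=${1#0}; hour=${2#0}     # drop a leading zero so printf does not read octal
  min=${min:-0}; hour=${hour:-0}
  if [ "$min" -gt 59 ] || [ "$hour" -gt 23 ]; then
    echo "field out of range" >&2; exit 1
  fi
  printf '*-*-* %02d:%02d:00\n' "$hour" "$min"
)
```

Before putting a result into OnCalendar=, it is worth checking it with systemd-analyze calendar, which prints the next time the expression would fire.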

8. Creating Custom Service Units

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application Server
After=network-online.target
Wants=network-online.target

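[Service]
# Type=notify requires the app to call sd_notify(READY=1); otherwise use Type=exec or Type=simple
Type=notify
ExecStart=/usr/local/bin/myapp --config /etc/myapp.conf
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
User=myapp
Group=myapp
WorkingDirectory=/var/lib/myapp
EnvironmentFile=/etc/myapp/env
MemoryMax=1G
# ProtectSystem=strict mounts the whole file system read-only, so re-allow the data directory
ProtectSystem=strict
ReadWritePaths=/var/lib/myapp
NoNewPrivileges=true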

[Install]
WantedBy=multi-user.target
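
Once the unit file is written, installing it is a fixed three-step sequence: copy, reload, enable. That sequence might be sketched as follows; install_unit and the UNIT_DIR variable are hypothetical names, and UNIT_DIR exists only so the steps can be rehearsed outside /etc.

```shell
# install_unit is a hypothetical helper. UNIT_DIR defaults to the admin
# override directory; point it elsewhere to rehearse without root.
# Usage: install_unit path/to/myapp.service
install_unit() (
  src=$1
  name=$(basename "$src")
  cp "$src" "${UNIT_DIR:-/etc/systemd/system}/$name" || exit 1
  systemctl daemon-reload || exit 1     # make systemd re-read unit files
  systemctl enable --now "$name"        # start now and on every boot
)
```

Running systemd-analyze verify on the file before installing catches syntax errors early, and systemctl status myapp afterwards confirms the service came up.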

9. Debugging Failed Services

Debug clue: When a service fails and journalctl -u myapp shows nothing useful, try running the exact ExecStart command manually as the service user: sudo -u myapp /usr/local/bin/myapp --config /etc/myapp.conf. Services often fail because of permission issues, missing environment variables, or working directory problems that are invisible in the journal but obvious when run interactively.

systemctl status myapp            # State + logs
journalctl -u myapp -b            # Full boot logs
systemctl show myapp -p NRestarts # Crash loop?
systemd-analyze verify myapp.service  # Syntax check
# Run manually as the service user:
sudo -u myapp /usr/local/bin/myapp --config /etc/myapp.conf
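
A manual run is only a fair test if it sees the same environment as the service. This sketch loads an EnvironmentFile= before running a command, assuming the file holds only plain KEY=value lines (the subset that systemd and sh read the same way); run_with_envfile is a hypothetical name.

```shell
# run_with_envfile is a hypothetical helper. It assumes ENVFILE holds
# plain KEY=value lines and is given as a path containing a slash.
# Usage: run_with_envfile ENVFILE command [args...]
run_with_envfile() (
  envfile=$1
  shift
  set -a                  # export every variable assigned while sourcing
  . "$envfile" || exit 1
  set +a
  exec "$@"               # replace the subshell with the real command
)
```

On a real host you would combine this with sudo -u myapp and a cd into /var/lib/myapp, so that the user, environment, and working directory all match the unit.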

What Experienced People Know

systemctl daemon-reload after any manual unit file edit. Forgetting it is the most common mistake.
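Restart=always can loop forever when RestartSec= spaces restarts too far apart to hit the default start limit (5 starts in 10 seconds). Set StartLimitBurst= and StartLimitIntervalSec= explicitly so crash loops stop instead of flooding logs.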
Type=forking is legacy. Prefer simple or notify for new services.
Use ProtectSystem=strict and NoNewPrivileges=true for free security hardening.
systemd-analyze blame shows slow-starting services. Essential for boot optimization.
Check TimeoutStopSec= if a service will not stop. Default is 90 seconds.
ExecStartPre= runs before the main process. Use it for config validation or directory creation.
Set SystemMaxUse= in /etc/systemd/journald.conf to prevent /var/log from filling your disk.