
systemd Footguns

Mistakes that cause boot loops, masked services, and units that silently refuse to start.


1. Editing vendor unit files directly instead of using overrides

You edit /usr/lib/systemd/system/nginx.service to change the ExecStart command. Next package update overwrites your change. Nginx starts with the old config. You have no idea why your custom flag disappeared.

Why people do it: The vendor file is right there. It is the file that systemctl cat shows. Editing it feels direct and correct.

Fix: Never edit files in /usr/lib/systemd/system/. Use systemctl edit nginx to create a drop-in override at /etc/systemd/system/nginx.service.d/override.conf. To replace the entire unit, use systemctl edit --full nginx. Overrides survive package updates.

Under the hood: systemd reads unit files in precedence order: /etc/systemd/system/ (admin overrides, highest) > /run/systemd/system/ (runtime) > /usr/lib/systemd/system/ (vendor, lowest). Drop-in files in .d/ directories are merged on top of the base unit. systemctl cat nginx shows the effective configuration with all overrides applied.
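For example, a drop-in that swaps in a custom ExecStart might look like this (the flag and config path are illustrative):

```ini
# /etc/systemd/system/nginx.service.d/override.conf
# Created by `systemctl edit nginx`. Values here are merged over the vendor unit.
[Service]
# For non-oneshot services, clear ExecStart before replacing it --
# otherwise systemd treats the new line as a second ExecStart and rejects the unit.
ExecStart=
ExecStart=/usr/sbin/nginx -g 'daemon off;' -c /etc/nginx/custom.conf
```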


2. Forgetting daemon-reload after editing unit files

You create or modify a unit file. You run systemctl restart myapp. It restarts with the old configuration. You stare at the file, confirm your changes are there, restart again. Same result. systemd is still using the cached version of the unit file.

Why people do it: Every other config-file change takes effect on service restart. systemd is the exception -- it caches unit definitions and requires an explicit reload of the daemon.

Fix: After any unit file change: systemctl daemon-reload then systemctl restart <unit>. Make it muscle memory. If you are scripting deployments, always include daemon-reload before restart.


3. Masking a service and forgetting about it

You run systemctl mask nginx during an incident to prevent it from starting. The incident resolves. Six months later, someone tries to start nginx and it fails with "Unit nginx.service is masked." Nobody remembers masking it. Masking creates a symlink to /dev/null that survives reboots, re-enables, and package reinstalls.

Why people do it: Masking is the nuclear option to prevent a service from running. It works too well. Unlike disable, it cannot be overridden by dependencies.

Fix: Use disable instead of mask unless you have a very specific reason. If you do mask, document it in your runbook and set a calendar reminder. To find masked units: systemctl list-unit-files --state=masked. Unmask with systemctl unmask <unit>.

Debug clue: If systemctl start <service> fails with "Unit is masked," the unit file is a symlink to /dev/null. Check with ls -la /etc/systemd/system/<service>.service. Recent systemctl versions refuse to enable a masked unit outright; on older versions, enable can appear to succeed while the mask still takes precedence, which makes this easy to miss.


4. Using Type=simple for a service that forks

Your service forks a child process and the parent exits (like a traditional daemon). You set Type=simple. systemd considers the service "started" as soon as the parent process begins; then the parent exits, and systemd treats the service as finished (or "failed," if the parent exited non-zero) because its main process is gone. The forked child may still be running, but systemd no longer tracks it and may kill it along with the rest of the cgroup.

Why people do it: Type=simple is the default and works for most modern services. People do not check whether the binary forks or stays in foreground.

Fix: Match the Type to the service behavior. If the service stays in foreground: Type=simple. If it forks: Type=forking with a PIDFile=. If it notifies systemd when ready: Type=notify. Check the service documentation or run it manually to see if it forks.
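A sketch of a unit for a traditional forking daemon, assuming a hypothetical exampled binary that daemonizes and writes a PID file:

```ini
[Unit]
Description=Example traditional forking daemon

[Service]
Type=forking
# systemd reads the main PID from this file once the parent exits.
PIDFile=/run/exampled.pid
ExecStart=/usr/local/sbin/exampled --pidfile /run/exampled.pid

[Install]
WantedBy=multi-user.target
```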


5. Setting Restart=always without rate limiting

Your service crashes on startup due to a config error. systemd restarts it. It crashes again. Restarts again. In a tight loop, systemd spawns thousands of processes per minute, each writing crash logs, consuming CPU, and polluting journal storage.

Why people do it: Restart=always is the recommended resilience setting. Without rate limiting, it is also a fork bomb with extra steps.

Fix: Always pair Restart=always with RestartSec=5 (or higher) and keep the defaults for StartLimitIntervalSec and StartLimitBurst. The defaults (5 starts in 10 seconds) provide reasonable protection, but verify they are not overridden in your unit file.

Gotcha: When the start rate limit is hit, the unit enters a "failed" state and stops restarting. systemctl reset-failed <unit> clears this, but the underlying config error is still there. Some teams add StartLimitIntervalSec=0 to disable rate limiting entirely -- this turns a crash loop into an actual fork bomb. Never disable rate limiting.
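Put together, a resilient-but-bounded restart policy might look like this (the limits mirror the defaults and can be tuned):

```ini
[Unit]
# The rate limit lives in [Unit]: at most 5 start attempts per 10 seconds.
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
Restart=always
# Wait 5 seconds between restart attempts instead of respawning instantly.
RestartSec=5
```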


6. Using KillMode=none to "fix" stop behavior

Your service does not stop cleanly. You set KillMode=none so systemd stops trying to kill it. Now when you systemctl stop myapp, the old process keeps running. You systemctl start myapp and now two instances are running, fighting over the same port, file locks, or data.

Why people do it: The service has a slow shutdown sequence. systemd's default TimeoutStopSec=90s is not enough, and KillMode=control-group kills the process "too aggressively."

Fix: Increase TimeoutStopSec to give the service time to shut down gracefully. Use ExecStop= to send a custom shutdown command. Note that KillMode=none is deprecated in recent systemd releases precisely because of this failure mode; if you cannot avoid it, your ExecStop= must guarantee the old process is gone before systemd considers the stop complete.
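A sketch of the graceful-shutdown approach; the ExecStop command and timeout are assumptions for illustration:

```ini
[Service]
# Allow up to 5 minutes for shutdown instead of the 90-second default.
TimeoutStopSec=300
# Hypothetical controlled-shutdown command. If the timeout expires anyway,
# systemd still falls back to killing whatever remains in the cgroup.
ExecStop=/usr/local/bin/myapp-ctl shutdown --wait
```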


7. Putting a critical dependency in Wants= instead of Requires=

Your application needs the database to be running. You write Wants=postgresql.service. PostgreSQL fails to start. systemd starts your application anyway because Wants= is a soft dependency -- it tries to start the wanted unit but does not fail if it cannot. Your application crashes with "connection refused."

Why people do it: Wants= is recommended over Requires= in most documentation because hard dependencies create cascading failures. But for genuine dependencies (app needs database), soft deps hide the real problem.

Fix: Use Requires= and After= together for hard dependencies: the service will not start if the required unit fails, and After= ensures correct ordering. Reserve Wants= for optional enhancements (metrics collector, log shipper).
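For the database example, the dependency block could look like this (metrics-agent.service is an assumed optional unit):

```ini
[Unit]
# Hard dependency: do not start if PostgreSQL cannot start...
Requires=postgresql.service
# ...and wait until it has actually started before starting this unit.
After=postgresql.service
# Soft dependency: pull in the metrics agent if present, never fail on it.
Wants=metrics-agent.service
```

Requires= and After= are independent: Requires= without After= lets both units start in parallel, so the app can still race ahead of the database.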


8. Not checking journal storage limits

journald stores logs persistently in /var/log/journal/. You never configure size limits. Over months, journal storage grows to 30GB. The /var partition fills up, and services that need to write to /var start failing -- including journald itself, so you lose the ability to see what is happening.

Why people do it: journald manages its own storage with default limits (10% of filesystem, max 4GB). But these defaults assume a normally-sized /var. On a small partition, 10% is too much. On a large partition with verbose services, even 4GB fills up.

Fix: Set explicit limits in /etc/systemd/journald.conf: SystemMaxUse=2G and RuntimeMaxUse=500M. Run journalctl --disk-usage to check current size. Vacuum old entries: journalctl --vacuum-size=1G or --vacuum-time=7d.
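The corresponding journald.conf excerpt (restart systemd-journald afterwards for it to take effect):

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
# Cap persistent storage in /var/log/journal/.
SystemMaxUse=2G
# Cap volatile storage in /run/log/journal/.
RuntimeMaxUse=500M
```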


9. Creating circular dependencies with After= and Before=

Service A has After=B.service. Service B has After=A.service. Neither can start because each waits for the other. systemd detects this and breaks the cycle -- but the resolution is unpredictable. One service starts first, the other may or may not start, and the order changes between boots.

Why people do it: Complex applications have many interdependencies. When adding ordering constraints, people do not check the reverse direction. The cycle is only visible when you map the full dependency graph.

Fix: Map dependencies before adding ordering: systemd-analyze dot <unit> | dot -Tsvg > deps.svg. Use systemd-analyze verify <unit> to check for errors. If services truly have circular dependencies, break the cycle with socket activation or a startup script that handles initialization order.
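A minimal reproduction of an ordering cycle across two hypothetical units:

```ini
# /etc/systemd/system/a.service (excerpt)
[Unit]
After=b.service

# /etc/systemd/system/b.service (excerpt)
[Unit]
After=a.service
# At boot, systemd logs "Found ordering cycle" and breaks it by dropping
# one of the jobs -- which one is not something you should rely on.
```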


10. Running systemctl enable but not systemctl start

You deploy a new service. You run systemctl enable myapp. You check status: "inactive (dead)." You assume it will start on next boot and walk away. The service does not start until the next reboot. If something depends on it right now, it is missing.

Why people do it: enable sounds like it turns the service on. It does not -- it only creates the symlinks so the service starts at boot. start is the command that actually runs the service now.

Fix: Always systemctl enable --now <unit>, which enables and starts in one command. Or explicitly: systemctl enable myapp && systemctl start myapp. Verify with systemctl status myapp.

Remember:

- enable = start at boot (creates symlink in multi-user.target.wants/)
- start = start now
- enable --now = both
- disable = don't start at boot (removes symlink)
- stop = stop now
- mask = prevent starting by any means (symlink to /dev/null)

These are orthogonal axes: a service can be enabled but stopped, or disabled but running.