Runbook: Systemd Service Crash Loop
| Field | Value |
|---|---|
| Domain | Linux |
| Alert | Service unit repeatedly restarting, or monitoring shows service down |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | SSH access to the node, sudo or root access |
Quick Assessment (30 seconds)
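# Run this first: it tells you the scope of the problem.
# Use ';' rather than '&&': systemctl status exits non-zero when the unit is not running.
systemctl status <SERVICE_NAME> --no-pager; journalctl -u <SERVICE_NAME> -n 50 --no-pager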
If output shows: active (running) with recent restarts → Service is currently up but unstable, check logs for the error pattern
If output shows: failed or activating repeatedly → Service is crash looping, continue from Step 1
Step 1: Check Service Status
Why: systemctl status gives you the current state, the last few log lines, the restart count, and the PID — all in one command. This is your starting point.
# Full status including resource usage and last log lines
systemctl status <SERVICE_NAME> -l
# List all failed units to check for related failures
systemctl list-units --state=failed
# Check how many times the service has restarted
systemctl show <SERVICE_NAME> --property=NRestarts
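# Check if the service hit the restart rate limit (StartLimitBurst)
# systemctl show reports the interval as StartLimitIntervalUSec; Result shows start-limit-hit when the limit was reached
systemctl show <SERVICE_NAME> --property=StartLimitBurst,StartLimitIntervalUSec,ActiveState,Result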
Expected output:
● myservice.service - My Application Service
Loaded: loaded (/etc/systemd/system/myservice.service; enabled)
Active: failed (Result: exit-code) since 2026-03-19 10:00:00 UTC
Process: 12345 ExecStart=/usr/bin/myapp (code=exited, status=1/FAILURE)
Main PID: 12345 (code=exited, status=1/FAILURE)
NRestarts=5
If systemctl status shows not-found, the service name is wrong or the unit file is missing. Check with systemctl list-unit-files | grep <PARTIAL_NAME>.
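The properties queried in this step can be folded into a single verdict. The following is a minimal sketch, assuming the input is the KEY=VALUE text printed by systemctl show; the function name classify_unit and the verdict strings are illustrative and not part of systemd.

```shell
#!/bin/sh
# classify_unit: read "KEY=VALUE" lines (as printed by `systemctl show`) on
# stdin and print a one-line verdict. Name and wording are hypothetical.
classify_unit() {
  awk -F= '
    $1 == "ActiveState" { active = $2 }
    $1 == "SubState"    { sub_state = $2 }
    $1 == "Result"      { result = $2 }
    $1 == "NRestarts"   { restarts = $2 + 0 }
    END {
      if (result == "start-limit-hit")
        print "rate-limited: fix the cause, then reset-failed (Step 6)"
      else if (active == "failed")
        print "failed: not restarting, read the logs (Step 2)"
      else if (active == "activating" && sub_state == "auto-restart")
        print "crash looping: waiting to restart, read the logs (Step 2)"
      else if (active == "active" && restarts > 0)
        print "running but unstable: " restarts " restarts so far"
      else
        print "state: " active
    }
  '
}
```

Possible usage: systemctl show <SERVICE_NAME> --property=ActiveState,SubState,Result,NRestarts | classify_unit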
Step 2: Read Recent Journal Logs for the Service
Why: The exit code and last few lines of output almost always contain the actual error. This is the most important step — do not skip it.
# Show the last 100 lines of logs for the service
journalctl -u <SERVICE_NAME> -n 100 --no-pager
# Show logs from the last 10 minutes (adjust the window to cover the most recent restart)
journalctl -u <SERVICE_NAME> --since "10 minutes ago" --no-pager
# Show logs in reverse order (most recent first) to find the crash quickly
journalctl -u <SERVICE_NAME> -r -n 50 --no-pager
# If the service logs to a separate file
sudo tail -100 /var/log/<SERVICE_LOG_FILE>
Expected output:
Mar 19 10:00:05 myhost myapp[12345]: Starting application...
Mar 19 10:00:06 myhost myapp[12345]: ERROR: Failed to connect to database: connection refused
Mar 19 10:00:06 myhost myapp[12345]: Fatal error, exiting.
Mar 19 10:00:06 myhost systemd[1]: myservice.service: Main process exited, code=exited, status=1/FAILURE
If no logs appear, check the StandardOutput= and StandardError= settings in the unit file: systemctl cat <SERVICE_NAME>.
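When a service has been crash looping for a while, the same error repeats once per restart, and counting duplicates makes the pattern easier to see. This is a rough sketch, assuming the default short journal format shown above (month, day, time, host, then the message); the function name top_errors and the keyword list are illustrative choices.

```shell
#!/bin/sh
# top_errors: read journal lines on stdin, keep lines that look like errors,
# strip the "Mon DD HH:MM:SS host " prefix, replace the PID with a fixed
# token so repeated crashes group together, then count duplicates.
top_errors() {
  grep -iE 'error|fatal|denied|refused' |
    sed -E -e 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' \
           -e 's/\[[0-9]+\]/[PID]/' |
    sort | uniq -c | sort -rn | head -n 5
}
```

Possible usage: journalctl -u <SERVICE_NAME> --since "1 hour ago" --no-pager | top_errors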
Step 3: Check Exit Code and Signal
Why: The exit code and termination signal tell you the category of failure: config error (exit 1), segfault (signal 11), OOM kill (signal 9), or permission denied (exit 126).
# Check the exit code from the most recent run
systemctl show <SERVICE_NAME> --property=ExecMainStatus
# Check if it was killed by a signal
journalctl -u <SERVICE_NAME> -n 10 | grep -i "signal\|killed\|exit-code"
# Decode common exit codes
# exit 1: General error (check logs)
# exit 2: Misuse of shell built-in (often script error)
# exit 126: Permission denied or not executable
# exit 127: Command not found (binary missing or wrong path)
# exit 130: SIGINT (Ctrl-C equivalent)
# exit 137: SIGKILL (OOM or manual kill)
# exit 139: SIGSEGV (segmentation fault — binary bug)
# exit 143: SIGTERM (graceful termination requested)
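If the exit code is 127 (or systemctl status reports status=203/EXEC), the path in ExecStart= is wrong or the binary does not exist. Verify: ls -lh $(systemctl show <SERVICE_NAME> --property=ExecStart --value | awk '{sub(/^path=/, "", $2); print $2}').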
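The table of exit codes above follows the shell convention that statuses above 128 mean "128 plus the signal number". A small sketch of that decoding is below; the function name decode_exit_status is hypothetical, and only the standard signals 1 to 31 are handled. Note that systemd itself reports signal deaths as code=killed, status=9/KILL, so the 128+N form mostly appears when the service is started through a wrapper shell.

```shell
#!/bin/sh
# decode_exit_status: print a short description for a process exit status.
# Statuses 129-159 are treated as "128 + signal number" (standard signals only).
decode_exit_status() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    1)   echo "general error (check logs)" ;;
    126) echo "permission denied or not executable" ;;
    127) echo "command not found" ;;
    203) echo "systemd could not execute the binary (203/EXEC)" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 159 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
      else
        echo "application-specific exit status $code"
      fi
      ;;
  esac
}
```

Possible usage: decode_exit_status "$(systemctl show <SERVICE_NAME> --property=ExecMainStatus --value)"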
Step 4: Try Manual Start to Reproduce the Error
Why: Running the command manually (outside of systemd) removes systemd's environment isolation and often produces more visible error output, especially for config and permission errors.
# Get the actual command that systemd runs
systemctl cat <SERVICE_NAME> | grep ExecStart
# Try running it manually as the service user
sudo -u <SERVICE_USER> <EXACT_EXECSTART_COMMAND>
# Or run it directly with more verbose output
sudo -u <SERVICE_USER> <BINARY_PATH> --verbose 2>&1
# Check if the working directory and environment are correct
systemctl show <SERVICE_NAME> --property=WorkingDirectory,User,Group,Environment
sudo -u <SERVICE_USER> env <ENVIRONMENT_VARIABLES> <BINARY_PATH>
# The error should now be visible in your terminal, e.g.:
Error: config file /etc/myapp/config.yaml: no such file or directory
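If the manual run succeeds but the service still fails under systemd, check for security policy blocks (for example SELinux or AppArmor denials): journalctl -u <SERVICE_NAME> | grep -iE 'avc|denied'.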
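The grep on systemctl cat output can return several ExecStart= lines when drop-in overrides exist, and the value may carry a prefix such as "-" that is not part of the command. The sketch below picks the effective line under simplifying assumptions: the last non-empty ExecStart= wins, an empty ExecStart= resets earlier ones, and line continuations are not handled. The function name extract_execstart is hypothetical.

```shell
#!/bin/sh
# extract_execstart: read unit file text (e.g. from `systemctl cat`) on stdin
# and print the command from the last effective ExecStart= line.
extract_execstart() {
  awk '
    /^ExecStart=/ {
      val = substr($0, length("ExecStart=") + 1)
      if (val == "") { cmd = ""; next }   # empty assignment resets the list
      sub(/^[-@:+!]+/, "", val)           # strip special executable prefixes
      cmd = val
    }
    END { if (cmd != "") print cmd }
  '
}
```

Possible usage: systemctl cat <SERVICE_NAME> | extract_execstart, then paste the result after sudo -u <SERVICE_USER>.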
Step 5: Fix the Config or Binary Issue
Why: The fix depends entirely on what Step 4 revealed. Common fixes are documented here for the most frequent failure modes.
# Missing config file — restore or recreate it
sudo cp /etc/myapp/config.yaml.bak /etc/myapp/config.yaml
sudo systemctl daemon-reload # Only needed if unit file changed
# Permission denied on binary or config
sudo chmod +x <BINARY_PATH>
sudo chown <SERVICE_USER>:<SERVICE_GROUP> <CONFIG_PATH>
# Missing dependency (database, other service)
# Check if the dependency is running:
systemctl status <DEPENDENCY_SERVICE>
# Fix the unit file to wait for the dependency:
sudo systemctl edit <SERVICE_NAME>
# Add: [Unit]
# After=<DEPENDENCY_SERVICE>.service
# Requires=<DEPENDENCY_SERVICE>.service
# Wrong environment variable
sudo systemctl edit <SERVICE_NAME>
# Add: [Service]
# Environment="MY_VAR=correct_value"
# EnvironmentFile=/etc/myapp/env
sudo systemctl daemon-reload
If the binary itself is missing or corrupted, reinstall the package with sudo apt-get install --reinstall <PACKAGE_NAME>, or re-pull it from your artifact repository.
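systemctl edit stores its additions in a drop-in file instead of modifying the original unit. For reference, here is a sketch of writing the same dependency override by hand, assuming the standard drop-in location /etc/systemd/system/<SERVICE_NAME>.service.d/override.conf; the function name write_dependency_dropin is hypothetical.

```shell
#!/bin/sh
# write_dependency_dropin: create override.conf in the given drop-in directory
# so the unit is ordered after, and requires, the named dependency service.
# Arguments: <drop-in directory> <dependency name without .service>
write_dependency_dropin() {
  dropin_dir=$1
  dependency=$2
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/override.conf" <<EOF
[Unit]
After=${dependency}.service
Requires=${dependency}.service
EOF
}
```

When run as root against the real drop-in directory, follow it with systemctl daemon-reload, because a file written this way is not picked up automatically.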
Step 6: Clear Failed State and Restart
Why: When a service exceeds its restart rate limit (more than StartLimitBurst starts within StartLimitIntervalSec), systemd marks it failed and stops restarting it automatically, even after you fix the underlying problem. Reset the failed state manually before starting it again.
# Check if service is in failed state due to restart rate limit
systemctl is-failed <SERVICE_NAME>
# Reset the failed state
sudo systemctl reset-failed <SERVICE_NAME>
# Now start the service
sudo systemctl start <SERVICE_NAME>
# Watch the service status for 30 seconds to confirm it stays up
watch -n 2 systemctl status <SERVICE_NAME>
# Enable the service to start on boot if not already enabled
sudo systemctl enable <SERVICE_NAME>
Expected output:
● myservice.service - My Application Service
Active: active (running) since 2026-03-19 10:15:00 UTC; 30s ago
Main PID: 13000 (myapp)
If the service crashes again after systemctl start, return to Step 4: the root cause was not fully resolved. Do not loop systemctl restart without reading logs each time.
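Watching the status by eye works, but a scripted check gives a clear pass or fail. This is a minimal sketch, assuming the check command prints the word active when the service is healthy, as systemctl is-active does; the function name stays_active and its messages are illustrative.

```shell
#!/bin/sh
# stays_active: run a state-check command repeatedly and succeed only if it
# prints "active" every time.
# Arguments: <number of checks> <seconds between checks> <command...>
stays_active() {
  checks=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    # The check command may exit non-zero when the unit is down; only its
    # printed state matters here.
    state=$("$@" 2>/dev/null) || true
    if [ "$state" != "active" ]; then
      echo "not stable: state=$state after $i checks"
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "stable: active for $checks consecutive checks"
}
```

Possible usage, covering roughly 30 seconds: stays_active 15 2 systemctl is-active <SERVICE_NAME>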
Verification
# Confirm the issue is resolved
systemctl is-active <SERVICE_NAME> && journalctl -u <SERVICE_NAME> -n 20 --no-pager
Success looks like: systemctl is-active returns active, the journal shows a clean startup with no errors, and the monitoring alert has cleared.
If still broken: Escalate — see below.
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Application team on-call | "Service |
| Data loss suspected | Application team lead | "Service |
| Scope expanding to multiple nodes | SRE lead | "Service |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes
- Using systemctl restart without clearing failed state first: If the service has hit its StartLimitBurst, systemctl restart will not bring it back, because systemd refuses to start a unit that is rate limited. Run systemctl reset-failed <SERVICE_NAME> first, then start the service.
- Not reading the actual error in journalctl: It is tempting to jump straight to systemctl restart without checking logs. This wastes time (the service crashes again) and loses context. Always read the logs first; the error is almost always there.
- Reloading systemd daemon unnecessarily: systemctl daemon-reload is only needed when a unit file on disk has changed. Running it while debugging a service whose unit file has not changed has no effect and wastes time. Run it once after editing a unit file, not after every failed restart attempt.
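The first mistake can be avoided by always checking for the failed state before starting. Here is a sketch of that order of operations; the function name safe_start and the SYSTEMCTL override variable are hypothetical, the latter added only so the logic can be exercised without touching a real service.

```shell
#!/bin/sh
# safe_start: clear a failed state (if any) before starting the unit.
# SYSTEMCTL may be set to a stand-in command for dry runs; it defaults to
# the real systemctl.
safe_start() {
  unit=$1
  run=${SYSTEMCTL:-systemctl}
  # is-failed exits 0 when the unit is in the failed state
  if "$run" is-failed --quiet "$unit"; then
    "$run" reset-failed "$unit" || return 1
  fi
  "$run" start "$unit"
}
```

Possible usage as root: safe_start <SERVICE_NAME>. Read the journal first as described in Step 2; this helper only fixes the ordering of reset-failed and start.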
Cross-References
- Topic Pack: Systemd and Service Management (deep background)
- Related Runbook: Zombie Processes Accumulating