
Runbook: Systemd Service Crash Loop

Domain: Linux
Alert: Service unit repeatedly restarting, or monitoring shows service down
Severity: P2
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 30 minutes (page if not resolved)
Last Tested: 2026-03-19
Prerequisites: SSH access to the node, sudo or root access

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
systemctl status <SERVICE_NAME> && journalctl -u <SERVICE_NAME> -n 50
  • If output shows active (running) with recent restarts → the service is currently up but unstable; check the logs for the error pattern.
  • If output shows failed, or activating repeatedly → the service is crash looping; continue from Step 1.

Step 1: Check Service Status

Why: systemctl status gives you the current state, the last few log lines, the restart count, and the PID — all in one command. This is your starting point.

# Full status including resource usage and last log lines
systemctl status <SERVICE_NAME> -l

# List all failed units to check for related failures
systemctl list-units --state=failed

# Check how many times the service has restarted
systemctl show <SERVICE_NAME> --property=NRestarts

# Check if the service hit the restart rate limit (StartLimitBurst)
systemctl show <SERVICE_NAME> --property=StartLimitBurst,StartLimitIntervalSec,ActiveState
Expected output:
 myservice.service - My Application Service
     Loaded: loaded (/etc/systemd/system/myservice.service; enabled)
     Active: failed (Result: exit-code) since 2026-03-19 10:00:00 UTC
    Process: 12345 ExecStart=/usr/bin/myapp (code=exited, status=1/FAILURE)
   Main PID: 12345 (code=exited, status=1/FAILURE)
NRestarts=5
If this fails: If systemctl status shows not-found, the service name is wrong or the unit file is missing. Check with systemctl list-unit-files | grep <PARTIAL_NAME> (systemctl list-units only shows loaded units, so a missing unit will not appear there).

Step 2: Read Recent Journal Logs for the Service

Why: The exit code and last few lines of output almost always contain the actual error. This is the most important step — do not skip it.

# Show the last 100 lines of logs for the service
journalctl -u <SERVICE_NAME> -n 100 --no-pager

# Show logs since the last restart
journalctl -u <SERVICE_NAME> --since "10 minutes ago" --no-pager

# Show logs in reverse order (most recent first) to find the crash quickly
journalctl -u <SERVICE_NAME> -r -n 50 --no-pager

# If the service logs to a separate file
sudo tail -100 /var/log/<SERVICE_LOG_FILE>
Expected output:
Mar 19 10:00:05 myhost myapp[12345]: Starting application...
Mar 19 10:00:06 myhost myapp[12345]: ERROR: Failed to connect to database: connection refused
Mar 19 10:00:06 myhost myapp[12345]: Fatal error, exiting.
Mar 19 10:00:06 myhost systemd[1]: myservice.service: Main process exited, code=exited, status=1/FAILURE
If this fails: If logs show nothing (empty output), the service may be logging to stdout but systemd is not capturing it. Check StandardOutput= and StandardError= settings in the unit file: systemctl cat <SERVICE_NAME>.
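
When the journal is hundreds of lines deep, a small filter helps surface the fatal line quickly. A minimal sketch (the helper name last_error and the pattern list are ours; extend the patterns with your application's own error markers):

```shell
# Sketch: print the last fatal-looking line from a log stream.
# The pattern list is an assumption; add your app's own error markers.
last_error() {
    grep -iE 'error|fatal|panic|exception' | tail -n 1
}

# On a real host:  journalctl -u <SERVICE_NAME> -n 200 --no-pager | last_error
printf '%s\n' \
    'Mar 19 10:00:05 myhost myapp[12345]: Starting application...' \
    'Mar 19 10:00:06 myhost myapp[12345]: ERROR: Failed to connect to database' \
    | last_error
# prints the ERROR line
```

The output is a one-liner you can paste directly into the incident channel.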

Step 3: Check Exit Code and Signal

Why: The exit code and termination signal tell you the category of failure: config error (exit 1), segfault (signal 11, exit 139), OOM kill (signal 9, exit 137), or permission denied (exit 126).

# Check the exit code from the most recent run
systemctl show <SERVICE_NAME> --property=ExecMainStatus

# Check if it was killed by a signal
journalctl -u <SERVICE_NAME> -n 10 | grep -i "signal\|killed\|exit-code"

# Decode common exit codes
# exit 1:  General error (check logs)
# exit 2:  Misuse of shell built-in (often script error)
# exit 126: Permission denied or not executable
# exit 127: Command not found (binary missing or wrong path)
# exit 130: SIGINT (Ctrl-C equivalent)
# exit 137: SIGKILL (OOM or manual kill)
# exit 139: SIGSEGV (segmentation fault — binary bug)
# exit 143: SIGTERM (graceful termination requested)
Expected output:
ExecMainStatus=1
If this fails: Exit code 127 means the binary path in ExecStart= is wrong or the binary does not exist. Find the path with systemctl cat <SERVICE_NAME> | grep ExecStart, then verify it with ls -lh <BINARY_PATH>.
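
The 137/139/143 values in the table above all follow one rule: when a process is killed by signal N, the reported exit status is 128+N. A small helper makes the decode mechanical (the function name decode_exit and its message strings are ours, not systemd's):

```shell
# Sketch: translate an exit status into its likely category.
# Statuses above 128 mean "killed by signal (status - 128)".
decode_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 127 ]; then
        echo "command not found"
    elif [ "$status" -eq 126 ]; then
        echo "not executable or permission denied"
    else
        echo "exited with status $status"
    fi
}

decode_exit 139   # prints "killed by signal 11" (SIGSEGV)
decode_exit 137   # prints "killed by signal 9" (SIGKILL, often the OOM killer)
```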

Step 4: Try Manual Start to Reproduce the Error

Why: Running the command manually (outside of systemd) removes systemd's environment isolation and often produces more visible error output, especially for config and permission errors.

# Get the actual command that systemd runs
systemctl cat <SERVICE_NAME> | grep ExecStart

# Try running it manually as the service user
sudo -u <SERVICE_USER> <EXACT_EXECSTART_COMMAND>

# Or run it directly with more verbose output
sudo -u <SERVICE_USER> <BINARY_PATH> --verbose 2>&1

# Check if the working directory and environment are correct
systemctl show <SERVICE_NAME> --property=WorkingDirectory,User,Group,Environment
sudo -u <SERVICE_USER> env <ENVIRONMENT_VARIABLES> <BINARY_PATH>
Expected output:
# The error should now be visible in your terminal, e.g.:
Error: config file /etc/myapp/config.yaml: no such file or directory
If this fails: If the manual run works but the systemd run fails, the issue is in the systemd environment: missing environment variables, different user permissions, or SELinux/AppArmor denials. Check journalctl -u <SERVICE_NAME> | grep -iE "avc|denied" for security policy blocks (the pattern must be quoted, or the shell consumes the escape and grep matches nothing).
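
When the symptom is "works in my shell, dies under systemd", the difference is usually the environment. One way to spot it is to compare the variable names your shell has against the unit's Environment= list; a sketch, where missing_vars is a hypothetical helper and the /tmp paths are illustrative:

```shell
# Sketch: print variable NAMES present on stdin ("NAME=value" lines) but
# absent from the env file given as $1.
missing_vars() {
    cut -d= -f1 | sort -u | while read -r name; do
        grep -q "^$name=" "$1" || echo "$name"
    done
}

# On a real host:
#   systemctl show <SERVICE_NAME> --property=Environment --value | tr ' ' '\n' > /tmp/unit.env
#   env | missing_vars /tmp/unit.env
printf 'PATH=/usr/bin\nDB_URL=postgres://db\n' > /tmp/shell.env
printf 'PATH=/usr/bin\n' > /tmp/unit.env
missing_vars /tmp/unit.env < /tmp/shell.env   # prints: DB_URL
```

The tr split is a rough cut: Environment= values that contain spaces will be mangled, so treat the output as a hint, not proof.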

Step 5: Fix the Config or Binary Issue

Why: The fix depends entirely on what Step 4 revealed. Common fixes are documented here for the most frequent failure modes.

# Missing config file — restore or recreate it
sudo cp /etc/myapp/config.yaml.bak /etc/myapp/config.yaml
sudo systemctl daemon-reload  # Only needed if unit file changed

# Permission denied on binary or config
sudo chmod +x <BINARY_PATH>
sudo chown <SERVICE_USER>:<SERVICE_GROUP> <CONFIG_PATH>

# Missing dependency (database, other service)
# Check if the dependency is running:
systemctl status <DEPENDENCY_SERVICE>
# Fix the unit file to wait for the dependency:
sudo systemctl edit <SERVICE_NAME>
# Add: [Unit]
#      After=<DEPENDENCY_SERVICE>.service
#      Requires=<DEPENDENCY_SERVICE>.service

# Wrong environment variable
sudo systemctl edit <SERVICE_NAME>
# Add: [Service]
#      Environment="MY_VAR=correct_value"
#      EnvironmentFile=/etc/myapp/env

sudo systemctl daemon-reload
Expected output:
# After fix: manual start should succeed
# Application started successfully on port 8080
If this fails: If the binary itself is corrupted or outdated, redeploy: sudo apt-get install --reinstall <PACKAGE_NAME> or re-pull from your artifact repository.
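
The systemctl edit snippets in this step end up as a drop-in override file. For the dependency case, the resulting file might look like the following sketch (the path and dependency name are placeholders; Wants= is the softer alternative if the service can survive the dependency restarting):

```ini
# /etc/systemd/system/<SERVICE_NAME>.service.d/override.conf
# Written by `systemctl edit <SERVICE_NAME>`; if edited by hand,
# run `systemctl daemon-reload` afterwards.
[Unit]
# Order startup after the dependency AND stop if it stops.
# Requires= without After= does not guarantee ordering.
After=<DEPENDENCY_SERVICE>.service
Requires=<DEPENDENCY_SERVICE>.service

[Service]
# Fix an environment variable without touching the packaged unit file.
Environment="MY_VAR=correct_value"
```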

Step 6: Clear Failed State and Restart

Why: When a service hits its restart rate limit (StartLimitBurst), systemd puts it into a permanent failed state and will not attempt further restarts — even after you fix the underlying problem. You must reset this manually.

# Check if service is in failed state due to restart rate limit
systemctl is-failed <SERVICE_NAME>

# Reset the failed state
sudo systemctl reset-failed <SERVICE_NAME>

# Now start the service
sudo systemctl start <SERVICE_NAME>

# Watch the service status for 30 seconds to confirm it stays up
watch -n 2 systemctl status <SERVICE_NAME>

# Enable the service to start on boot if not already enabled
sudo systemctl enable <SERVICE_NAME>
Expected output:
● myservice.service - My Application Service
     Active: active (running) since 2026-03-19 10:15:00 UTC; 30s ago
   Main PID: 13000 (myapp)
If this fails: If the service fails again immediately after systemctl start, return to Step 4 — the root cause was not fully resolved. Do not loop systemctl restart without reading logs each time.
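
If the service legitimately needs more restart headroom while a dependency recovers, the rate limit itself is tunable. A sketch of a drop-in (the numbers are examples, not recommendations for every service; on current systemd versions the StartLimit* settings belong in [Unit]):

```ini
# /etc/systemd/system/<SERVICE_NAME>.service.d/restart-policy.conf
[Unit]
# Allow up to 5 start attempts in any 10-minute window before
# systemd gives up and marks the unit failed.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
# Restart on failure, but wait 10s between attempts so logs stay
# readable and dependencies have time to come back.
Restart=on-failure
RestartSec=10
```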

Verification

# Confirm the issue is resolved
systemctl is-active <SERVICE_NAME> && journalctl -u <SERVICE_NAME> -n 20 --no-pager
Success looks like:
  • systemctl is-active returns active
  • Journal shows clean startup with no errors
  • Monitoring alert has cleared
If still broken: escalate (see Escalation below).
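
The 30-second watch can be scripted so the stability check is identical on every incident. A minimal sketch (the function name stable_for is ours): it runs a check repeatedly and reports success only if every attempt passes.

```shell
# Sketch: run a check command $1 times, $2 seconds apart.
# Succeeds only if the check passes on every attempt.
stable_for() {
    tries=$1; interval=$2; shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" || return 1          # one failed check = not stable
        i=$((i + 1))
        if [ "$i" -lt "$tries" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# On a real host, ~30 seconds of checks, 2s apart:
#   stable_for 15 2 systemctl is-active --quiet <SERVICE_NAME> && echo "service stable"
stable_for 3 0 true && echo "stable"      # prints: stable
stable_for 3 0 false || echo "flapping"   # prints: flapping
```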

Escalation

  • Not resolved in 30 min: page the Application team on-call. Say: "Service <SERVICE_NAME> on <HOSTNAME> is crash looping and cannot be stabilized; root cause not identified."
  • Data loss suspected: page the Application team lead. Say: "Service crashed unexpectedly; in-flight requests may have been lost; data integrity check needed."
  • Scope expanding to multiple nodes: page the SRE lead. Say: "Service crash looping on multiple nodes simultaneously; likely a bad deploy or shared dependency outage."

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Using systemctl restart without clearing failed state first: If the service has hit its StartLimitBurst, systemctl restart is refused with "start request repeated too quickly" until the rate-limit window expires, even after the underlying bug is fixed. Run systemctl reset-failed <SERVICE_NAME> first, then start the service.
  2. Not reading the actual error in journalctl: It is tempting to jump straight to systemctl restart without checking logs. This wastes time (service crashes again) and loses context. Always read the logs first — the error is almost always there.
  3. Reloading systemd daemon unnecessarily: systemctl daemon-reload is only needed when a unit file on disk has changed. Running it when debugging a service that has not had its unit file changed has no effect and wastes time. Run it once after editing a unit file, not after every failed restart attempt.

Cross-References

