systemctl & journalctl Street Ops

Real troubleshooting workflows. Each section is a problem you will hit in production, with the diagnostic sequence and fix.


Why Won't My Service Start?

This is the most common systemd question. Here is the diagnostic sequence, in order:

Step 1: Read the Status

systemctl status myapp.service

Look at three things:

  1. Active line -- failed, inactive, or activating
  2. Loaded line -- is the unit file found? Is it enabled?
  3. Last 10 log lines -- these often contain the answer

Step 2: Get More Logs

journalctl -u myapp.service -n 50 --no-pager

Common patterns:

Log message | Meaning
code=exited, status=203/EXEC | Binary not found or not executable
code=exited, status=217/USER | User= in unit file does not exist
code=exited, status=226/NAMESPACE | Sandboxing directive failed
code=exited, status=200/CHDIR | WorkingDirectory= does not exist
Permission denied | File permissions, SELinux, or AppArmor
Address already in use | Port conflict with another process
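When scanning long journals, it can help to turn the status codes from the table into a one-line hint. The following is a minimal sketch; explain_exit_status is a hypothetical helper name, not a systemd tool, and it only knows the four codes listed above.

```shell
# explain_exit_status TEXT
# Hypothetical helper: finds "status=NNN" in TEXT and prints the meaning
# from the table above. Codes outside the table get a generic pointer.
explain_exit_status() {
  code=$(printf '%s\n' "$1" | sed -n 's/.*status=\([0-9][0-9]*\).*/\1/p')
  case "$code" in
    200) echo "200/CHDIR: WorkingDirectory= does not exist" ;;
    203) echo "203/EXEC: binary not found or not executable" ;;
    217) echo "217/USER: User= in unit file does not exist" ;;
    226) echo "226/NAMESPACE: sandboxing directive failed" ;;
    "")  echo "no status= field found" ;;
    *)   echo "$code: not in the table; see systemd.exec(5)" ;;
  esac
}

explain_exit_status "Main process exited, code=exited, status=203/EXEC"
```

One way to feed it real data is to pass it the last matching journal line, for example the output of journalctl -u myapp.service -n 50 --no-pager | grep 'status=' | tail -n 1.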

Step 3: Check the Unit File

systemctl cat myapp.service

Verify:

  • ExecStart= path exists and is executable
  • User= and Group= exist on the system
  • WorkingDirectory= exists
  • EnvironmentFile= exists and is readable
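The checklist can be roughly automated. The sketch below assumes a simple unit file with one directive per line; check_unit_paths is a hypothetical helper, it does not understand specifiers, line continuations, or WorkingDirectory=~, and the readability check reflects the user running it, not the service user.

```shell
# check_unit_paths UNIT_FILE
# Hypothetical helper: compares paths and users named in a unit file
# against the local system. Returns non-zero if any check fails.
check_unit_paths() (
  status=0
  while IFS='=' read -r key value; do
    case "$key" in
      ExecStart)
        # First word is the binary; drop systemd prefix characters (- @ + ! :).
        path=$(printf '%s\n' "${value%% *}" | sed 's/^[-@+!:]*//')
        [ -z "$path" ] && continue
        [ -x "$path" ] || { echo "ExecStart: $path is missing or not executable"; status=1; }
        ;;
      WorkingDirectory)
        path=${value#-}
        [ -d "$path" ] || { echo "WorkingDirectory: $path does not exist"; status=1; }
        ;;
      EnvironmentFile)
        case "$value" in -*) continue ;; esac   # "-" prefix marks the file as optional
        [ -r "$value" ] || { echo "EnvironmentFile: $value is missing or unreadable"; status=1; }
        ;;
      User)
        id -u "$value" >/dev/null 2>&1 || { echo "User: $value does not exist"; status=1; }
        ;;
    esac
  done < "$1"
  exit "$status"
)
```

A typical call would be check_unit_paths /etc/systemd/system/myapp.service; no output and a zero exit status mean none of these checks failed.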

Step 4: Check Dependencies

systemctl list-dependencies myapp.service

If a Requires= dependency has failed, your service will not start.

Step 5: Run It Manually

# As the service user, with the same command
sudo -u myapp /usr/local/bin/myapp --config /etc/myapp/config.yaml

If it works manually but not under systemd, the problem is environment (missing env vars, different PATH, SELinux context, or working directory).
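To narrow down an environment difference, one option is to dump both environments to files and compare the variable names. This is a sketch under assumptions: env_names_only_in is a hypothetical helper, the file paths are placeholders, and the transient unit in the commented capture step does not include the Environment= or EnvironmentFile= settings of the real unit.

```shell
# env_names_only_in A B
# Hypothetical helper: prints variable names present in env dump A
# but absent from env dump B. Assumes one NAME=value per line.
env_names_only_in() {
  awk -F= '
    FILENAME == ARGV[1] { in_b[$1] = 1; next }   # first file read is B
    !($1 in in_b) && !seen[$1]++ { print $1 }    # then report names unique to A
  ' "$2" "$1" | sort
}

# Example capture (myapp is the service user from the step above):
#   sudo -u myapp env > /tmp/shell.env
#   sudo systemd-run --pipe --wait --quiet -p User=myapp env > /tmp/unit.env
#   env_names_only_in /tmp/shell.env /tmp/unit.env
```

Names that show up only on the shell side are candidates for an Environment= or EnvironmentFile= line in the unit.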

Step 6: Check for Masking

systemctl is-enabled myapp.service

If the output is masked, someone symlinked the unit to /dev/null:

systemctl unmask myapp.service

Service Restart Loops

Symptoms: service flaps between active and failed. Journal shows repeated start/stop cycles.

Diagnosis

# How many times has it restarted?
systemctl show myapp -p NRestarts

# What are the restart limits?
systemctl show myapp -p StartLimitBurst -p StartLimitIntervalUSec

# Is it in failed state due to rate limiting?
systemctl status myapp
# Look for: "Start request repeated too quickly"

Common Causes

Config error causing immediate crash:

The service starts, reads bad config, exits non-zero. Restart=always restarts it. It crashes again. Repeat until StartLimitBurst is hit.

# Fix: check the config
journalctl -u myapp -b | head -50
# Look for config parse errors in the first few lines

Missing dependency at runtime:

Database is down. Service starts, tries to connect, fails, exits.

# Fix: add proper dependency
# In unit file:
#   Requires=postgresql.service
#   After=postgresql.service

Port conflict:

Another process grabbed the port. Service starts, fails to bind, exits.

# Find what's on the port
ss -tlnp | grep :8080
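The ss output can be reduced to just the owning process. The sketch below assumes the usual ss -tlnp layout, where the fourth column is the local address and the process column looks like users:(("name",pid=N,fd=M)); port_owners is a hypothetical helper, and the process column is typically only filled in when ss runs as root.

```shell
# port_owners PORT
# Hypothetical helper: reads `ss -tlnp` output on stdin and prints
# "name pid" for every process listening on PORT.
port_owners() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      line = $0
      # Walk every ("name",pid=N entry in the users:(...) column.
      while (match(line, /\("[^"]+",pid=[0-9]+/)) {
        entry = substr(line, RSTART + 2, RLENGTH - 2)
        split(entry, parts, /",pid=/)
        print parts[1], parts[2]
        line = substr(line, RSTART + RLENGTH)
      }
    }'
}

# Example: sudo ss -tlnp | port_owners 8080
```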

Recovery

Once the service hits the start limit, it enters failed state and will not restart even if you fix the underlying problem:

# Reset the failure counter
systemctl reset-failed myapp.service

# Now start it
systemctl start myapp.service

Tuning Restart Behavior

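[Unit]
# Window for counting starts
StartLimitIntervalSec=300
# Max starts in window
StartLimitBurst=5

[Service]
Restart=on-failure
# Wait 10s between restarts
RestartSec=10

StartLimitIntervalSec= and StartLimitBurst= belong in the [Unit] section (since systemd 230); in [Service], StartLimitIntervalSec= is ignored as an unknown key. Comments must sit on their own lines, because systemd does not support trailing comments after a value. This configuration allows 5 starts within 5 minutes, with 10 seconds between each restart.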
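As a rough check on these numbers, the time until a crash-looping service trips the limit can be estimated from RestartSec and StartLimitBurst. The sketch below assumes the service exits immediately on every start, so real services take somewhat longer; seconds_until_start_limit is a hypothetical helper.

```shell
# seconds_until_start_limit RESTART_SEC BURST INTERVAL_SEC
# Hypothetical helper. Starts happen at t = 0, R, 2R, ...; the start that
# would exceed the burst is attempted at t = BURST * RESTART_SEC.
seconds_until_start_limit() (
  restart_sec=$1
  burst=$2
  interval=$3
  blocked_at=$((burst * restart_sec))
  if [ "$blocked_at" -lt "$interval" ]; then
    echo "start limit hit after ~${blocked_at}s"
  else
    echo "start limit never hit: restarts are spaced too far apart"
  fi
)

seconds_until_start_limit 10 5 300   # prints: start limit hit after ~50s
```

The second branch is the practical takeaway: if RestartSec multiplied by StartLimitBurst is not smaller than the interval, the service can restart indefinitely without ever entering the rate-limited failed state.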


Debugging Socket Activation

Socket activation issues are subtle because two units are involved: the .socket and the .service.

"I stopped the service but it keeps coming back"

systemctl stop myapp.service
# A minute later, myapp is running again

The socket unit is still active. Any new connection reactivates the service:

# Check if the socket is active
systemctl status myapp.socket

# Stop both
systemctl stop myapp.socket myapp.service

"Service starts but connections fail"

The service must receive the socket as file descriptor 3. If the service opens its own socket instead of using the passed FD, you get port conflicts or connections that never reach the service.

# Look for evidence that the service uses the passed socket
# (StatusText is only populated if the service reports status via sd_notify)
systemctl show myapp.service -p StatusText
journalctl -u myapp.service | grep -i "socket\|listen\|fd"
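A more direct check is to look at the environment systemd hands to the service: for socket-activated units it sets LISTEN_PID and LISTEN_FDS (the protocol behind sd_listen_fds(3)). The sketch below is meant for a wrapper script used as ExecStart=; has_activation_sockets is a hypothetical helper name.

```shell
# has_activation_sockets
# Hypothetical helper: succeeds when systemd passed at least one socket
# to this process. LISTEN_PID must match our own PID, and LISTEN_FDS
# counts the descriptors, which start at fd 3.
has_activation_sockets() {
  [ "${LISTEN_PID:-}" = "$$" ] && [ "${LISTEN_FDS:-0}" -ge 1 ]
}

if has_activation_sockets; then
  echo "socket activation: $LISTEN_FDS fd(s) passed, starting at fd 3"
else
  echo "no sockets passed; the service would have to bind its own" >&2
fi
```

If the second message shows up in the journal for a unit that has a matching .socket, the service was probably started directly rather than through the socket.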

"Socket activation works for first connection then dies"

Check if Accept=yes is set in the socket unit. With Accept=yes, systemd spawns a new service instance per connection. With Accept=no (default), the service handles all connections on the same FD.

If the service exits with a failure after handling one connection and Accept=no, let systemd restart it:

[Service]
Restart=on-failure

Timer Not Firing

Step 1: Is the Timer Active?

systemctl list-timers --all

Look for your timer. Check the NEXT and LAST columns. If NEXT says n/a, the timer is not scheduled.

Step 2: Validate the Calendar Expression

# Check if the expression is valid
systemd-analyze calendar "Mon *-*-* 02:00:00"

# See when it will next fire
systemd-analyze calendar "Mon *-*-* 02:00:00" --iterations=5

Common mistakes:

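Written as | Prefer | Issue
OnCalendar=2:00 | OnCalendar=*-*-* 02:00:00 | Date portion omitted; systemd reads it as *-*-* 02:00:00, but the explicit form is easier to audit
OnCalendar=Mon-Fri *-*-* 02:00 | OnCalendar=Mon..Fri *-*-* 02:00:00 | Range syntax is .. (do not rely on the older - form)
OnCalendar=*/15 * * * * | OnCalendar=*:0/15 | This is cron syntax, which systemd rejects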
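Rather than memorizing the rules, several candidate expressions can be checked in one pass by relying on the exit status of systemd-analyze calendar. This is a small sketch; check_calendar is a hypothetical helper name.

```shell
# check_calendar EXPR...
# Hypothetical helper: reports which expressions `systemd-analyze calendar`
# accepts, based only on its exit status.
check_calendar() {
  for expr in "$@"; do
    if systemd-analyze calendar "$expr" >/dev/null 2>&1; then
      echo "ok       $expr"
    else
      echo "INVALID  $expr"
    fi
  done
}

# Example: check_calendar '*:0/15' 'Mon..Fri *-*-* 02:00:00' '*/15 * * * *'
```

For any expression reported as ok, running systemd-analyze calendar on it directly shows the normalized form and the next elapse time.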

Step 3: Check the Paired Service

By default, the timer triggers the service with the same name (db-backup.timer triggers db-backup.service) unless Unit= in the [Timer] section says otherwise. If the service fails, the timer still shows as active.

# Check the service
systemctl status db-backup.service
journalctl -u db-backup.service -n 20

Step 4: Did You Enable the Timer?

systemctl is-enabled db-backup.timer
# Must be "enabled"

systemctl enable --now db-backup.timer

Step 5: Missed Runs

If the system was off when the timer should have fired:

[Timer]
Persistent=true

Without Persistent=true, missed runs are silently lost. The setting only affects timers that use OnCalendar=.


Overriding Vendor Units with Drop-ins

The Scenario

You need to add an environment variable to nginx without replacing the entire vendor unit file.

The Fix

systemctl edit nginx.service

This opens an editor. Add:

[Service]
Environment="CUSTOM_VAR=value"

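Save and exit. systemctl edit reloads the unit configuration for you, but the running service only picks up the change after systemctl restart nginx.service.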

The file is saved at: /etc/systemd/system/nginx.service.d/override.conf

Replacing ExecStart

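ExecStart= is a list-type directive: a value in a drop-in is added to the vendor's command instead of replacing it. Clear the list with an empty assignment first: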

[Service]
ExecStart=
ExecStart=/usr/sbin/nginx -g 'daemon off;' -c /etc/nginx/custom.conf

Without the empty ExecStart=, you get: Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services.
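For provisioning scripts where an interactive editor is not an option, the same drop-in can be written directly. The sketch below assumes the standard drop-in location under /etc/systemd/system; write_execstart_override is a hypothetical helper, and the optional third argument exists only so it can be tried against a scratch directory.

```shell
# write_execstart_override UNIT COMMAND [BASE_DIR]
# Hypothetical helper: writes UNIT.d/override.conf with a cleared and
# redefined ExecStart=, then prints the path of the file it wrote.
write_execstart_override() (
  unit=$1
  command=$2
  base=${3:-/etc/systemd/system}
  dir="$base/$unit.d"
  mkdir -p "$dir" || exit 1
  # The empty ExecStart= clears the vendor command before the new one is added.
  printf '[Service]\nExecStart=\nExecStart=%s\n' "$command" > "$dir/override.conf"
  echo "$dir/override.conf"
)

# Example (as root), followed by a reload and restart:
#   write_execstart_override nginx.service "/usr/sbin/nginx -g 'daemon off;' -c /etc/nginx/custom.conf"
#   systemctl daemon-reload && systemctl restart nginx.service
```

Unlike systemctl edit, nothing here reloads systemd, so the daemon-reload step is required.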

Viewing Effective Configuration

# See the unit file followed by every drop-in that applies to it
systemctl cat nginx.service

# See what overrides exist
systemd-delta --type=extended

Removing an Override

# Remove the drop-in directory
rm -rf /etc/systemd/system/nginx.service.d/
systemctl daemon-reload

Emergency Service Recovery

Service Stuck in Failed State

# Reset the failure counter
systemctl reset-failed myapp.service

# Now you can start it again
systemctl start myapp.service

Service Will Not Stop (Hung Process)

# Check what's happening
systemctl status myapp.service
# If it says "Deactivating (stop-sigterm)..."

# Force kill
systemctl kill myapp.service --signal=SIGKILL

# If processes survive, confirm which cgroup the unit owns
systemctl show myapp.service -p ControlGroup
# Kill every process in that cgroup (all is already the default;
# newer systemd releases also accept the spelling --kill-whom)
systemctl kill myapp.service --kill-who=all --signal=9

Unit File Syntax Error Prevents Start

# Verify syntax
systemd-analyze verify /etc/systemd/system/myapp.service

# If the file is broken, fix it, then:
systemctl daemon-reload
systemctl start myapp.service

Need to Start a Service That's Masked

# Check if masked
systemctl is-enabled myapp.service
# "masked"

# Unmask it
systemctl unmask myapp.service
systemctl start myapp.service

Finding Resource-Hogging Services

Live Monitoring

# systemd-aware top (shows CPU, memory, I/O per cgroup)
systemd-cgtop

# Sort by memory
systemd-cgtop -m

# Sort by CPU
systemd-cgtop -c

Point-in-Time Queries

# Memory usage of a specific service
systemctl show nginx -p MemoryCurrent
systemctl show nginx -p MemoryPeak

# CPU time consumed
systemctl show nginx -p CPUUsageNSec

# Number of processes/threads
systemctl show nginx -p TasksCurrent

# All resource properties
systemctl show nginx -p MemoryCurrent -p MemoryPeak -p CPUUsageNSec \
  -p TasksCurrent -p IPIngressBytes -p IPEgressBytes

Finding the Worst Offenders

# List all services with their memory usage
for svc in $(systemctl list-units --type=service --state=running \
  --no-legend --no-pager | awk '{print $1}'); do
  mem=$(systemctl show "$svc" -p MemoryCurrent --value 2>/dev/null)
  if [ "$mem" != "[not set]" ] && [ -n "$mem" ]; then
    echo "$mem $svc"
  fi
done | sort -rn | head -20
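The loop prints raw byte counts, which are hard to compare at a glance. A small filter can convert the first column to MiB; to_mib is a hypothetical helper and simply skips lines whose first field is not a plain number.

```shell
# to_mib
# Hypothetical helper: reads "BYTES UNIT" lines on stdin and prints the
# size in MiB followed by the unit name.
to_mib() {
  awk '$1 ~ /^[0-9]+$/ { printf "%8.1f MiB  %s\n", $1 / 1048576, $2 }'
}

# Example: append it to the pipeline above
#   ... | sort -rn | head -20 | to_mib
```

Sorting has to happen before the conversion, since sort -rn works on the raw byte values.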

Setting Limits on Offenders

# Apply a limit immediately (no unit file edit needed)
systemctl set-property nginx.service MemoryMax=1G
systemctl set-property nginx.service CPUQuota=150%

# These persist across reboots (written to a drop-in)
# To apply a limit only until the next reboot:
systemctl set-property --runtime nginx.service MemoryMax=1G

Using systemd-run for One-Off Contained Commands

systemd-run creates transient units -- services, scopes, or timers that exist only for the duration of the command.

Resource-Limited One-Off

# Run a script with memory and CPU limits
systemd-run --scope -p MemoryMax=512M -p CPUQuota=100% \
  /usr/local/bin/data-import.sh

# Run with low I/O priority (IOWeight ranges from 1 to 10000, default 100)
systemd-run --scope -p IOWeight=10 \
  rsync -a /backup/source/ /backup/dest/

Named Transient Service

# Create a named transient service (visible in systemctl)
systemd-run --unit=manual-migration \
  --description="Database migration" \
  -p MemoryMax=2G \
  /usr/local/bin/db-migrate --full

# Monitor it
systemctl status manual-migration
journalctl -u manual-migration -f

Transient Timer

# Run a command in 30 minutes
systemd-run --on-active=30min /usr/local/bin/cleanup.sh

# Run a command at a specific time
systemd-run --on-calendar="2025-03-20 02:00:00" /usr/local/bin/maintenance.sh

Running as a Different User

systemd-run --uid=backup --gid=backup \
  -p ProtectSystem=strict -p PrivateTmp=yes \
  /usr/local/bin/backup.sh

Analyzing Boot Performance

Quick Overview

# Total boot time
systemd-analyze
# Startup finished in 2.345s (kernel) + 5.678s (userspace) = 8.023s

# Slowest units
systemd-analyze blame | head -20

Finding the Critical Path

# Which units are on the critical path?
systemd-analyze critical-chain

# Critical chain for a specific service
systemd-analyze critical-chain nginx.service

Output looks like:

multi-user.target @8.012s
+- nginx.service @7.500s +512ms
  +- network-online.target @7.450s
    +- NetworkManager-wait-online.service @2.100s +5.350s

This tells you nginx took 512ms, but it was blocked waiting for NetworkManager-wait-online.service which took 5.35 seconds.
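On longer chains, the slow links can be filtered out mechanically. The sketch below reads critical-chain output on stdin and keeps units whose own activation time (the +value) meets a threshold; slow_links is a hypothetical helper, and it only understands durations written in ms or s, so entries such as +1min 2.345s are skipped.

```shell
# slow_links MIN_MS
# Hypothetical helper: reads `systemd-analyze critical-chain` output on
# stdin and prints units whose "+duration" is at least MIN_MS milliseconds.
slow_links() {
  awk -v min_ms="$1" '
    {
      for (i = 2; i < NF; i++) {
        # A unit line has "... unit @start +duration".
        if ($i ~ /^@/ && $(i + 1) ~ /^\+[0-9.]+(ms|s)$/) {
          dur = substr($(i + 1), 2)
          if (dur ~ /ms$/) ms = dur + 0            # "512ms"  -> 512
          else             ms = (dur + 0) * 1000   # "5.350s" -> 5350
          if (ms >= min_ms) printf "%.0f ms  %s\n", ms, $(i - 1)
        }
      }
    }'
}

# Example: systemd-analyze critical-chain nginx.service | slow_links 1000
```

Run against the sample output above with a 1000 ms threshold, this would leave only NetworkManager-wait-online.service.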

Visual Boot Chart

# Generate SVG plot of entire boot
systemd-analyze plot > boot.svg
# Open in browser to see parallel unit activation

Common Boot Slowdowns

Culprit | Fix
NetworkManager-wait-online.service | Disable if not needed, or switch to network.target
systemd-udev-settle.service | Usually a broken udev rule
plymouth-quit-wait.service | Disable splash screen on servers
fstrim.timer | Not a boot issue, but triggers at startup
Large journal replay | Limit journal size in journald.conf

Managing User Services

User services run in per-user systemd instances. No root required.

Setup

# Create the directory
mkdir -p ~/.config/systemd/user/

# Create a user service
cat > ~/.config/systemd/user/dev-server.service << 'EOF'
[Unit]
Description=Development server

[Service]
ExecStart=/home/alice/bin/dev-server --port 3000
Restart=on-failure
WorkingDirectory=/home/alice/projects/myapp

[Install]
WantedBy=default.target
EOF

# Reload and start
systemctl --user daemon-reload
systemctl --user enable --now dev-server.service

Viewing Logs

journalctl --user -u dev-server.service
journalctl --user -u dev-server.service -f

Lingering

By default, user services only run while the user has an active login session. To keep them running after logout:

# As root
loginctl enable-linger alice

# Verify
loginctl show-user alice -p Linger

Common User Service Use Cases

  • Development servers and watchers
  • SSH tunnel maintenance
  • Notification daemons
  • Personal backup timers
  • Syncthing, Tailscale userspace mode