systemctl & journalctl Street Ops
Real troubleshooting workflows. Each section is a problem you will hit in production, with the diagnostic sequence and fix.
Why Won't My Service Start?
This is the most common systemd question. Here is the diagnostic sequence, in order:
Step 1: Read the Status
Run systemctl status myapp.service and look at three things:
1. Active line -- failed, inactive, or activating
2. Loaded line -- is the unit file found? Is it enabled?
3. Last 10 log lines -- these often contain the answer
Step 2: Get More Logs
Pull the unit's full log for the current boot with journalctl -u myapp.service -b --no-pager. Common patterns:
| Log message | Meaning |
|---|---|
| code=exited, status=203/EXEC | Binary not found or not executable |
| code=exited, status=217/USER | User= in unit file does not exist |
| code=exited, status=226/NAMESPACE | Sandboxing directive failed |
| code=exited, status=200/CHDIR | WorkingDirectory= does not exist |
| Permission denied | File permissions, SELinux, or AppArmor |
| Address already in use | Port conflict with another process |
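The numeric part of the status field maps directly onto the table above, so a small lookup can serve as a memory aid. This is only an illustrative sketch: the function name explain_exit_status is made up here, and it covers just the four codes listed in this section.

```shell
# Map a systemd exit status number to the likely cause listed above.
explain_exit_status() {
    case "$1" in
        200) echo "CHDIR: WorkingDirectory= does not exist" ;;
        203) echo "EXEC: binary not found or not executable" ;;
        217) echo "USER: User= in unit file does not exist" ;;
        226) echo "NAMESPACE: sandboxing directive failed" ;;
        *)   echo "unknown: read the journal for this unit" ;;
    esac
}

# The number is the part before the slash, e.g. status=203/EXEC
explain_exit_status 203
```

On a live system the number can be read with systemctl show myapp.service -p ExecMainStatus --value and passed to the function.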
Step 3: Check the Unit File
Print the unit file together with any drop-ins using systemctl cat myapp.service, then verify:
- ExecStart= path exists and is executable
- User= and Group= exist on the system
- WorkingDirectory= exists
- EnvironmentFile= exists and is readable
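The checklist above can be run as a quick preflight. The sketch below assumes you copy the four values out of the unit file by hand; the function name check_unit_settings is hypothetical, nothing in it talks to systemd, and the Group= check is left out for brevity.

```shell
# Report which of the unit file's settings point at something missing.
# Arguments: ExecStart binary, User, WorkingDirectory, EnvironmentFile.
check_unit_settings() {
    exec_path=$1 run_user=$2 work_dir=$3 env_file=$4
    rc=0
    [ -x "$exec_path" ] || { echo "ExecStart: $exec_path missing or not executable"; rc=1; }
    id "$run_user" >/dev/null 2>&1 || { echo "User: $run_user does not exist"; rc=1; }
    [ -d "$work_dir" ] || { echo "WorkingDirectory: $work_dir does not exist"; rc=1; }
    [ -r "$env_file" ] || { echo "EnvironmentFile: $env_file missing or unreadable"; rc=1; }
    return "$rc"
}

# Example call (the last two paths are placeholders for your own unit):
# check_unit_settings /usr/local/bin/myapp myapp /var/lib/myapp /etc/myapp/env
```

The function prints one line per problem and returns non-zero if any check fails, so it can gate a deploy script.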
Step 4: Check Dependencies
List the unit's dependencies with systemctl list-dependencies myapp.service. If a Requires= dependency has failed, your service will not start.
Step 5: Run It Manually
# As the service user, with the same command
sudo -u myapp /usr/local/bin/myapp --config /etc/myapp/config.yaml
If it works manually but not under systemd, the problem is the environment: missing env vars, a different PATH, SELinux context, or working directory.
Step 6: Check for Masking
Run systemctl is-enabled myapp.service. If the output is masked, someone symlinked the unit to /dev/null; run systemctl unmask myapp.service before starting it.
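Masking can also be confirmed on disk, since a mask is nothing more than a symlink to /dev/null. A minimal sketch, assuming the mask lives in the usual /etc/systemd/system location; the helper name is_masked_file is hypothetical.

```shell
# Succeeds when the given unit file path is a symlink to /dev/null,
# which is how "systemctl mask" marks a unit.
is_masked_file() {
    [ "$(readlink "$1" 2>/dev/null)" = "/dev/null" ]
}

# Persistent masks normally live under /etc/systemd/system
# (runtime masks created with --runtime are under /run/systemd/system).
unit_path=/etc/systemd/system/myapp.service
if is_masked_file "$unit_path"; then
    echo "$unit_path is masked"
else
    echo "$unit_path is not masked"
fi
```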
Service Restart Loops
Symptoms: service flaps between active and failed. Journal shows
repeated start/stop cycles.
Diagnosis
# How many times has it restarted?
systemctl show myapp -p NRestarts
# What are the restart limits?
systemctl show myapp -p StartLimitBurst -p StartLimitIntervalUSec
# Is it in failed state due to rate limiting?
systemctl status myapp
# Look for: "Start request repeated too quickly"
Common Causes
Config error causing immediate crash:
The service starts, reads bad config, exits non-zero. Restart=always
restarts it. It crashes again. Repeat until StartLimitBurst is hit.
# Fix: check the config
journalctl -u myapp -b | head -50
# Look for config parse errors in the first few lines
Missing dependency at runtime:
Database is down. Service starts, tries to connect, fails, exits.
# Fix: add proper dependency
# In unit file:
# Requires=postgresql.service
# After=postgresql.service
Port conflict:
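Another process grabbed the port. Service starts, fails to bind, exits. sudo ss -tlnp lists listening TCP sockets with their owning processes, which shows who holds the port.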
Recovery
Once the service hits the start limit, it enters failed state and
will not restart even if you fix the underlying problem:
# Reset the failure counter
systemctl reset-failed myapp.service
# Now start it
systemctl start myapp.service
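Tuning Restart Behavior
[Unit]
# Window for counting starts
StartLimitIntervalSec=300
# Max starts in window
StartLimitBurst=5
[Service]
Restart=on-failure
# Wait 10s between restarts
RestartSec=10
This allows up to 5 starts within 5 minutes, with 10 seconds between automatic restarts. The start-limit settings belong in [Unit], and comments go on their own lines because systemd does not treat a trailing # as a comment.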
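Whether a crash loop ever trips the limit depends on how RestartSec, StartLimitBurst, and StartLimitIntervalSec interact. The arithmetic below is a rough sketch under one simplifying assumption: the service dies almost immediately after every start. The function name start_limit_trip_time is hypothetical.

```shell
# Rough time (seconds) until systemd refuses another start.
# Start number burst+1 is the first one refused, and it is attempted
# after "burst" waits of restart_sec each.
start_limit_trip_time() {
    burst=$1 restart_sec=$2
    echo $(( burst * restart_sec ))
}

interval=300
trip=$(start_limit_trip_time 5 10)
if [ "$trip" -lt "$interval" ]; then
    echo "limit trips after about ${trip}s, unit ends up failed"
else
    echo "restarts are spaced too far apart to trip the limit"
fi
```

With the values from this section the limit trips after roughly 50 seconds. Raising RestartSec to 60 or more would spread six starts over at least 300 seconds, so the unit would keep looping instead of failing.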
Debugging Socket Activation
Socket activation issues are subtle because two units are involved:
the .socket and the .service.
"I stopped the service but it keeps coming back"
The socket unit is still active. Any new connection reactivates the service:
# Check if the socket is active
systemctl status myapp.socket
# Stop both
systemctl stop myapp.socket myapp.service
"Service starts but connections fail"
The service must receive the socket as file descriptor 3. If the service opens its own socket instead of using the passed FD, you get port conflicts or connections that never reach the service.
# Verify the socket is actually passed
systemctl show myapp.service -p StatusText
journalctl -u myapp.service | grep -i "socket\|listen\|fd"
"Socket activation works for first connection then dies"
Check if Accept=yes is set in the socket unit. With Accept=yes,
systemd spawns a new service instance per connection. With Accept=no
(default), the service handles all connections on the same FD.
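If the service exits after handling one connection and Accept=no, systemd starts it again on the next incoming connection. If those starts keep failing, the socket unit itself can end up failed, so check systemctl status myapp.socket as well as the service.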
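If the program is written inetd-style, handling a single connection and then exiting, one arrangement that matches that behavior is Accept=yes with a template service, so every connection gets its own instance. A minimal sketch, assuming a hypothetical port 8080 and a program that talks over stdin/stdout:

```ini
# myapp.socket
[Socket]
ListenStream=8080
Accept=yes

[Install]
WantedBy=sockets.target

# myapp@.service (template; one instance is started per connection)
[Service]
ExecStart=/usr/local/bin/myapp
StandardInput=socket
```

With Accept=yes the service must be a template unit (note the @ in the file name), and the descriptor it receives is the already-accepted connection rather than the listening socket.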
Timer Not Firing
Step 1: Is the Timer Active?
Run systemctl list-timers --all and look for your timer. Check the NEXT and LAST columns. If NEXT says n/a, the timer is not scheduled.
Step 2: Validate the Calendar Expression
# Check if the expression is valid
systemd-analyze calendar "Mon *-*-* 02:00:00"
# See when it will next fire
systemd-analyze calendar "Mon *-*-* 02:00:00" --iterations=5
Common mistakes:
| Wrong | Right | Issue |
|---|---|---|
| `OnCalendar=2:00` | `OnCalendar=*-*-* 02:00:00` | Missing date portion; systemd accepts it and assumes `*-*-*`, but the explicit form is clearer |
| `OnCalendar=Mon-Fri *-*-* 02:00` | `OnCalendar=Mon..Fri *-*-* 02:00:00` | Range is `..` not `-` |
| `OnCalendar=*/15 * * * *` | `OnCalendar=*:0/15` | This is not cron syntax |
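Step 3: Check the Paired Service
The timer triggers a service with the same name (minus .timer) unless Unit= says otherwise. If the service fails, the timer still shows as active, so check the service itself with systemctl status myapp.service and journalctl -u myapp.service.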
Step 4: Did You Enable the Timer?
Enable and start the .timer unit, not the service: systemctl enable --now myapp.timer.
Step 5: Missed Runs
If the system was off when the timer should have fired, the run is caught up at the next boot only when the [Timer] section sets Persistent=true.
Without Persistent=true, missed runs are silently lost.
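A timer unit with catch-up enabled could look like the sketch below. The unit name mybackup.timer and the schedule are placeholders; Persistent= only has an effect on OnCalendar= timers.

```ini
# mybackup.timer (hypothetical name)
[Timer]
OnCalendar=*-*-* 02:00:00
# Run once at boot if a scheduled run was missed while powered off
Persistent=true

[Install]
WantedBy=timers.target
```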
Overriding Vendor Units with Drop-ins
The Scenario
You need to add an environment variable to nginx without replacing the entire vendor unit file.
The Fix
Run sudo systemctl edit nginx.service. This opens an editor. Add:
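For example, a drop-in that sets one variable might look like the sketch below. MY_SETTING and its value are placeholders, not something nginx itself reads.

```ini
[Service]
# Placeholder variable; replace with the one your setup needs
Environment="MY_SETTING=value"
```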
Save and exit. systemctl edit reloads the unit configuration for you, so no manual daemon-reload is needed.
The file is saved at:
/etc/systemd/system/nginx.service.d/override.conf
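Replacing ExecStart
ExecStart= accumulates across files instead of being replaced, so a drop-in that only sets a new value adds a second command. You must clear it first with an empty assignment: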
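A minimal override.conf sketch for this case; the replacement command line is hypothetical:

```ini
[Service]
# The empty assignment clears the ExecStart= inherited from the vendor unit
ExecStart=
# Replacement command (hypothetical path and flags)
ExecStart=/usr/sbin/nginx -c /etc/nginx/custom.conf
```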
Without the empty ExecStart=, you get:
Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services.
Viewing Effective Configuration
# See the final merged result
systemctl cat nginx.service
# See what overrides exist
systemd-delta --type=extended
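Removing an Override
systemctl revert nginx.service deletes the drop-ins created with systemctl edit and restores the vendor unit.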
Emergency Service Recovery
Service Stuck in Failed State
# Reset the failure counter
systemctl reset-failed myapp.service
# Now you can start it again
systemctl start myapp.service
Service Will Not Stop (Hung Process)
# Check what's happening
systemctl status myapp.service
# If it says "Deactivating (stop-sigterm)..."
# Force kill
systemctl kill myapp.service --signal=SIGKILL
# If that fails, find the cgroup and kill everything in it
systemctl show myapp.service -p ControlGroup
# Kill all processes in that cgroup
systemctl kill myapp.service --kill-who=all --signal=9
Unit File Syntax Error Prevents Start
# Verify syntax
systemd-analyze verify /etc/systemd/system/myapp.service
# If the file is broken, fix it, then:
systemctl daemon-reload
systemctl start myapp.service
Need to Start a Service That's Masked
# Check if masked
systemctl is-enabled myapp.service
# "masked"
# Unmask it
systemctl unmask myapp.service
systemctl start myapp.service
Finding Resource-Hogging Services
Live Monitoring
# systemd-aware top (shows CPU, memory, I/O per cgroup)
systemd-cgtop
# Sort by memory
systemd-cgtop -m
# Sort by CPU
systemd-cgtop -c
Point-in-Time Queries
# Memory usage of a specific service
systemctl show nginx -p MemoryCurrent
systemctl show nginx -p MemoryPeak
# CPU time consumed
systemctl show nginx -p CPUUsageNSec
# Number of processes/threads
systemctl show nginx -p TasksCurrent
# All resource properties
systemctl show nginx -p MemoryCurrent -p MemoryPeak -p CPUUsageNSec \
-p TasksCurrent -p IPIngressBytes -p IPEgressBytes
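MemoryCurrent is reported in bytes and CPUUsageNSec in nanoseconds, which makes the raw numbers hard to read at a glance. The helpers below are a small sketch for converting them; the function names are made up here, and the sample values stand in for real systemctl show output.

```shell
# Convert raw systemctl show values into friendlier units.
bytes_to_mib() { echo $(( $1 / 1024 / 1024 )); }
nsec_to_sec()  { echo $(( $1 / 1000000000 )); }

# Sample values standing in for "systemctl show ... --value" output
bytes_to_mib 268435456     # 256 MiB
nsec_to_sec 42500000000    # 42 s (integer division truncates)
```

On a live system, feed the functions from systemctl show nginx -p MemoryCurrent --value, after checking that the value is not [not set].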
Finding the Worst Offenders
# List all services with their memory usage
for svc in $(systemctl list-units --type=service --state=running \
        --no-legend --no-pager | awk '{print $1}'); do
    mem=$(systemctl show "$svc" -p MemoryCurrent --value 2>/dev/null)
    if [ "$mem" != "[not set]" ] && [ -n "$mem" ]; then
        echo "$mem $svc"
    fi
done | sort -rn | head -20
Setting Limits on Offenders
# Quick temporary limit (no unit file edit needed)
systemctl set-property nginx.service MemoryMax=1G
systemctl set-property nginx.service CPUQuota=150%
# These persist across restarts (written to drop-in)
# To make temporary only:
systemctl set-property --runtime nginx.service MemoryMax=1G
Using systemd-run for One-Off Contained Commands
systemd-run creates transient units -- services, scopes, or timers
that exist only for the duration of the command.
Resource-Limited One-Off
# Run a script with memory and CPU limits
systemd-run --scope -p MemoryMax=512M -p CPUQuota=100% \
/usr/local/bin/data-import.sh
# Run with I/O throttling
systemd-run --scope -p IOWeight=10 \
rsync -a /backup/source/ /backup/dest/
Named Transient Service
# Create a named transient service (visible in systemctl)
systemd-run --unit=manual-migration \
--description="Database migration" \
-p MemoryMax=2G \
/usr/local/bin/db-migrate --full
# Monitor it
systemctl status manual-migration
journalctl -u manual-migration -f
Transient Timer
# Run a command in 30 minutes
systemd-run --on-active=30min /usr/local/bin/cleanup.sh
# Run a command at a specific time
systemd-run --on-calendar="2025-03-20 02:00:00" /usr/local/bin/maintenance.sh
Running as a Different User
systemd-run --uid=backup --gid=backup \
-p ProtectSystem=strict -p PrivateTmp=yes \
/usr/local/bin/backup.sh
Analyzing Boot Performance
Quick Overview
# Total boot time
systemd-analyze
# Startup finished in 2.345s (kernel) + 5.678s (userspace) = 8.023s
# Slowest units
systemd-analyze blame | head -20
Finding the Critical Path
# Which units are on the critical path?
systemd-analyze critical-chain
# Critical chain for a specific service
systemd-analyze critical-chain nginx.service
Output looks like:
multi-user.target @8.012s
+- nginx.service @7.500s +512ms
+- network-online.target @7.450s
+- NetworkManager-wait-online.service @2.100s +5.350s
This tells you nginx took 512ms, but it was blocked waiting for
NetworkManager-wait-online.service which took 5.35 seconds.
Visual Boot Chart
# Generate SVG plot of entire boot
systemd-analyze plot > boot.svg
# Open in browser to see parallel unit activation
Common Boot Slowdowns
| Culprit | Fix |
|---|---|
| NetworkManager-wait-online.service | Disable if not needed, or switch to network.target |
| systemd-udev-settle.service | Usually a broken udev rule |
| plymouth-quit-wait.service | Disable splash screen on servers |
| fstrim.timer | Not a boot issue, but triggers at startup |
| Large journal replay | Limit journal size in journald.conf |
Managing User Services
User services run in per-user systemd instances. No root required.
Setup
# Create the directory
mkdir -p ~/.config/systemd/user/
# Create a user service
cat > ~/.config/systemd/user/dev-server.service << 'EOF'
[Unit]
Description=Development server
[Service]
ExecStart=/home/alice/bin/dev-server --port 3000
Restart=on-failure
WorkingDirectory=/home/alice/projects/myapp
[Install]
WantedBy=default.target
EOF
# Reload and start
systemctl --user daemon-reload
systemctl --user enable --now dev-server.service
Viewing Logs
journalctl --user -u dev-server.service shows the service's log; add -f to follow it.
Lingering
By default, user services only run while the user has an active login session. To keep them running after logout, enable lingering for the account: sudo loginctl enable-linger alice.
Common User Service Use Cases
- Development servers and watchers
- SSH tunnel maintenance
- Notification daemons
- Personal backup timers
- Syncthing, Tailscale userspace mode