Skip to content

Linux Logging — Street Ops

Real-world operational scenarios for log-related problems. These are the situations you'll encounter in production: finding why services failed, dealing with disk-filling logs, setting up log forwarding, and correlating events across systems.


Finding Why a Service Failed

Scenario: nginx stopped, users are reporting 502s, need to find the root cause fast

# Step 1: Check the service status (often shows the last few log lines)
$ systemctl status nginx.service
 nginx.service - A high performance web server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled)
     Active: failed (Result: exit-code) since Thu 2026-03-19 10:45:22 UTC; 2min ago
    Process: 12345 ExecStart=/usr/sbin/nginx -g daemon on; (code=exited, status=1/FAILURE)
   Main PID: 12345 (code=exited, status=1/FAILURE)

Mar 19 10:45:22 web01 nginx[12345]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)

# Step 2: Full logs for the service
$ journalctl -u nginx.service --since "10 minutes ago" --no-pager

# Step 3: If the error is clear (port 80 in use), find what's using it
$ sudo ss -tlnp | grep :80
LISTEN  0  511  0.0.0.0:80  0.0.0.0:*  users:(("apache2",pid=9876,fd=4))
# Apache is running on port 80 — kill it or reconfigure

# Step 4: For less obvious failures, check related logs
$ journalctl -u nginx.service -u php-fpm.service --since "10 minutes ago"

# Step 5: Check nginx's own error log
$ tail -50 /var/log/nginx/error.log

Scenario: A service is crash-looping but logs don't show why

# Check for core dumps
$ coredumpctl list | tail -5
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2026-03-19 10:45:22 UTC   12345  1000  1000  11 present   /usr/bin/myapp

# Get details
$ coredumpctl info 12345

# Check if the process is being OOM-killed (not always in service logs)
$ journalctl -k --grep="oom.*myapp|killed.*myapp" --since "1 hour ago"

# Check the exit code pattern
$ journalctl -u myapp.service | grep "Main process exited"
Mar 19 10:45:22 server01 systemd[1]: myapp.service: Main process exited, code=killed, status=9/KILL
# Status 9/KILL = SIGKILL = likely OOM killer

# Check for segfaults
$ journalctl -k --grep="segfault" --since "1 hour ago"
Mar 19 10:45:21 server01 kernel: myapp[12345]: segfault at 0 ip 00007f... sp 00007f... error 4 in libc.so

Correlating Logs Across Services

Scenario: User reports intermittent errors, need to trace a request through multiple services

# If you have request IDs in your logs, correlate by ID
$ REQUEST_ID="abc-123-def-456"

# Search across all journal entries
$ journalctl --since "1 hour ago" | grep "$REQUEST_ID"

# Search across specific services
$ journalctl -u nginx -u myapp -u postgresql --since "1 hour ago" | grep "$REQUEST_ID"

# Search log files directly
$ grep -rn "$REQUEST_ID" /var/log/nginx/ /var/log/myapp/

# If no request ID, correlate by timestamp
# Find the approximate time from the user report, then:
$ journalctl --since "2026-03-19 10:44:00" --until "2026-03-19 10:46:00" \
    -u nginx -u myapp -u redis -u postgresql

# Output as JSON for easier parsing
# Output as JSON for easier parsing
# (journal JSON export uses __REALTIME_TIMESTAMP; there is no REALTIME_USEC field)
$ journalctl --since "10:44" --until "10:46" -u nginx -u myapp -o json | \
    jq -r '[.__REALTIME_TIMESTAMP, ._SYSTEMD_UNIT, .MESSAGE] | @tsv' | \
    sort -n

# Check for error patterns across all services in a time window
$ journalctl --since "10:44" --until "10:46" -p err

Disk Filling from Logs — Emergency Response

Scenario: Disk is 98% full, /var/log is consuming 50 GB

# Step 1: Find the biggest log files
$ sudo du -ah /var/log/ | sort -rh | head -20
45G     /var/log/myapp/debug.log
3.2G    /var/log/syslog
1.1G    /var/log/auth.log
...

# Step 2: Emergency truncate (NOT delete) the biggest offender
# TRUNCATE keeps the inode and file descriptor intact
$ sudo truncate -s 0 /var/log/myapp/debug.log

# WHY truncate, not delete?
# If you `rm` a file that a process has open, the disk space is NOT freed
# until the process closes the file descriptor. truncate frees space immediately.

# If you already deleted the file, find the process holding it open:
$ sudo lsof +L1 | grep deleted
myapp   12345  myapp  5w  REG  8,2  48000000000  0 /var/log/myapp/debug.log (deleted)
# Restart the process to free the space:
$ sudo systemctl restart myapp

# Or truncate through the /proc fd:
$ sudo truncate -s 0 /proc/12345/fd/5

# Step 3: Emergency logrotate
$ sudo logrotate -f /etc/logrotate.d/myapp

# Step 4: Fix the root cause — why is it logging so much?
# Check if debug logging was accidentally left on
$ grep -i "debug\|log.level\|loglevel" /etc/myapp/config.yaml

# Step 5: Set up disk monitoring to catch this earlier
# Quick check you can add to cron:
$ df -h /var | awk 'NR==2 {gsub(/%/,"",$5); if($5 > 85) print "WARNING: /var is " $5 "% full"}'

Dealing with journal eating all disk

# Check journal disk usage
$ journalctl --disk-usage
Archived and active journals take up 8.5G in /var/log/journal.

# Emergency: vacuum journal to reclaim space
$ sudo journalctl --vacuum-size=500M    # Keep only 500 MB
$ sudo journalctl --vacuum-time=3d      # Keep only 3 days

# Set permanent limits
$ sudo mkdir -p /etc/systemd/journald.conf.d/
$ cat <<EOF | sudo tee /etc/systemd/journald.conf.d/size.conf
[Journal]
SystemMaxUse=1G
SystemMaxFileSize=128M
SystemKeepFree=4G
EOF

$ sudo systemctl restart systemd-journald

Remote Log Forwarding Setup

Scenario: Set up centralized logging from 50 servers to a log aggregator

On the receiving server (log aggregator):

# /etc/rsyslog.d/10-receive.conf
module(load="imtcp")
input(type="imtcp" port="514" ruleset="remote")

template(name="RemoteFilePath" type="string"
    string="/var/log/remote/%HOSTNAME%/%PROGRAMNAME%.log")

ruleset(name="remote") {
    action(type="omfile" dynaFile="RemoteFilePath")
}
$ sudo systemctl restart rsyslog
$ sudo ss -tlnp | grep 514    # Verify listening

On each sending server:

# /etc/rsyslog.d/50-forward.conf

# Forward everything via TCP (reliable)
*.* @@logserver.example.com:514

# Or forward only important stuff
auth,authpriv.*  @@logserver.example.com:514
*.err            @@logserver.example.com:514
local0.*         @@logserver.example.com:514

# With queue for reliability (if network is flaky)
*.* action(
    type="omfwd"
    target="logserver.example.com"
    port="514"
    protocol="tcp"
    queue.type="LinkedList"
    queue.filename="fwdRule1"
    queue.maxdiskspace="1g"
    queue.saveonshutdown="on"
    action.resumeRetryCount="-1"
    action.resumeInterval="30"
)
$ sudo systemctl restart rsyslog

# Test
$ logger -t test-forward "Testing log forwarding from $(hostname)"

# Verify on the log server — use the SENDING server's hostname here;
# running $(hostname) on the log server would give the log server's own name
$ tail /var/log/remote/SENDING_HOSTNAME/test-forward.log

Filtering Journal by Time, Priority, and Unit

Comprehensive filtering examples

# By time — absolute
$ journalctl --since "2026-03-19 10:00:00" --until "2026-03-19 11:00:00"

# By time — relative
$ journalctl --since "2 hours ago"
$ journalctl --since "yesterday" --until "today"
$ journalctl --since "2026-03-18" --until "2026-03-19"

# By priority (shows that level AND above)
$ journalctl -p err                    # err + crit + alert + emerg
$ journalctl -p warning                # warning and above
$ journalctl -p 0..3                   # emerg through err (range)

# By unit
$ journalctl -u nginx.service
$ journalctl -u nginx -u php-fpm       # Multiple units
$ journalctl -u "docker*"              # Glob pattern

# By PID
$ journalctl _PID=12345

# By executable
$ journalctl _EXE=/usr/sbin/sshd

# By boot
$ journalctl -b                        # Current boot
$ journalctl -b -1                     # Previous boot
$ journalctl -b abc123                 # Specific boot ID

# Combine filters (AND logic)
$ journalctl -u nginx -p err --since "1 hour ago"

# By UID (all messages from a specific user)
$ journalctl _UID=1000

# Kernel messages only
$ journalctl -k
$ journalctl -k -p err                 # Kernel errors

# With grep (pattern matching on message content)
$ journalctl --grep="connection refused|timeout" --since "1 hour ago"
$ journalctl --grep="error" -i         # Case-insensitive

# Reverse order (newest first)
$ journalctl -r -n 50                  # Last 50, newest first

# Show only N lines
$ journalctl -n 100                    # Last 100 lines

# No pager (useful for piping)
$ journalctl -u nginx --no-pager | wc -l

Persistent Journal Storage

Scenario: System rebooted and you need logs from before the reboot, but they're gone

# Check current storage mode
$ journalctl --header | grep "File path"
# If the path is /run/log/journal/ → volatile (memory only)
# If the path is /var/log/journal/ → persistent

# Enable persistent storage
$ sudo mkdir -p /var/log/journal

# Set ownership and permissions
$ sudo systemd-tmpfiles --create --prefix /var/log/journal

# Restart journald
$ sudo systemctl restart systemd-journald

# Verify
$ journalctl --list-boots
# Should now show multiple boots after the change takes effect

# Configure retention
$ cat <<EOF | sudo tee /etc/systemd/journald.conf.d/persistence.conf
[Journal]
Storage=persistent
SystemMaxUse=2G
MaxRetentionSec=30day
EOF

$ sudo systemctl restart systemd-journald

Log-Based Alerting Patterns

Scenario: Set up basic alerts for critical log events

# Method 1: systemd path unit watching a log file
# /etc/systemd/system/log-alert.path
[Unit]
Description=Monitor auth log for break-in attempts

[Path]
PathModified=/var/log/auth.log

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/log-alert.service
[Unit]
Description=Check auth log for suspicious patterns

[Service]
Type=oneshot
ExecStart=/usr/local/bin/check-auth-alerts.sh
#!/bin/bash
# /usr/local/bin/check-auth-alerts.sh
#
# Alert when the tail of an auth log shows a burst of failed SSH logins.
#
# Usage: check-auth-alerts.sh [logfile] [threshold]
#   logfile   - log file to inspect   (default: /var/log/auth.log)
#   threshold - alert when more than this many failures are seen (default: 20)

# Count "Failed password" lines among the last 100 lines of the file given
# as $1. A missing/unreadable file counts as 0 instead of spewing errors.
count_recent_failures() {
  tail -100 -- "$1" 2>/dev/null | grep -c "Failed password"
}

main() {
  local logfile="${1:-/var/log/auth.log}"
  local threshold="${2:-20}"
  local failures

  failures=$(count_recent_failures "$logfile")
  if [ "$failures" -gt "$threshold" ]; then
    echo "ALERT: $failures failed password attempts detected on $(hostname)" | \
        mail -s "SSH Brute Force Alert" ops@example.com
  fi
}

main "$@"
# Method 2: rsyslog action for immediate alerting
# /etc/rsyslog.d/60-alerts.conf
# Alert on any emergency-level message
*.emerg action(type="omprog" binary="/usr/local/bin/syslog-alert.sh")

# Alert on authentication failures
# (NB: a facility selector like "auth.*" cannot be combined with a
#  ":msg, ..." property filter on one line — use a RainerScript conditional)
if $syslogfacility-text == "auth" and $msg contains "FAILED" then {
    action(type="omprog" binary="/usr/local/bin/auth-alert.sh")
}
# Method 3: journalctl in a monitoring loop
$ journalctl -f -p crit -o json | while read -r line; do
    msg=$(echo "$line" | jq -r '.MESSAGE')
    unit=$(echo "$line" | jq -r '._SYSTEMD_UNIT // "kernel"')
    echo "CRITICAL: [$unit] $msg" >> /var/log/critical-alerts.log
    # Send to monitoring system, Slack, PagerDuty, etc.
done

Investigating Sudden Log Volume Spike

Scenario: Disk usage spiked, need to find which service is flooding logs

# Check journal usage per unit
$ journalctl --disk-usage
Archived and active journals take up 4.5G in /var/log/journal.

# Find the chattiest services (by message count)
$ journalctl --since "1 hour ago" --output=json | \
    jq -r '._SYSTEMD_UNIT // ._TRANSPORT' | sort | uniq -c | sort -rn | head -10
  45678 myapp.service
   2345 nginx.service
   1234 sshd.service
    567 systemd-journald.service

# Find the chattiest by time period
$ for unit in myapp nginx sshd; do
    count=$(journalctl -u "$unit" --since "1 hour ago" --no-pager -q | wc -l)
    echo "$count $unit"
done | sort -rn

# Check traditional log files
$ find /var/log -name "*.log" -mmin -60 -exec ls -lh {} \; | sort -k5 -rh | head -10

# Check if it's a specific error repeating
# Use -o cat to strip timestamps — otherwise every line is unique and
# sort | uniq -c never aggregates anything
$ journalctl -u myapp.service --since "1 hour ago" --no-pager -o cat | \
    sort | uniq -c | sort -rn | head -5
  12345 ERROR: Connection to database refused
   8901 WARN: Retrying connection (attempt 1/3)

# Root cause: database is down, app is retry-spamming logs
# Fix: restart the database, then verify logs calm down

Recovering Logs from a Crashed System

Scenario: System crashed, need to extract logs from the disk

# Boot from live USB, mount the crashed system's disk
$ sudo mount /dev/sda2 /mnt

# Check traditional logs
$ ls -la /mnt/var/log/
$ tail -100 /mnt/var/log/syslog

# Read journal from the mounted disk
$ journalctl --directory=/mnt/var/log/journal/ --list-boots
$ journalctl --directory=/mnt/var/log/journal/ -b -1 --no-pager | tail -200
$ journalctl --directory=/mnt/var/log/journal/ -b -1 -p err

# Export for later analysis
$ journalctl --directory=/mnt/var/log/journal/ -b -1 -o json > /tmp/crashed-logs.json

# Check for kernel crash information
$ cat /mnt/var/log/kern.log | tail -50
$ journalctl --directory=/mnt/var/log/journal/ -b -1 -k | tail -50

Timezone Issues in Logs

Scenario: Logs from different servers have different timestamps, making correlation impossible

# Check system timezone
$ timedatectl
               Local time: Thu 2026-03-19 10:45:22 UTC
           Universal time: Thu 2026-03-19 10:45:22 UTC
                 Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes

# journalctl respects the local timezone by default
# Force UTC output:
$ journalctl --utc -u nginx --since "1 hour ago"

# Check what timezone rsyslog uses
$ grep -i "timezone\|utc" /etc/rsyslog.conf

# Force UTC timestamps in rsyslog
# /etc/rsyslog.d/00-utc.conf
$ActionFileDefaultTemplate RSYSLOG_FileFormat
# This uses RFC 3339 timestamps with timezone offset

# For application logs, ensure they log in UTC
# Most apps: set TZ=UTC in the environment
$ sudo systemctl edit myapp.service
[Service]
Environment="TZ=UTC"

# Standardize all servers to UTC (best practice for servers)
$ sudo timedatectl set-timezone UTC

Power One-Liners

Structured journalctl queries

# All errors since last boot
journalctl -b -p err

# Specific unit, last 30 minutes, JSON output
journalctl -u nginx --since "30 min ago" -o json-pretty

# Follow multiple units
journalctl -fu nginx -fu php-fpm

[!TIP] When to use: Targeted log investigation without grep gymnastics.