Linux Logging — Footguns¶
> [!WARNING]
> These mistakes fill disks, lose evidence, expose secrets, and blind you during incidents. Every item here has caused a real production problem.
1. No logrotate configured — logs fill the disk¶
You deploy an application, it writes to /var/log/myapp.log, and nobody sets up log rotation. The file grows until /var is 100% full. Other services can't write logs. The database can't write its WAL. The system grinds to a halt.
# Symptoms:
df -h /var
# /dev/sda2 50G 50G 0 100% /var
du -sh /var/log/* | sort -rh | head -10
# 45G /var/log/myapp.log
# Emergency fix:
# Truncate (don't delete) the file — process keeps the file handle:
> /var/log/myapp.log
# or:
truncate -s 0 /var/log/myapp.log
# WRONG: rm /var/log/myapp.log
# The process still holds the file descriptor. Space isn't freed.
# lsof +L1 | grep deleted shows these zombie files.
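If someone already ran `rm`, the space can still be reclaimed without restarting the writer by truncating the deleted-but-open file through /proc. A minimal sketch; `reclaim_deleted` is a made-up helper, and the PID/FD values come off the lsof output:

```shell
# Truncate a deleted-but-still-open file via its /proc fd entry.
# Read <pid> and <fd> from `lsof +L1 | grep deleted`, e.g.:
#   myapp 12345 root 4w REG 8,2 48318382080 0 /var/log/myapp.log (deleted)
#         ^pid        ^fd
reclaim_deleted() {       # usage: reclaim_deleted <pid> <fd>
  : > "/proc/$1/fd/$2"    # truncates the underlying open file to zero bytes
}
# Here: reclaim_deleted 12345 4
```

If the writer opened the file without O_APPEND it keeps writing at its old offset, leaving a sparse file; the same caveat applies to the `>` truncate above.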
# Set up logrotate:
cat > /etc/logrotate.d/myapp << 'EOF'
/var/log/myapp.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
    maxsize 500M
}
EOF
# Note: copytruncate can lose lines written during rotation (see footgun 7).
# Test the config:
logrotate -d /etc/logrotate.d/myapp # Dry run
logrotate -f /etc/logrotate.d/myapp # Force rotation now
# Verify logrotate runs daily:
systemctl status logrotate.timer
# or check cron:
ls -la /etc/cron.daily/logrotate
2. Journal eating all disk space (SystemMaxUse not set)¶
systemd-journald stores logs in binary format under /var/log/journal/. By default it can use up to 10% of the filesystem it lives on, capped at 4GB. Several gigabytes of journal data, combined with application logs, are often enough to fill a tight /var partition.
# Check journal disk usage:
journalctl --disk-usage
# Archived and active journals take up 12.5G in the file system.
# Check current limits:
systemctl show systemd-journald | grep -i max
# SystemMaxUse= <-- empty = default (10% of filesystem, capped at 4GB)
# SystemMaxFileSize=
# Set limits in /etc/systemd/journald.conf:
[Journal]
# (systemd config files don't support trailing comments, so they go above each line)
# Total disk usage cap:
SystemMaxUse=500M
# Max size per journal file:
SystemMaxFileSize=50M
# Don't keep logs older than 1 month:
MaxRetentionSec=1month
# Rotate journal files weekly:
MaxFileSec=1week
# Apply:
systemctl restart systemd-journald
# Emergency cleanup:
journalctl --vacuum-size=500M # Shrink to 500MB
journalctl --vacuum-time=7d # Keep only last 7 days
# Check again:
journalctl --disk-usage
3. Logging passwords and secrets in debug mode¶
You enable debug logging to troubleshoot an issue. The application starts logging full HTTP request bodies, database queries with parameters, and API calls with authentication headers. Passwords, tokens, and PII are now in plain text in your log files — and in your centralized logging system, and in your log backups.
# Common culprits:
# - DEBUG log level in production
# - HTTP request body logging with auth headers
# - Database query logging with bind parameters
# - curl -v output redirected to logs (shows Authorization headers)
# - Environment variable dumps that include secrets
# What to check:
grep -rn "password\|secret\|token\|api_key\|authorization" /var/log/ --include="*.log" 2>/dev/null | head -20
# If you find secrets in logs:
# 1. Rotate the log files immediately
logrotate -f /etc/logrotate.d/myapp
# 2. Securely delete the old logs
shred -u /var/log/myapp.log.1
# 3. Check if logs were forwarded to a central system
# and purge there too
# 4. Rotate any exposed credentials
# Prevention:
# - Never log at DEBUG level in production
# - Use structured logging with field-level redaction
# - Configure log sanitization in your log pipeline
# - Review application logging config as part of deployment
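The pipeline-sanitization idea above can be sketched as a coarse scrubber inserted before log lines hit disk. This is a last line of defense, not a substitute for field-level redaction in the app; `scrub` is a made-up name and the patterns are illustrative GNU sed:

```shell
# Mask common secret shapes before the log line is written.
# Patterns are deliberately broad examples; tune them for your own formats.
scrub() {
  sed -E \
    -e 's/(Authorization: *)(Bearer|Basic) +[^ "]+/\1\2 [REDACTED]/g' \
    -e 's/("?(password|secret|token|api_key)"? *[=:] *)"?[^",} ]+"?/\1[REDACTED]/Ig'
}
# Usage: myapp 2>&1 | scrub >> /var/log/myapp.log
```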
4. rsyslog rate limiting dropping messages silently¶
rsyslog has built-in rate limiting that drops messages when a source sends too many in a short period. During an incident — exactly when you need logs most — the flood of error messages triggers rate limiting and rsyslog silently drops them. You're left with gaps in your logs during the critical window.
# Symptoms:
# In /var/log/syslog or /var/log/messages:
# Mar 19 02:15:33 server01 rsyslogd-2177: imuxsock begins to drop messages from pid 12345 due to rate-limiting
# Check current rate limit settings:
grep -i "ratelimit" /etc/rsyslog.conf /etc/rsyslog.d/*.conf 2>/dev/null
# Default is: 200 messages per 5-second interval per source
# Increase or disable rate limiting:
# In /etc/rsyslog.conf:
$imjournalRatelimitInterval 0 # Disable rate limit for journal input
$imjournalRatelimitBurst 0
# Or for the Unix socket input:
module(load="imuxsock" SysSock.RateLimit.Interval="0")
# For specific applications that generate lots of messages:
# Create a dedicated input with higher limits
input(type="imtcp" port="10514" RateLimit.Interval="0")
# Restart rsyslog:
systemctl restart rsyslog
# Verify — force a burst of messages:
logger -t test "message" && for i in $(seq 1 500); do logger -t test "burst $i"; done
grep "begins to drop" /var/log/syslog
5. Log timestamps in the wrong timezone¶
Server A logs in UTC. Server B logs in America/New_York. Your centralized logging system shows both without normalization. During an incident, you're correlating events across servers and the timestamps are four hours apart. You reconstruct the wrong timeline.
# Check system timezone:
timedatectl
# Time zone: America/New_York (EDT, -0400)
# Check if rsyslog is using UTC or local time:
grep -i "utc\|timezone" /etc/rsyslog.conf 2>/dev/null
# Best practice: run all servers in UTC
sudo timedatectl set-timezone UTC
# For applications that log their own timestamps:
# Configure them to log in UTC (ISO 8601 format):
# 2026-03-19T02:15:33.123Z
# For rsyslog, use high-precision RFC3339 timestamps:
# In /etc/rsyslog.conf:
$ActionFileDefaultTemplate RSYSLOG_FileFormat
# This gives: 2026-03-19T02:15:33.123456+00:00
# Instead of the default:
# Mar 19 02:15:33 (no year, no timezone, no subsecond)
# For journald, timestamps are always stored in UTC internally.
# Display in UTC:
journalctl --utc
# NTP drift can also cause timestamp issues:
timedatectl show | grep NTPSynchronized
# NTPSynchronized=yes <-- good
# NTPSynchronized=no <-- timestamps may drift
chronyc tracking | grep "System time"
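When correlating logs that were recorded in different zones, GNU date can translate timestamps on the fly (the sample timestamp mirrors the scenario above):

```shell
# Convert a local-time log timestamp to UTC (GNU date):
date -u -d "2026-03-19 02:15:33 EDT" +"%Y-%m-%dT%H:%M:%SZ"
# -> 2026-03-19T06:15:33Z
# Render a UTC timestamp in another zone for comparison (needs tzdata):
TZ=America/New_York date -d "2026-03-19 06:15:33 UTC" +"%Y-%m-%d %H:%M:%S %Z"
```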
6. Volatile journal — not persisted by default on some distros¶
On some distros and minimal installs, journald ships with Storage=auto but no /var/log/journal/ directory, so it stores logs only in /run/log/journal/ (a tmpfs). This means journal logs are lost on every reboot. When a server crashes and reboots, the crash evidence is gone.
# Check if journal is persistent:
ls -la /var/log/journal/
# If this directory doesn't exist, journal is volatile!
# Check journald configuration:
grep -i "storage" /etc/systemd/journald.conf
# Storage=volatile <-- logs lost on reboot!
# Storage=auto <-- persistent IF /var/log/journal/ exists
# Storage=persistent <-- always persistent (creates directory if needed)
# Enable persistent journal:
sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald
# Or set explicitly in /etc/systemd/journald.conf:
[Journal]
Storage=persistent
sudo systemctl restart systemd-journald
# Verify:
journalctl --list-boots
# Should show multiple boots, not just the current one:
# -2 abc... Tue 2026-03-17 10:00:00 — Tue 2026-03-17 18:30:00
# -1 def... Tue 2026-03-17 18:35:00 — Wed 2026-03-18 14:00:00
# 0 ghi... Wed 2026-03-18 14:05:00 — present
# View previous boot logs (crash forensics):
journalctl -b -1 # Previous boot
journalctl -b -1 -p err # Only errors from previous boot
7. logrotate with copytruncate on append-only files¶
copytruncate works by copying the log file, then truncating the original to zero. There's a window between the copy and the truncate where new log lines are written — those lines are lost. For high-volume applications, this can mean losing seconds of log data on every rotation.
# The problem with copytruncate:
# 1. rsyslog writes line A to myapp.log
# 2. logrotate copies myapp.log to myapp.log.1
# 3. rsyslog writes lines B, C, D to myapp.log (between copy and truncate)
# 4. logrotate truncates myapp.log to 0
# 5. Lines B, C, D are LOST — not in myapp.log.1, gone from myapp.log
# copytruncate is a workaround for apps that can't reopen log files.
# Better alternatives:
# Option 1: postrotate signal (if app supports SIGHUP):
/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    postrotate
        /bin/kill -USR1 $(cat /run/nginx.pid 2>/dev/null) 2>/dev/null || true
    endscript
}
# Option 2: create directive (app opens new file automatically):
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    create 0640 myapp myapp
}
# Option 3: For rsyslog, use the built-in file output action
# with rotation support (no logrotate needed).
# When copytruncate is unavoidable (app can't reopen files, no signal):
# Accept the small data loss window, but log a note in your runbook.
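Before settling for copytruncate, it's worth confirming the app truly can't reopen its log: rename the file, send the documented signal, and check whether a fresh file (new inode) appears. An illustrative sketch with a made-up helper name:

```shell
# Returns success if the process created a fresh log file (new inode)
# after a rename + signal, i.e. copytruncate is NOT needed.
# Helper name, paths, and signal choice are illustrative.
reopened_after_rotation() {   # usage: reopened_after_rotation <log> <pid> <signal>
  local before after
  before=$(stat -c %i "$1") || return 1
  mv "$1" "$1.1"
  kill -s "$3" "$2"           # e.g. USR1 for nginx, HUP for many daemons
  sleep 1
  after=$(stat -c %i "$1" 2>/dev/null)
  [ -n "$after" ] && [ "$before" != "$after" ]
}
# e.g.: reopened_after_rotation /var/log/myapp/app.log "$(cat /run/myapp.pid)" HUP
```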
8. Not compressing old logs¶
Log files compress extremely well; a 10:1 ratio is common for text logs. Without compression, old logs take ten times the space they need. On a server generating 1GB of logs per day, that's 365GB/year uncompressed versus roughly 36GB compressed.
# Check for uncompressed old logs:
ls -lhS /var/log/*.log.* | grep -v "\.gz$\|\.xz$\|\.bz2$" | head -10
# -rw-r--r-- 1 root root 2.3G Mar 18 00:00 /var/log/syslog.1
# -rw-r--r-- 1 root root 1.8G Mar 17 00:00 /var/log/syslog.2
# Fix — add compress to logrotate config:
/var/log/myapp/*.log {
    daily
    rotate 30
    # Compress rotated files, but not the most recent rotation
    # (in case a process is still writing to it):
    compress
    delaycompress
}
# Compress existing old logs:
for f in /var/log/*.log.[0-9]*; do
  [[ "$f" == *.gz || "$f" == *.xz || "$f" == *.bz2 ]] && continue
  gzip "$f" && echo "Compressed: $f"
done
# For really old logs, use xz for better compression:
xz -9 /var/log/old-archive.log # ~15:1 compression ratio
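To sanity-check ratios on your own data before picking a compressor, compress a sample and compare sizes. The log below is synthetic; real ratios depend entirely on content:

```shell
# Generate a repetitive synthetic log and compare gzip vs xz sizes.
sample=$(mktemp /tmp/sample-XXXXXX.log)
for i in $(seq 1 20000); do
  echo "2026-03-19T02:15:33Z server01 myapp[1234]: INFO request id=$i status=200"
done > "$sample"
orig=$(stat -c %s "$sample")
gzip -k -9 "$sample"; gz=$(stat -c %s "$sample.gz")
xz   -k -9 "$sample"; xzs=$(stat -c %s "$sample.xz")
echo "original: $orig bytes, gzip: $gz bytes, xz: $xzs bytes"
rm -f "$sample" "$sample.gz" "$sample.xz"
```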
9. Audit log overflow handling¶
The Linux audit system (auditd) has a configurable action when its log fills up or the log partition is full. The default on some systems is to halt the system. Yes — the machine shuts down because a log file is full.
# Check audit configuration:
grep -E "space_left_action|admin_space_left_action|disk_full_action|disk_error_action" /etc/audit/auditd.conf
# space_left_action = SYSLOG
# admin_space_left_action = HALT <-- THIS WILL SHUT DOWN THE SERVER
# disk_full_action = HALT <-- THIS TOO
# disk_error_action = HALT
# Safe settings for most servers:
# /etc/audit/auditd.conf:
# Warn via syslog at the first threshold, rotate at the critical
# threshold and when the disk fills, log and continue on disk errors:
space_left_action = SYSLOG
admin_space_left_action = ROTATE
disk_full_action = ROTATE
disk_error_action = SYSLOG
# Also set reasonable limits: 50MB per audit log file, keep 10 rotations:
max_log_file = 50
num_logs = 10
max_log_file_action = ROTATE
# Restart auditd:
systemctl restart auditd
# Monitor audit log size:
du -sh /var/log/audit/
ls -lh /var/log/audit/audit.log
# Check if audit is backlogging (dropping events):
auditctl -s
# backlog_limit 8192
# lost 0 <-- should be 0; non-zero means events were dropped
# backlog 15 <-- current queue depth
10. Logging to a network-mounted filesystem¶
You configure rsyslog to write to a log directory on NFS. The NFS server goes down. rsyslog blocks on every write. All logging stops. Since many services wait for syslog acknowledgment, they start hanging too. The entire server becomes unresponsive because of a remote log destination.
# Symptoms:
# - Services hang or respond slowly
# - 'D' state processes accumulate
# - NFS mount is unresponsive
ps aux | awk '$8 ~ /D/ {print}'
# Shows processes in uninterruptible sleep — likely waiting on NFS
# Prevention:
# 1. Never write primary logs directly to NFS
# 2. Write locally, then forward asynchronously:
# rsyslog async forwarding to remote:
# /etc/rsyslog.d/50-remote.conf:
*.* action(type="omfwd"
target="logserver.example.com"
port="514"
protocol="tcp"
action.resumeRetryCount="-1" # Retry forever
queue.type="LinkedList" # Async queue
queue.filename="remote_fwd" # Disk-assisted queue
queue.maxDiskSpace="1g" # Buffer up to 1GB if remote is down
queue.saveOnShutdown="on" # Don't lose queued messages on restart
)
# 3. If you must use NFS, mount with soft timeout:
# /etc/fstab:
logserver:/logs /mnt/remote-logs nfs soft,timeo=10,retrans=2,nofail 0 0
11. Structured logging mistakes — unparseable formats¶
You configure applications to log in JSON for your log pipeline, but some apps log in plain text, others in different JSON schemas, and some mix JSON with plain text prefixes. Your log parser fails on half the messages, and they end up in a "parse failures" bucket that nobody looks at.
# The mess:
# {"timestamp":"2026-03-19T02:15:33Z","level":"ERROR","msg":"connection refused"}
# Mar 19 02:15:34 server01 myapp[1234]: ERROR connection refused
# 2026-03-19 02:15:35 [ERROR] myapp - {"error":"timeout","details":{"host":"db01"}}
# INFO: Starting up...
# Prevention — standardize on one format:
# 1. Pick a format: JSON lines, one per line
# 2. Document the required fields: timestamp, level, service, message
# 3. Use a logging library that enforces the schema
# 4. Validate in CI/CD: parse sample log output, fail build if invalid
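Step 4 above, validating sample log output in CI, can be sketched as a small gate. The function name is illustrative, and python3 on the build host is an assumption:

```shell
# Fail if any line of a sample log file is not valid JSON.
validate_jsonl() {            # usage: validate_jsonl <file>
  local lineno=0 rc=0 line
  while IFS= read -r line; do
    lineno=$((lineno + 1))
    if ! printf '%s' "$line" | python3 -c 'import json,sys; json.loads(sys.stdin.read())' 2>/dev/null; then
      echo "line $lineno is not valid JSON: $line" >&2
      rc=1
    fi
  done < "$1"
  return $rc
}
# e.g. in CI: myapp --smoke-test 2>&1 | tee sample.log; validate_jsonl sample.log
```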
# In rsyslog, use a template for consistent output:
template(name="json-template" type="string"
string="{\"timestamp\":\"%timegenerated:::date-rfc3339%\",\"host\":\"%hostname%\",\"severity\":\"%syslogseverity-text%\",\"facility\":\"%syslogfacility-text%\",\"tag\":\"%syslogtag%\",\"message\":\"%msg:::json%\"}\n"
)
# For apps that can't do JSON, use rsyslog to wrap them:
if $programname == 'legacy-app' then {
action(type="omfile" file="/var/log/legacy-app.json" template="json-template")
stop
}
12. Not monitoring log pipeline health¶
Your log forwarding is broken — rsyslog queue is full, the forwarder crashed, or the central log server rejected messages due to a schema change. You don't notice because nobody monitors the log pipeline itself. A week later, during an incident, you discover you have no logs for the affected timeframe.
# Monitor rsyslog internal stats:
# Enable stats in /etc/rsyslog.conf:
module(load="impstats" interval="60" severity="7" log.syslog="on")
# This logs rsyslog queue depth, message counts, and errors
# Check for forwarding failures:
grep -i "error\|fail\|queue\|suspend" /var/log/syslog | grep rsyslog | tail -20
# Monitor queue depth:
grep "impstats" /var/log/syslog | tail -5
# Look for: queue.size growing, action.failed increasing
# Monitor journal health:
journalctl --verify
# Checks journal file integrity
# Check systemd journal for missed messages:
journalctl -u systemd-journald --since "1 hour ago" | grep -i "suppress\|drop\|miss"
# Set up alerts for:
# - rsyslog queue depth > threshold
# - rsyslog action failures > 0
# - Log volume dropping to zero (no logs = broken pipeline)
# - Journal disk usage approaching SystemMaxUse
# - Time since last log message > expected interval
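The last alert, log volume dropping to zero, can be approximated with a file-freshness check. Helper name and thresholds are illustrative:

```shell
# Succeeds (exit 0) when the file hasn't been modified within max-age,
# i.e. the pipeline has gone quiet and an alert should fire.
log_is_stale() {              # usage: log_is_stale <file> <max-age-seconds>
  local age
  age=$(( $(date +%s) - $(stat -c %Y "$1") ))
  [ "$age" -gt "$2" ]
}
# e.g. from cron:
#   log_is_stale /var/log/syslog 300 && logger -p user.warning "syslog silent for 5m"
```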