Thinking Out Loud: Linux Logging

A senior SRE's internal monologue while working through a real logging investigation. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A production server's root partition hit 95% capacity overnight. The alert fired at 4 AM but didn't page anyone because it was a warning, not critical. Now it's at 97% and climbing. I need to find what's consuming the space and stop it before the partition fills up completely and causes cascading failures.

The Monologue

Root partition at 97%. First, let me see what's growing. My instinct says logs, but let me verify instead of assuming.

df -h /
du -sh /* 2>/dev/null | sort -rh | head -10

/var is 42GB. On a 50GB root partition, that's the culprit. Let me drill down.

du -sh /var/* 2>/dev/null | sort -rh | head -10

/var/log is 38GB. Logs. Of course. Let me find the biggest offenders.

du -sh /var/log/* 2>/dev/null | sort -rh | head -10

/var/log/app/api.log — 28GB. One single log file, 28 gigabytes. That's the API application's log file. No rotation, no compression. Just a single, ever-growing file.
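The three du calls above are the same pattern applied one level deeper each time. If I wanted to script that drill-down, a minimal sketch might look like this. It assumes GNU du and sort, and the function name biggest is made up for illustration.

```shell
#!/bin/sh
# biggest DIR [N]: list the N largest entries directly under DIR (default 10).
# Hypothetical helper name; assumes GNU du and sort (for -h and -rh).
biggest() {
    dir=${1:-/}
    count=${2:-10}
    # -x: do not descend into other filesystems mounted below each entry
    # -s: one summary line per entry, -h: human-readable sizes
    du -xsh "${dir%/}"/* 2>/dev/null | sort -rh | head -n "$count"
}
```

Usage would be biggest /, then biggest /var, then biggest /var/log, following whichever entry sorts to the top.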

Mental Model: Log Files Are Unbounded by Default

Applications write logs. They never stop. If nothing rotates, compresses, or truncates those logs, they will eventually fill the disk. This isn't a question of "if" but "when." Every production service needs log rotation configured on day one, not as an afterthought when the disk fills up.

Before I set up rotation, I need to stop the immediate bleeding. I can't just delete the file because the application has it open and the space won't be reclaimed until the file handle is closed. Let me truncate it instead.

ls -lh /var/log/app/api.log

28GB, last modified 2 seconds ago. Still being written to. Let me truncate it.

sudo truncate -s 0 /var/log/app/api.log

Wait — before I do that, let me grab the last 1000 lines in case we need them for recent debugging.

sudo tail -n 1000 /var/log/app/api.log > /tmp/api-log-tail.txt
sudo truncate -s 0 /var/log/app/api.log

Now let me verify the space was reclaimed.

df -h /

Down to about 41%, which is what freeing 28GB on a 50GB partition should give. Good. The truncate works because we're zeroing the same inode, so the application's file handle is still valid. If I had done rm followed by touch, the old file would still consume space (held open by the process) and the app would need to be restarted to switch to the new file.

Mental Model: Truncate vs Delete for Open Files

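truncate zeros the file in place (same inode). The process keeps writing to the same file descriptor and the space is freed immediately. One caveat: if the writer did not open the file in append mode, it resumes at its old offset and the file becomes sparse, so ls -l still shows the old size while du shows the real usage. rm removes the directory entry, but the disk space isn't freed until the process closes the file handle. You can find these "deleted but open" files with lsof +L1. Always truncate active log files, never delete them.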
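To convince myself of the difference, a throwaway demo helps. This is a sketch that assumes Linux (for /proc) and GNU coreutils (stat -c, truncate); the directory, file name, and fd number are arbitrary, and the shell itself stands in for the long-running application.

```shell
#!/bin/sh
# Demo in a scratch directory; nothing here touches real logs.
demo_dir=$(mktemp -d)
log="$demo_dir/demo.log"

# Hold the file open on fd 3 in append mode, like a long-running log writer.
exec 3>>"$log"
head -c 1048576 /dev/zero >&3            # write 1 MiB through the open handle

inode_before=$(stat -c %i "$log")
truncate -s 0 "$log"                     # zero the file in place
echo "after truncate" >&3                # the same handle keeps working
inode_after=$(stat -c %i "$log")
size_after_truncate=$(stat -c %s "$log")
echo "inode $inode_before -> $inode_after, size now $size_after_truncate bytes"

rm "$log"                                # removes only the directory entry
echo "written after rm" >&3              # still succeeds: the inode lives on
# readlink inherits fd 3, so /proc/self/fd/3 resolves the same open file.
fd_target=$(readlink /proc/self/fd/3)
echo "fd 3 points at: $fd_target"        # path is marked "(deleted)"

exec 3>&-                                # closing the handle frees the inode
rm -rf "$demo_dir"
```

The inode number is unchanged by truncate, while after rm the handle still points at a file that no longer has a name, which is exactly what lsof +L1 reports.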

Now let me set up proper rotation so this doesn't happen again.

cat <<'EOF' | sudo tee /etc/logrotate.d/api-app
/var/log/app/api.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
    maxsize 500M
}
EOF

I'm using copytruncate instead of the default rotate behavior because the application doesn't handle SIGHUP for log reopening. copytruncate copies the file, then truncates the original — so the app's file handle stays valid. There's a tiny window where a few lines might be lost during the copy-truncate, but for application logs, that's acceptable.
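Conceptually, copytruncate comes down to two steps, and the lost-lines window sits between them. The following is a simplified illustration of that idea in a scratch directory, not logrotate's actual implementation; the file names and fd number are arbitrary.

```shell
#!/bin/sh
# Simplified copy-then-truncate illustration; assumes GNU stat.
work=$(mktemp -d)
log="$work/api.log"

exec 4>>"$log"                           # stand-in for the app's open handle
echo "line written before rotation" >&4

inode_before=$(stat -c %i "$log")

cp "$log" "$log.1"                       # step 1: copy current contents aside
# Lines the app writes at this point are missing from the copy and are
# wiped by the truncate below. This is the small loss window.
: > "$log"                               # step 2: truncate the original in place

echo "line written after rotation" >&4   # app keeps using the same handle

inode_after=$(stat -c %i "$log")
rotated=$(cat "$log.1")
current=$(cat "$log")
echo "rotated copy: $rotated"
echo "live file:    $current"

exec 4>&-
rm -rf "$work"
```

Because the live file keeps its inode, the writer never needs to be told that a rotation happened, which is the property that matters for an app that cannot reopen its log on a signal.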

Actually, let me check — does this app support SIGHUP for log rotation?

sudo ls -l /proc/"$(pgrep -of 'api.server')"/fd/1
grep -i log /opt/api/config.yaml

It's a Python uvicorn app writing to stdout, redirected to the log file via systemd. Systemd manages the redirect. Let me check if there's a better approach using journald.

systemctl cat api-server | grep -i log

StandardOutput=file:/var/log/app/api.log. That's the problem — they're using the file: output directive which writes directly to a file, bypassing journald entirely. If they switched to journal (the default), journald would handle rotation automatically.

But changing that requires a service restart, and I don't want to restart the API in production right now. The logrotate config I added will handle it for now. I'll file a ticket to switch to journald output.
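For the ticket, the change could be a small drop-in rather than an edit to the unit file itself. A sketch under assumptions: the unit name api-server.service is inferred from the systemctl cat call above, while the drop-in file name 10-journal.conf and the helper write_journal_dropin are made up. The helper takes the target directory as an argument so it can be tried against a scratch directory first.

```shell
#!/bin/sh
# write_journal_dropin DIR: write a drop-in that sends stdout to journald.
# On the real host DIR would be /etc/systemd/system/api-server.service.d
# (needs root), followed by `systemctl daemon-reload` and a service restart.
# If the unit also sets StandardError=file:..., that line would need the
# same override (unknown here, so it is left out).
write_journal_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-journal.conf" <<'EOF'
[Service]
StandardOutput=journal
EOF
}
```

Once the service has been restarted with this in place, the logrotate entry for api.log would only be covering a file that no longer grows and could be retired.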

Let me also check if journald itself is configured with reasonable limits.

journalctl --disk-usage
grep -Ev '^(#|$)' /etc/systemd/journald.conf

Journald is using 2.1GB, with a 4GB system maximum configured. That's well-managed. If the API was using journald, we'd never have this problem.

Let me test the logrotate config to make sure it works.

sudo logrotate -d /etc/logrotate.d/api-app

Dry run looks good. And one more check: let me make sure the logrotate timer is actually enabled (many systemd-based distributions run logrotate from a timer unit rather than from cron).

systemctl status logrotate.timer

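Active. Good. The timer runs daily and will pick up our new config. One caveat: maxsize is only evaluated when logrotate actually runs, so with a daily timer the file can still grow well past 500M between runs.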

One last thing — let me add a disk usage alert that fires at 85% instead of 95%. A 95% warning with no page is useless — by the time someone sees it, we're already at 97%.
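Until the monitoring change lands, the tiering logic itself is simple enough to sketch. This uses the 85% warn and 92% page thresholds from this write-up; the function names are hypothetical, and a real setup would express this in the monitoring system's own rule language rather than a script.

```shell
#!/bin/sh
# Tiered thresholds: warn at 85%, page at 92%.
WARN_PCT=85
PAGE_PCT=92

# classify_usage PCT: map a usage percentage to ok / warn / page.
classify_usage() {
    pct=$1
    if [ "$pct" -ge "$PAGE_PCT" ]; then
        echo page
    elif [ "$pct" -ge "$WARN_PCT" ]; then
        echo warn
    else
        echo ok
    fi
}

# root_usage_pct: current usage of / as a bare integer.
# df -P gives the portable one-line-per-filesystem layout; field 5 is "NN%".
root_usage_pct() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

usage=$(root_usage_pct)
echo "root partition: ${usage}% -> $(classify_usage "$usage")"
```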

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
|---|---|---|
| rm the big log file | truncate the file after saving the tail | rm on an open file doesn't free space until the process closes the handle |
| Set up logrotate and call it done | Check whether the app supports SIGHUP, consider journald, and use copytruncate for apps that don't handle log rotation signals | The rotation method depends on how the application handles file descriptors |
| Not check for other "deleted but open" files | Know to check lsof +L1 for hidden disk consumers | Deleted-but-open files are a common cause of "I deleted stuff but the disk is still full" |
| Set a single disk alert at 90% | Configure tiered alerts (85% warn, 92% page) to catch issues before they become critical | The closer you are to 100%, the less time is left to react before the disk fills |

Key Heuristics Used

  1. Truncate, Never Delete Open Files: Use truncate -s 0 for active log files. The file handle stays valid and space is freed immediately.
  2. Log Rotation Is Day-One Infrastructure: Every service needs rotation configured before it goes to production, not as an incident response when the disk fills.
  3. Check the Output Path: Know whether logs go through journald, syslog, or direct file writes — the rotation strategy depends entirely on this.
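For the third heuristic, one quick way to see where a running process actually sends its output is to look at what its stdout descriptor points to. A rough sketch, assuming Linux /proc; the helper name log_sink and the wording of its labels are made up, and the socket case is only a likely interpretation, not a guarantee.

```shell
#!/bin/sh
# log_sink PID: report where a process's stdout (fd 1) points.
# Assumes Linux /proc; inspecting another user's process needs root.
log_sink() {
    target=$(readlink "/proc/$1/fd/1") || return 1
    case $target in
        socket:*)             echo "socket: often journald or a syslog socket ($target)" ;;
        pipe:*)               echo "pipe: another process collects the output ($target)" ;;
        /dev/null)            echo "discarded: /dev/null" ;;
        /dev/pts/*|/dev/tty*) echo "terminal: $target" ;;
        /*)                   echo "file: $target" ;;
        *)                    echo "other: $target" ;;
    esac
}
```

A result starting with file: means rotation is my problem (logrotate or similar); a socket usually means journald or syslog already owns retention.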

Cross-References

  • Primer — Linux logging architecture, journald, syslog, and file descriptor basics
  • Street Ops — Log investigation commands, disk usage triage, and logrotate configuration
  • Footguns — Deleting open log files, missing log rotation, and the file: output directive bypassing journald