# Thinking Out Loud: Linux Logging
A senior SRE's internal monologue while working through a real logging investigation. This isn't a tutorial — it's a window into how experienced engineers actually think.
## The Situation
A production server's root partition hit 95% capacity overnight. The alert fired at 4 AM but wasn't paged because it's a warning, not critical. Now it's at 97% and climbing. I need to find what's consuming the space and stop it before the partition fills up completely and causes cascading failures.
## The Monologue
Root partition at 97%. First, let me see what's growing. My instinct says logs, but let me verify instead of assuming.
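That verification is a `df` plus a few `du` drill-downs. A sketch, using the paths from this investigation and standard GNU coreutils flags:

```shell
# Overall usage of the root filesystem:
df -h /

# Drill down one level at a time; -x stays on this filesystem,
# sort -rh puts the biggest human-readable sizes first:
sudo du -xh --max-depth=1 /        2>/dev/null | sort -rh | head
sudo du -xh --max-depth=1 /var     2>/dev/null | sort -rh | head
sudo du -xh --max-depth=1 /var/log 2>/dev/null | sort -rh | head
```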
/var is 42GB. On a 50GB root partition, that's the culprit. Let me drill down.
/var/log is 38GB. Logs. Of course. Let me find the biggest offenders.
/var/log/app/api.log — 28GB. One single log file, 28 gigabytes. That's the API application's log file. No rotation, no compression. Just a single, ever-growing file.
### Mental Model: Log Files Are Unbounded by Default
Applications write logs. They never stop. If nothing rotates, compresses, or truncates those logs, they will eventually fill the disk. This isn't a question of "if" but "when." Every production service needs log rotation configured on day one, not as an afterthought when the disk fills up.
Before I set up rotation, I need to stop the immediate bleeding. I can't just delete the file because the application has it open and the space won't be reclaimed until the file handle is closed. Let me truncate it instead.
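A quick sanity check first, to confirm it really is the live log and not an abandoned one:

```shell
# Size and last-modified time of the suspect log:
ls -lh /var/log/app/api.log
stat -c 'size=%s bytes  modified=%y' /var/log/app/api.log
```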
28GB, last modified 2 seconds ago. Still being written to. Let me truncate it.
Wait — before I do that, let me grab the last 1000 lines in case we need them for recent debugging.
sudo tail -n 1000 /var/log/app/api.log > /tmp/api-log-tail.txt
sudo truncate -s 0 /var/log/app/api.log
Now let me verify the space was reclaimed.
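The before/after check is just `df` on the root filesystem, plus a `du` on the log directory to confirm where the space came back from:

```shell
df -h /              # Use% should have dropped sharply
du -sh /var/log/app  # the log directory itself should now be tiny
```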
Down to 31%. Good. The truncate works because we're zeroing the same inode — the application's file handle is still valid. If I had done rm followed by touch, the old file would still consume space (held open by the process) and the app would need to be restarted to switch to the new file.
### Mental Model: Truncate vs Delete for Open Files
`truncate` zeros the file in place (same inode): the process keeps writing through the same file descriptor, and the space is freed immediately. `rm` removes the directory entry, but the disk space isn't freed until the process closes its file handle. You can find these "deleted but open" files with `lsof +L1`. Always truncate active log files; never delete them.
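A throwaway demo of the truncate side, assuming GNU coreutils. The stand-in "application" appends with O_APPEND (via `>>`, which is how most loggers open their files), and that is what makes in-place truncation clean:

```shell
tmpdir=$(mktemp -d)

# A stand-in "application" appending to its log via O_APPEND (>>):
( exec >> "$tmpdir/demo.log"; for i in $(seq 1 200); do echo "line $i"; done )

before=$(stat -c '%s' "$tmpdir/demo.log")
truncate -s 0 "$tmpdir/demo.log"
after=$(stat -c '%s' "$tmpdir/demo.log")
echo "before=${before} after=${after}"   # after is 0; same inode throughout

rm -rf "$tmpdir"
```

One caveat worth knowing: if the writer opened the file *without* O_APPEND, truncation leaves its write offset alone, so the next write recreates the old apparent size as a sparse file. Disk blocks are still freed, but `ls -l` looks alarming.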
Now let me set up proper rotation so this doesn't happen again.
cat <<'EOF' | sudo tee /etc/logrotate.d/api-app
/var/log/app/api.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
    maxsize 500M
}
EOF
I'm using copytruncate instead of the default rotate behavior because the application doesn't handle SIGHUP for log reopening. copytruncate copies the file, then truncates the original — so the app's file handle stays valid. There's a tiny window where a few lines might be lost during the copy-truncate, but for application logs, that's acceptable.
Actually, let me check — does this app support SIGHUP for log rotation?
It's a Python uvicorn app writing to stdout, redirected to the log file via systemd. Systemd manages the redirect. Let me check if there's a better approach using journald.
StandardOutput=file:/var/log/app/api.log. That's the problem — they're using the file: output directive which writes directly to a file, bypassing journald entirely. If they switched to journal (the default), journald would handle rotation automatically.
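A sketch of the inspection and the ticketed fix; the unit name `api.service` is an assumption here:

```shell
# How the redirect was found (unit name assumed):
systemctl cat api.service | grep -i StandardOutput

# The ticketed fix: an override dropping the file: redirect so journald
# owns the stream (takes effect only after a service restart):
sudo systemctl edit api.service
#   [Service]
#   StandardOutput=journal
```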
But changing that requires a service restart, and I don't want to restart the API in production right now. The logrotate config I added will handle it for now. I'll file a ticket to switch to journald output.
Let me also check if journald itself is configured with reasonable limits.
Journald is using 2.1GB, with a 4GB system maximum configured. That's well-managed. If the API was using journald, we'd never have this problem.
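The two journald checks behind that conclusion:

```shell
# How much disk the journal currently uses:
journalctl --disk-usage

# The configured cap (may live in journald.conf or a drop-in):
grep -r --include='*.conf' -E '^\s*SystemMaxUse' /etc/systemd/ 2>/dev/null
```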
Let me test the logrotate config to make sure it works.
Dry run looks good. And one more check: let me make sure the logrotate timer is actually enabled (on systemd distros it runs from a timer unit now, not the old cron.daily job).
Active. Good. The timer runs daily and will pick up our new config.
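Concretely, both checks look like this; `-d` makes logrotate parse the config and report what it *would* do without touching anything:

```shell
# Dry run of just our new config:
sudo logrotate -d /etc/logrotate.d/api-app

# Make sure the systemd timer that drives logrotate is live:
systemctl is-enabled logrotate.timer
systemctl list-timers logrotate.timer --no-pager
```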
One last thing — let me add a disk usage alert that fires at 85% instead of 95%. A 95% warning with no page is useless — by the time someone sees it, we're already at 97%.
## What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| `rm` the big log file | `truncate` the file after saving the tail | `rm` on an open file doesn't free space until the process closes the handle |
| Set up logrotate and call it done | Check whether the app supports SIGHUP, consider journald, and use `copytruncate` for apps that don't handle log rotation signals | The rotation method depends on how the application handles file descriptors |
| Not check for other "deleted but open" files | Know to check `lsof +L1` for hidden disk consumers | Deleted-but-open files are a common cause of "I deleted stuff but the disk is still full" |
| Set a single disk alert at 90% | Configure tiered alerts (85% warn, 92% page) to catch issues before they become critical | The closer you are to 100%, the faster the remaining space fills up |
## Key Heuristics Used
- **Truncate, Never Delete Open Files**: Use `truncate -s 0` for active log files. The file handle stays valid and space is freed immediately.
- **Log Rotation Is Day-One Infrastructure**: Every service needs rotation configured before it goes to production, not as an incident response when the disk fills.
- **Check the Output Path**: Know whether logs go through journald, syslog, or direct file writes; the rotation strategy depends entirely on this.
## Cross-References
- Primer — Linux logging architecture, journald, syslog, and file descriptor basics
- Street Ops — Log investigation commands, disk usage triage, and logrotate configuration
- Footguns — Deleting open log files, missing log rotation, and the `file:` output directive bypassing journald