The Log That Filled the Disk

Category: The Incident Domains: logging, linux-ops Read time: ~5 min


Setting the Scene

Small B2B SaaS company, about 25 engineers. I was the only person with "ops" in their title, which meant I was the entire infrastructure team. We ran on six bare-metal servers leased from a hosting provider — no cloud, no containers, just good old Ubuntu 20.04 and systemd. Our main app was a Django monolith behind Gunicorn and Nginx. I'd set up rsyslog, logrotate, the whole nine yards. Or so I thought.

What Happened

Tuesday 2:00 PM — I get a Slack message from a developer: "My deploy just failed with 'No space left on device.'" I SSH in. df -h shows / at 100%. My heart rate goes up.

2:05 PM — I run du -sh /var/log/* and find it: application.log is 47 gigabytes. On a 50 GB root partition. The log file is larger than the rest of the operating system combined.

2:08 PM — I check logrotate. It's configured for /var/log/app/*.log. Our application writes to /var/log/application.log. The two paths look almost identical at a glance, but the glob matches nothing. Logrotate has been happily rotating nothing for seven months while the actual log file grew without bound.
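The offending config looked roughly like this (the paths are from the incident; the stanza body is a plausible reconstruction, not our exact settings):

```
# /etc/logrotate.d/app — what logrotate was told to rotate
/var/log/app/*.log {
    daily
    rotate 14
    compress
}
# What the application actually wrote, which no stanza matched:
#   /var/log/application.log
```

A dry run with logrotate -d /etc/logrotate.d/app lists which files each stanza actually matches without rotating anything, and would have shown this stanza matching nothing.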

2:10 PM — But why 47 GB? On a normal day we generate maybe 200 MB of logs. I tail the file and see it: DEBUG level logging. Every single database query, every cache hit, every template render, every middleware call — all being logged at DEBUG level. Someone had set LOG_LEVEL=DEBUG in the systemd environment file three weeks ago while troubleshooting a bug and never changed it back.

2:15 PM — The disk is full. Services are failing. I can't even write a temporary file to run a script. I truncate the log: > /var/log/application.log. Forty-seven gigabytes freed instantly. Services start recovering.
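Truncation, not deletion, is the right move here: rm only unlinks the name, and the blocks stay allocated until the writing process closes its file descriptor — on a full disk with a live writer, rm frees nothing. A minimal sketch of the same trick as a reusable helper (the function name is mine):

```shell
#!/bin/sh
# free_log_space truncates a file to zero bytes in place. The writer's
# open file descriptor is untouched, so the service keeps running while
# the space is reclaimed immediately.
free_log_space() {
    # ":" is a no-op; redirecting its output truncates the target file.
    : > "$1"
}
```

One caveat: if the writer opened the log without O_APPEND, it keeps writing at its old offset and leaves a sparse file behind; append-mode logging, the common case, behaves correctly after truncation.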

2:18 PM — But now we have cascading problems. During the disk-full period, PostgreSQL couldn't write WAL files and went into recovery mode. Redis couldn't persist its AOF file — it's running, but has lost about 20 minutes of session data. Nginx couldn't write its logs, and requests that needed to buffer proxy data to disk started failing with 500 errors.

2:30 PM — PostgreSQL finishes recovery. I restart Nginx. Redis rebuilds from its last RDB snapshot. Everything is up. I fix the logrotate path, set the log level back to WARNING in the systemd env file, and add a hard 10 GB cap using logrotate's maxsize directive, which forces a rotation as soon as the file exceeds that size, regardless of the schedule.
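The fixed stanza looked roughly like this (retention and compression settings here are illustrative, not our exact values):

```
# /etc/logrotate.d/app — corrected path plus a hard size trigger
/var/log/application.log {
    daily
    maxsize 10G           # rotate early if the file blows past 10 GB
    rotate 14
    compress
    delaycompress
    copytruncate          # truncate in place; no app restart needed
}
```

One thing to keep in mind: maxsize only fires when logrotate itself runs, so on the default daily schedule there's still a day-long window. Running logrotate hourly (via cron or a systemd timer) is the usual companion change.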

2:45 PM — I sit in my chair and contemplate how a mistyped directory name in logrotate cost us 30 minutes of downtime and some lost session data.

The Moment of Truth

/var/log/app/*.log versus /var/log/application.log. That was it. Seven months of log file growth, invisible until it ate the entire disk — at our normal ~200 MB/day, unrotated, that's already around 40 GB. The DEBUG logging then tripled the growth rate for the final three weeks, taking it from manageable-but-still-wrong to catastrophic.

The Aftermath

I moved /var/log to its own partition so a runaway log file could never kill the root filesystem again. I added disk usage monitoring with alerts at 70% and 90% on every partition. I wrote a pre-deploy check that verifies LOG_LEVEL is not DEBUG in production. And I added logrotate validation to our config management — a nightly script that checks every logrotate config against actual files on disk and alerts if there's a mismatch.
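The nightly logrotate validation can be sketched in a few lines of shell (the config-dir argument and WARN format are my own; the real check ran against /etc/logrotate.d):

```shell
#!/bin/sh
# For every rotation target declared in a logrotate config directory,
# warn if the glob matches no file on disk — the exact failure mode
# from this incident.
check_logrotate_globs() {
    confdir="$1"
    # Rotation targets are the config lines that start with "/".
    grep -h '^/' "$confdir"/* 2>/dev/null | sed 's/[{].*//' |
    while read -r pattern; do
        # Unquoted expansion applies the glob; an unmatched glob stays
        # literal, so the -e test fails and we emit a warning.
        for p in $pattern; do
            [ -e "$p" ] || echo "WARN: no files match $p"
        done
    done
}
```

Wire the output into whatever alerting you already have; a mismatch that sits silent for seven months should instead page someone the first night it appears.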

The Lessons

  1. Log rotation is not optional — and must match reality: Logrotate configs must reference the actual paths your application writes to. Validate this, don't assume it.
  2. Separate log partitions: A runaway log should fill /var/log, not /. Partition isolation prevents a logging problem from becoming a system-wide catastrophe.
  3. Debug logging must have auto-expiry: If someone enables DEBUG in production, it should automatically revert after a configurable window (1 hour, max). Never let debug logging persist indefinitely in production.
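The auto-expiry idea in lesson 3 can be sketched with systemd-run, which schedules a one-shot revert so DEBUG can never persist past the window (the env file path and myapp.service unit name are stand-ins, not our real names):

```shell
#!/bin/sh
# set_log_level rewrites the LOG_LEVEL line in a systemd
# EnvironmentFile: set_log_level <envfile> <level>
set_log_level() {
    sed -i "s/^LOG_LEVEL=.*/LOG_LEVEL=$2/" "$1"
}
# Typical use (requires root; paths and unit name are hypothetical):
#   set_log_level /etc/myapp/env DEBUG
#   systemctl restart myapp.service
#   # Schedule the automatic revert one hour out, whether or not
#   # anyone remembers:
#   systemd-run --on-active=1h --unit=debug-expiry sh -c \
#     "sed -i 's/^LOG_LEVEL=.*/LOG_LEVEL=WARNING/' /etc/myapp/env \
#      && systemctl restart myapp.service"
```

The point of the transient unit is that the revert survives the engineer closing their laptop; only an explicit systemctl stop debug-expiry keeps DEBUG alive past the hour.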

What I'd Do Differently

I'd implement structured logging to a centralized system (like the ELK stack or Loki) from day one, with local log files being ephemeral and aggressively rotated. The application should not depend on local disk for log persistence. I'd also use systemd's LogLevelMax directive to enforce a ceiling on log verbosity at the service level, independent of application config.
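A drop-in unit file is enough to enforce that ceiling (myapp.service is a stand-in for the real unit name):

```
# /etc/systemd/system/myapp.service.d/10-log-ceiling.conf
# Cap what the service can emit to the journal, regardless of the
# application's own LOG_LEVEL setting.
[Service]
LogLevelMax=warning
```

After systemctl daemon-reload and a service restart, messages above warning priority are dropped at the journal. One caveat: LogLevelMax only governs output routed through systemd (stdout/stderr and syslog); a file the application opens and writes itself bypasses it entirely — which is exactly why the pre-deploy LOG_LEVEL check still matters.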

The Quote

"Forty-seven gigabytes of logs told us everything about the application except the one thing that mattered: 'you're about to run out of disk.'"

Cross-References