The Monitoring Save
Category: The Close Call · Domains: monitoring, disk-and-storage · Read time: ~5 min
Setting the Scene
Friday, 4:47 PM. Most of the team had already mentally checked out. I was closing browser tabs and thinking about dinner when my phone buzzed: a PagerDuty alert from our Datadog integration. "Disk usage on payments-api-03 has exceeded 85%. Current: 87%."
Normally, 87% disk usage is a "we should look at this Monday" situation. But this was payments-api-03 — one of four servers handling all credit card processing for our e-commerce platform. Twelve thousand transactions per hour during peak.
What Happened
I almost snoozed the alert. It was Friday. I was tired. But something nagged at me — this server had been at 62% disk usage on Wednesday. A 25-point jump in two days was unusual.
I SSH'd in and ran df -h. The /var/log partition was at 91% — it had climbed 4% in the 20 minutes since the alert fired. I ran du -sh /var/log/* and found the culprit: /var/log/payments-debug.log was 38GB and growing.
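The two triage commands above, as a runnable sketch (GNU `sort -rh` is assumed for the size-ranked listing):

```shell
# How full is the partition backing /var/log?
df -h /var/log

# What inside it is biggest? Human-readable sizes, largest first, top five.
# Permission errors from unreadable files are discarded.
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 5
```

Running `du` under `sort -rh` rather than eyeballing raw output is what surfaces a single runaway file immediately.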
Someone had merged a PR on Wednesday that turned on debug-level logging for the payment gateway integration. The intent was to troubleshoot a specific intermittent SSL handshake failure. The PR said "temporary, will revert Friday." Nobody reverted it. The log was growing at 2.1GB per hour.
I did the math. The partition had 47GB free on Wednesday. At 2.1GB/hour, we'd hit 100% at roughly 3 AM Saturday. When /var/log fills up on these boxes, the application can't write to its transaction log. When it can't write to its transaction log, it returns 500 errors. Every credit card transaction on our platform would fail.
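The back-of-envelope step generalizes into a reusable check. A sketch, where the `time_to_full` helper is hypothetical (`df -P` and `awk` are used for portable parsing):

```shell
# Estimate hours until a partition fills, given an observed growth rate.
time_to_full() {
    mount=$1 rate_gb_hr=$2
    # df -P prints one POSIX-format row per filesystem; field 4 is
    # available space in 1K blocks
    avail_kb=$(df -P "$mount" | awk 'NR==2 {print $4}')
    # hours = available GB / growth rate in GB per hour
    awk -v a="$avail_kb" -v r="$rate_gb_hr" \
        'BEGIN { printf "%.1f hours until full\n", a / (r * 1024 * 1024) }'
}

# This incident's numbers: the partition backing /var/log, log growing
# at the observed 2.1GB/hour
time_to_full /var/log 2.1
```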
I ran logrotate --force /etc/logrotate.d/payments to rotate and compress the current log, then truncate -s 0 /var/log/payments-debug.log to zero out the active file. I opened the PR that had enabled debug logging and reverted it with a one-line change: LOG_LEVEL=INFO. Deployed in 12 minutes.
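The truncate choice is worth spelling out. A small demonstration on a throwaway file (the demo path is made up) of why `truncate`, not `rm`, is the right tool for an actively written log:

```shell
# A process that has the log open keeps its file handle through truncate;
# with rm, the deleted-but-still-open file would pin its disk blocks and
# free no space until the process restarted.
LOG=/tmp/payments-debug-demo.log
printf 'lots of debug output\n' > "$LOG"

# Empty the file in place: same inode, size drops to zero immediately
truncate -s 0 "$LOG"
ls -l "$LOG"
```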
The Moment of Truth
If I'd snoozed that alert, the disk would have filled at 3 AM on a Saturday. Our on-call rotation had a gap that weekend — the primary was at a wedding and the secondary hadn't updated their phone number. The payment system would have been down for hours before anyone noticed. Estimated revenue impact: $340,000.
The Aftermath
We implemented three changes. First, we added trend-based alerts: if disk usage increases more than 10% in 24 hours, it fires immediately regardless of absolute threshold. Second, all debug logging changes now require a revert deadline enforced by a GitHub Action that auto-creates a revert PR. Third, we added log rotation with a 5GB cap per file and 10GB cap per service.
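The third change can be sketched as a logrotate policy. The directives are real logrotate options, but the file is written locally here for illustration; in production it would live at /etc/logrotate.d/payments, and the exact caps mirror the write-up:

```shell
# Size-capped rotation for the payments logs: rotate any file that
# passes 5GB, keep two compressed generations (~10GB worst case per service).
cat > ./payments.logrotate <<'EOF'
/var/log/payments*.log {
    # rotate as soon as the file exceeds 5GB (needs a frequent cron run
    # to be timely, not just the daily default)
    maxsize 5G
    # two old generations kept, so ~10GB worst case per service
    rotate 2
    compress
    missingok
    notifempty
    # truncate in place so the running app keeps its file handle
    copytruncate
}
EOF
```

`copytruncate` trades a tiny window of possible lost lines for not having to signal the application, which is usually the right trade for debug logs.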
The Lessons
- Proactive monitoring pays for itself: That Datadog license costs us $18/server/month. It saved us from a six-figure outage. The ROI calculation is not even close.
- Trend-based alerts beat threshold alerts: 87% disk usage is boring. A 25-point increase in 48 hours is terrifying. The rate of change tells you more than the absolute number.
- Investigate alerts even on Friday: The alert doesn't know it's the weekend. The disk doesn't stop filling because you're tired. If you're on-call, you're on-call.
What I'd Do Differently
I'd put log partitions on separate volumes with hard size limits. When /var/log shares a filesystem with the application, a log explosion can take down the service. Separate volumes keep a logging problem a logging problem instead of letting it become an availability problem.
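A sketch of what that separation could look like in /etc/fstab (the LVM device name is hypothetical; `nofail` keeps the box bootable if the log volume goes missing):

```
# /etc/fstab — dedicated volume for logs (device name illustrative)
/dev/vg0/logs   /var/log   ext4   defaults,nofail   0   2
```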
The Quote
"The best monitoring alert is the one you're tempted to ignore but investigate anyway."
Cross-References
- Topic Packs: Monitoring, Disk and Storage
- Case Studies: Disk Capacity Planning (if relevant)