The Memory Leak Marathon¶
Category: The Incident · Domains: linux-performance, containers · Read time: ~5 min
Setting the Scene¶
I was the SRE lead at a ride-sharing company in a mid-size city. Not Uber — think 40 engineers, 200,000 monthly active users, and a core matching service written in Go that ran on Kubernetes. The matching service had to stay up: when it went down, riders couldn't get matched with drivers, and the whole business stopped. It ran as a Deployment with 6 replicas, each pod limited to 2Gi memory. It had been running fine for a year, with deploys roughly every two weeks.
What Happened¶
Week 1 — We start getting occasional OOMKilled pods during Friday and Saturday evening peaks. Not every weekend, just sometimes. Kubernetes restarts them, traffic rebalances, and nobody notices except me looking at the pod restart metrics on Monday morning. I file a low-priority ticket: "investigate occasional OOMKills on matching service."
Week 3 — The OOMKills are happening more frequently. I graph memory usage over 30 days in Grafana. There's a clear upward trend: each pod starts at about 800Mi after deploy and grows roughly 15Mi per day, putting steady-state usage around 1.1Gi after three weeks and 1.2Gi after four. Evening peak traffic spikes on top of that rising baseline are what push the oldest pods toward the 2Gi limit. The leak is slow, only visible at the timescale of weeks.
Week 4 — I try to reproduce it in staging. I deploy the same code, run load tests for 8 hours. Memory grows from 800Mi to 850Mi. That's consistent with the leak rate, but it's so small that a normal staging test (20 minutes of load testing) would never catch it.
Week 4, Friday 9:15 PM — Peak traffic. Three of six pods OOMKill simultaneously. The surviving three pods absorb the traffic, but their memory usage spikes from the increased load, and within ten minutes, two more OOMKill. We're down to one pod serving all traffic. Response times go from 200ms to 4 seconds. Riders are churning.
Week 4, Friday 9:20 PM — I manually kill the remaining pod to force all six to restart fresh. They come up at 800Mi each, traffic stabilizes. The outage lasted about 8 minutes. But it'll happen again in two weeks unless we find the leak.
Week 5 — I attach pprof to one production pod (Go's built-in profiler). I take heap snapshots every 6 hours for three days. The profile shows a growing number of goroutine objects holding references to completed ride matches. The matching engine creates a goroutine for each match request, but under a specific error condition — when a driver cancels during the matching window — the goroutine's context never gets cancelled and the match result stays in memory forever. It's about 60 KB per leaked goroutine. At our volume, that's roughly 250 leaks per day per pod.
The Moment of Truth¶
The leak was conditional: it only happened on driver cancellations during active matching, which was about 2% of all requests. In staging, our load tests used synthetic data with zero cancellations. The bug was invisible in any environment that didn't behave like production over a multi-day window.
The Aftermath¶
We fixed the goroutine leak (a missing defer cancel() on a context) and deployed it. Memory usage flatlined at 800Mi per pod. But the real changes were operational. We added a weekly soak test in staging: 48 hours of load with realistic cancellation patterns, monitoring for memory growth. We set up memory trend alerting: not just "is memory high?" but "has memory been growing linearly for 72 hours?", which catches leaks well before they hit limits. We also reduced the pod memory limit to 1.5Gi so that any future leak would trigger OOMKills, and therefore alerts, weeks sooner.
The Lessons¶
- Long-running soak tests catch what unit tests can't: A memory leak that grows at 15Mi/day is invisible in a 20-minute CI pipeline. Soak tests with realistic data patterns and multi-day runtimes are essential for long-running services.
- Memory profiling in production is not optional: Go's pprof, Java's async-profiler, and Python's tracemalloc exist to be used in production. Continuous profiling services like Pyroscope make this even easier.
- Don't ignore slow resource growth: a steady linear memory increase is almost always a leak (or an unbounded cache). Set up trend-based alerting that detects steady growth over days, not just threshold-based alerts that only fire at the crisis point.
What I'd Do Differently¶
I'd run continuous profiling from day one using Pyroscope or Grafana's continuous profiling integration. Having historical heap profiles makes leak investigations trivial — you can compare profiles across days and see exactly what's growing. I'd also build cancellation and error scenarios into every load test by default, not just happy-path requests.
The Quote¶
"The leak was 60 kilobytes at a time. It took three weeks to become a crisis and five minutes to fix. The hard part was the three weeks in between."
Cross-References¶
- Topic Packs: Linux Performance, Containers Deep Dive, Continuous Profiling, Linux Memory Management
- Case Studies: Kubernetes Ops