
Capacity Planning — Trivia & Interesting Facts

Surprising, historical, and little-known facts about capacity planning.


Google's "Borg" paper revealed they run at 60% average utilization

Google's 2015 Borg paper revealed that their clusters run at approximately 60% average CPU utilization — remarkably high compared to the industry average of 15-25%. This efficiency gap represents billions of dollars of wasted hardware across the industry and is one of the primary drivers of capacity planning as a discipline.


Most companies over-provision by 3-5x

Studies by Gartner and others consistently find that enterprise server workloads use only 20-30% of their provisioned capacity on average. Organizations over-provision by 3-5x because the cost of under-provisioning (outages, revenue loss) is perceived as far greater than the cost of waste, even when it amounts to millions of dollars annually.


Netflix deliberately runs close to capacity limits

Netflix practices what they call "right-sizing" — running services as close to their actual resource needs as possible. Combined with Chaos Engineering (randomly terminating instances), this approach ensures their systems can handle failures without the enormous idle capacity buffers most companies maintain.


The "thundering herd" problem has crashed more services than hardware failures

A thundering herd occurs when many clients simultaneously retry after a brief outage, creating a demand spike that prevents recovery. This capacity-related failure pattern has caused more prolonged outages than actual hardware failures. Exponential backoff with jitter — where retries are randomly spread out — is the standard mitigation.
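The "full jitter" variant of this mitigation draws each retry delay uniformly from zero up to the exponential cap, so a crowd of clients that failed at the same instant spreads its retries out over time. A minimal sketch (the function names and parameters are illustrative, not from any particular library):

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def retry_with_backoff(operation, max_attempts=5):
    """Call `operation`, sleeping a jittered, exponentially growing
    delay between attempts so retries don't stampede the service."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(backoff_delay(attempt))
```

Without the jitter — a plain `base * 2**attempt` sleep — every client retries at the same instants and the herd simply arrives in synchronized waves.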


Black Friday capacity planning starts in January

Major retailers begin Black Friday capacity planning 10-11 months in advance. Amazon and Walmart run synthetic load tests that simulate 10-20x normal traffic starting in Q2. The infrastructure cost of Black Friday preparation at a top-5 retailer is estimated at $50-100 million annually, much of it for capacity that's used for only 48 hours.


Little's Law from 1961 is still the foundation of capacity math

Little's Law (L = λ × W), published by John Little in 1961, relates the average number of items in a system (L) to the arrival rate (λ) and the average time each item spends in the system (W). This 60-year-old queueing theory result remains the mathematical foundation for calculating server capacity, connection pool sizes, and queue depths in modern systems.
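The arithmetic is a one-liner. As a worked example: a service receiving 200 requests per second with 250 ms average latency holds about 50 requests in flight at any moment, so its connection pool should have at least roughly 50 connections:

```python
def littles_law(arrival_rate, avg_time_in_system):
    """Little's Law: L = λ × W.

    arrival_rate: items arriving per second (λ)
    avg_time_in_system: average seconds each item spends in the system (W)
    Returns L, the average number of items concurrently in the system.
    """
    return arrival_rate * avg_time_in_system


# 200 req/s at 250 ms average latency -> ~50 requests in flight,
# so a connection pool sized below ~50 will queue under this load.
in_flight = littles_law(200, 0.25)  # → 50.0
```

The same formula rearranges to answer the other capacity questions: given a pool size L and latency W, the sustainable arrival rate is λ = L / W.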


CPU throttling in Kubernetes is worse than running out of memory

In Kubernetes, exceeding CPU limits causes throttling (slowed execution), while exceeding memory limits causes OOM kills (process termination). Counter-intuitively, throttling is often worse operationally — it causes unpredictable latency spikes that are hard to diagnose. Many SRE teams now set CPU requests but remove CPU limits entirely.
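In a pod spec, that practice looks like the sketch below (the names, image, and values are illustrative): a CPU request for scheduling, a memory request and limit, and no CPU limit at all.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server          # illustrative name
spec:
  containers:
  - name: api
    image: example/api:1.0  # illustrative image
    resources:
      requests:
        cpu: "500m"         # guaranteed share used for scheduling
        memory: "512Mi"
      limits:
        memory: "512Mi"     # memory limit kept: an OOM kill is at least visible
        # no cpu limit: avoids CFS throttling and its latency spikes
```

The trade-off is that a container without a CPU limit can burst into a node's spare cycles, which is usually what you want — contention is still arbitrated proportionally by the requests.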


The "noisy neighbor" problem drove the entire container movement

Before containers, co-locating workloads on shared servers frequently caused performance interference — one application's CPU burst would slow down others. This "noisy neighbor" problem was a primary motivation for both VMs and containers. Linux cgroups, which containers use for isolation, were specifically designed to solve this capacity-sharing challenge.


Predictive autoscaling was pioneered at scale by Google around 2014

Google was among the first to use ML-based predictive autoscaling, scaling infrastructure up before anticipated demand spikes rather than reactively. They observed that reactive autoscaling had a 5-15 minute lag during which users experienced degraded performance. Predictive scaling reduced this to near-zero for predictable traffic patterns.


Disk I/O is the most commonly overlooked capacity bottleneck

While teams obsessively monitor CPU and memory, disk I/O is the most frequently overlooked capacity constraint. A 2022 survey of production incidents found that 23% of performance-related outages were caused by disk I/O saturation — often from logging, temporary files, or database write-ahead logs filling local storage.