Capacity Planning Footguns¶
- Planning capacity on averages instead of peaks. Your average CPU is 35%. Plenty of room, right? But during the daily peak from 11 AM to 1 PM, CPU hits 88% and requests queue. Users experience slow load times for 2 hours every day, but your dashboard's 24-hour average looks fine.
Fix: Always plan on the 95th or 99th percentile of your peak window, not the daily or weekly average. Use `max_over_time()` or `quantile_over_time(0.95, ...)` in Prometheus. If your peak exceeds 70% CPU, you need more capacity now.
- Ignoring seasonality in growth projections. You see 15% growth in January and extrapolate 15% monthly for the year. But January had a promotional campaign. February through April are flat. You over-provision by 3x and waste budget, or you see flat months and under-invest, then get crushed by the next campaign.
Fix: Use at least 12 months of data for projections. Decompose traffic into trend + seasonal pattern. Separate organic growth from event-driven spikes. Model them independently.
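One way to sketch the decomposition, on two years of hypothetical monthly traffic with a recurring year-end promotion (all figures invented for illustration):

```python
# Hypothetical requests/day averaged per calendar month, two years of data.
monthly = [100, 95, 96, 98, 100, 103, 105, 107, 110, 112, 140, 150,    # year 1
           115, 110, 112, 114, 116, 119, 121, 124, 127, 129, 161, 172]  # year 2

def seasonal_indices(series):
    """Per-calendar-month seasonal factor: month's level vs its year's average."""
    y1_avg = sum(series[:12]) / 12
    y2_avg = sum(series[12:]) / 12
    return [(series[m] / y1_avg + series[m + 12] / y2_avg) / 2 for m in range(12)]

indices = seasonal_indices(monthly)

# Divide out seasonality, then fit organic growth on what remains.
deseasonalized = [v / indices[i % 12] for i, v in enumerate(monthly)]
organic_monthly_growth = (deseasonalized[-1] / deseasonalized[0]) ** (1 / 23) - 1
```

Raw month-over-month jumps here run past 10% around the promotion, but the deseasonalized organic growth comes out well under 1% per month, which is the number you should project from.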
- Confusing utilization with saturation. Your system shows 60% CPU utilization. Seems healthy. But `vmstat` shows a run queue of 12 on a 4-core system. Processes are waiting. Latency is high. Utilization says "fine" while the system is actually saturated during bursts.
Fix: Track saturation metrics alongside utilization: run queue depth (CPU), swap activity (memory), I/O queue depth (disk), socket backlog (network). A system with 60% average utilization but bursty saturation needs more capacity.
Under the hood: This is Brendan Gregg's USE Method (Utilization, Saturation, Errors) in action. Utilization is what percentage of time the resource is busy. Saturation is how much extra work is queued. A disk at 60% utilization with an I/O queue depth of 30 is saturated — requests are waiting.
`iostat -x 1` shows both: `%util` (utilization) and `avgqu-sz` (queue depth/saturation).
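A small sketch of reading both signals from that output. The sample line is fabricated, and column layout varies across sysstat versions (newer releases rename `avgqu-sz` to `aqu-sz`), so this parses by header name rather than fixed position:

```python
# Fabricated `iostat -x` output: 60% utilized, but 30 requests queued.
SAMPLE = """Device            r/s     w/s   avgqu-sz   %util
sda             120.0   340.0      30.00   60.00"""

def parse_iostat(text):
    """Map each device row to {column_name: value} using the header line."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]

for dev in parse_iostat(SAMPLE):
    util = float(dev["%util"])
    qdepth = float(dev["avgqu-sz"])
    # USE Method reading: moderate utilization, but a deep queue means
    # requests are waiting -- the device is saturated.
    saturated = qdepth > 2.0
```

The utilization number alone would pass a casual glance; the queue depth is what exposes the waiting.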
- No buffer for incident response. Your cluster runs at 85% capacity during normal peak. A node fails, traffic redistributes, and the remaining nodes hit 100%. Now you have a cascading failure — high load causes health check timeouts, which causes more evictions, which causes more redistribution.
Fix: Maintain enough headroom that losing one node (N+1) keeps remaining nodes below 70% at peak. For critical services, plan for N+2. Test this by actually draining a node during peak hours.
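The N+1 arithmetic is simple enough to sketch directly (node counts and utilizations below are illustrative):

```python
def survives_node_loss(nodes, peak_util_pct, failures=1, ceiling_pct=70.0):
    """After `failures` nodes drop and load redistributes evenly,
    do the survivors stay at or below the ceiling?"""
    remaining = nodes - failures
    if remaining <= 0:
        return False
    total_load = nodes * peak_util_pct   # total load in "node-percent" units
    return total_load / remaining <= ceiling_pct

# 5 nodes at 85% peak: one failure pushes survivors to 5*85/4 = 106.25%.
survives_node_loss(5, 85.0)   # False -- cascading-failure territory
# 6 nodes at 55% peak: one failure lands survivors at 6*55/5 = 66%.
survives_node_loss(6, 55.0)   # True -- inside the 70% ceiling
```

The same function with `failures=2` gives you the N+2 check for critical services.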
- Treating disk space as the only disk metric. You have 2TB free on your database volume. Plenty of room. But your disk IOPS are maxed at 3,000 and your queries are waiting on I/O. You add more disk space when what you needed was faster storage.
Fix: Track all four disk dimensions: space, IOPS (random read/write), throughput (sequential MB/s), and latency. Usually IOPS or latency hits the wall before space does, especially on cloud volumes where IOPS is tied to provisioned size.
Default trap: AWS EBS `gp3` volumes provide 3,000 baseline IOPS regardless of size, but `gp2` volumes provide 3 IOPS per GB, so a 100GB `gp2` volume only gets 300 IOPS. If you migrated from `gp2` to `gp3` without adjusting provisioned IOPS, you might have 3,000 IOPS (an upgrade) or you might have lost the burst credits that `gp2` provided. Always check your volume type and provisioned IOPS with `aws ec2 describe-volumes`.
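The `gp2` sizing rule above, as a one-liner, with the AWS-documented floor of 100 IOPS and cap of 16,000 IOPS added:

```python
def gp2_baseline_iops(size_gb):
    """gp2 baseline: 3 IOPS per provisioned GB, min 100, max 16,000."""
    return min(max(3 * size_gb, 100), 16_000)

gp2_baseline_iops(100)     # 300  -- far below gp3's 3,000 baseline
gp2_baseline_iops(1000)    # 3000 -- the break-even point with gp3's baseline
gp2_baseline_iops(10)      # 100  -- floor applies to tiny volumes
```

Below 1TB, a `gp2`-to-`gp3` migration at default settings is an IOPS upgrade; above 1TB, `gp3` at its 3,000 default can be a silent downgrade unless you provision more.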
- Forgetting to account for operational overhead. Your cluster can handle 10,000 rps across 5 nodes at 80% CPU. But during a rolling deployment, one node is unavailable. During compaction, disk I/O doubles temporarily. During backup, network throughput spikes. These operations eat into your headroom silently.
Fix: Include operational activities in your capacity model: deployments, backups, compaction, log rotation, health checks. Measure resource usage during these operations and add it to your peak load model.
- Right-sizing once and never revisiting. Six months ago you sized your containers at 500m CPU / 512Mi memory based on careful measurement. Since then, the team added three API endpoints, an audit logging middleware, and switched JSON libraries. Actual usage shifted but nobody updated the resource specs.
Fix: Review container and VM sizing quarterly. Automate the review with VPA recommendations or custom Prometheus queries comparing requests vs actual usage. Sizing is a continuous process, not a one-time event.
- Using cloud provider "unlimited" as a capacity plan. "We're on AWS, we can just scale." But auto-scaling has lag (minutes), service quotas exist (EC2 limits, API rate limits, IP pools), and your application may not scale linearly. The database becomes the bottleneck that no amount of app tier scaling can fix.
Fix: Know your service quotas and request increases proactively. Test auto-scaling under realistic load. Identify which tier is the actual bottleneck (usually the database). Have a documented capacity model even in the cloud.
- Ignoring memory fragmentation and cache pressure. `free` shows 4GB available. But your application allocates in a pattern that fragments memory, `malloc` starts failing or slowing down, and the kernel's page cache gets evicted. Application performance drops even though "plenty of memory" is available.
Fix: Monitor `MemAvailable` (not `MemFree`), slab cache growth, and page reclaim activity (`pgmajfault` in `/proc/vmstat`). For JVM and similar runtimes, monitor GC pause times — they spike when the OS is reclaiming pages underneath.
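A sketch of the `MemAvailable`-vs-`MemFree` check, parsed here from an invented `/proc/meminfo` snippet; on a real host you would read the file itself:

```python
# Invented snippet: MemFree looks generous, MemAvailable tells the truth.
SAMPLE_MEMINFO = """MemTotal:       16384000 kB
MemFree:         4096000 kB
MemAvailable:     512000 kB
Cached:          1024000 kB"""

def meminfo_kb(text):
    """Parse /proc/meminfo-style lines into {field: kB value}."""
    out = {}
    for line in text.strip().splitlines():
        key, rest = line.split(":", 1)
        out[key] = int(rest.split()[0])   # first token after the colon is kB
    return out

mem = meminfo_kb(SAMPLE_MEMINFO)
# MemFree says 4 GB free, but only ~0.5 GB is genuinely available:
# the "plenty of memory" trap from the paragraph above.
low_memory = mem["MemAvailable"] < 0.1 * mem["MemTotal"]
```

Alerting on `MemAvailable` as a fraction of `MemTotal` catches the squeeze that a `MemFree` threshold misses.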
- No capacity plan at all — reacting instead of forecasting. You add capacity when things break. Every scaling event is an emergency. Procurement takes 2 weeks, so you're degraded for 2 weeks. You never have time to optimize because you're always firefighting the next bottleneck.
Fix: Start simple. A spreadsheet with current usage, peak usage, capacity, and a linear exhaust date for each major resource. Update it monthly. Present it quarterly. This alone puts you ahead of 80% of teams.
Remember: The simplest capacity plan is a "time to exhaustion" calculation for each major resource: `(total_capacity - current_peak_usage) / monthly_growth_rate = months_until_full`. If any resource shows less than 3 months, act now. If less than 6 months, plan now. This one formula, applied to CPU, memory, disk, and network, catches 90% of capacity surprises.
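The formula is a one-line function; the numbers below are hypothetical (disk in GB, growing 50 GB/month):

```python
def months_until_full(total_capacity, current_peak, monthly_growth):
    """(total - peak) / growth, in months; None if usage isn't growing."""
    if monthly_growth <= 0:
        return None
    return (total_capacity - current_peak) / monthly_growth

m = months_until_full(total_capacity=2000, current_peak=1700, monthly_growth=50)
# m == 6.0: past the 3-month "act now" line, right at the 6-month "plan now" line.
```

Run it once per resource (CPU, memory, disk, network) and sort ascending: the smallest number is your next bottleneck.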