Portal | Level: L2: Operations | Topics: Capacity Planning, SRE Practices, Prometheus | Domain: DevOps & Tooling
Capacity Planning - Primer¶
Why This Matters¶
Every outage post-mortem you've read that says "unexpected traffic spike" is really saying "we didn't plan capacity." Capacity planning is the discipline that prevents your infrastructure from becoming the bottleneck during the moment it matters most — when your product succeeds.
Good capacity planning means you can answer these questions at any time:

- When will we run out of resource X at current growth?
- How much headroom do we have for a traffic spike right now?
- What does the infrastructure cost look like in 6 months?
If you can't answer those, you're flying blind.
The Four Resource Dimensions¶
Every system bottleneck lives in one of four dimensions:
┌──────────────────────────────────────────────────────┐
│ System Resources │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────┐│
│ │ CPU │ │ Memory │ │ Disk │ │ Net ││
│ │ │ │ │ │ (IOPS & │ │ ││
│ │ compute │ │ capacity │ │ BW) │ │ BW & ││
│ │ cycles │ │ & speed │ │ │ │ PPS ││
│ └──────────┘ └──────────┘ └──────────┘ └──────┘│
└──────────────────────────────────────────────────────┘
| Dimension | Metrics to Track | Common Bottleneck Symptom |
|---|---|---|
| CPU | Utilization %, steal %, load average | High latency, slow response |
| Memory | Used/available, swap usage, OOM kills | OOM kills, heavy swapping |
| Disk | IOPS, throughput (MB/s), latency, queue | Slow writes, I/O wait |
| Network | Bandwidth (Mbps), packets/sec, errors | Timeouts, dropped connections |
Utilization vs Saturation — The Critical Distinction¶
These are not the same thing, and confusing them will wreck your capacity model.
Utilization: what percentage of the resource is being used.

- "CPU is at 60% utilization" means 60% of cycles are doing work.

Saturation: whether work is queuing because the resource can't keep up.

- "CPU has 14 tasks in the run queue" means processes are waiting.
Utilization: 70% Utilization: 70%
Saturation: 0 Saturation: HIGH
┌──────────────────┐ ┌──────────────────┐
│ ████████████░░░░ │ │ ████████████░░░░ │
│ │ │ Queue: ████████ │
│ System is fine │ │ System is hurting│
└──────────────────┘ └──────────────────┘
A system can be at 70% utilization with zero saturation (steady workload, no queuing) or 70% utilization with massive saturation (bursty workload, requests pile up between bursts).
Plan on saturation, not utilization. A system that hits 100% utilization for 200ms during a burst will queue requests even if the average utilization is 40%.
Analogy: Utilization vs saturation is like a highway. A highway at 70% utilization with evenly spaced cars flows smoothly. A highway at 70% average utilization with rush-hour bursts has bumper-to-bumper traffic for hours. Same average, completely different experience. This is why averages lie in capacity planning.
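The highway analogy can be made concrete with a toy queue simulation. The tick counts and capacity below are made-up numbers for illustration; the point is that identical average utilization can produce very different queueing:

```python
def max_queue_depth(arrivals_per_tick, capacity_per_tick):
    """Simulate a FIFO queue; return the deepest backlog observed."""
    queue = 0
    deepest = 0
    for arrivals in arrivals_per_tick:
        # Work exceeding this tick's capacity waits in the queue
        queue = max(0, queue + arrivals - capacity_per_tick)
        deepest = max(deepest, queue)
    return deepest

CAPACITY = 10  # requests the system can serve per tick

steady = [7] * 100     # 70% utilization, evenly spaced arrivals
bursty = [14, 0] * 50  # same 70% average, delivered in bursts

print(max_queue_depth(steady, CAPACITY))  # -> 0 (no queueing ever)
print(max_queue_depth(bursty, CAPACITY))  # -> 4 (requests wait every burst)
```

Same average load, but only the bursty workload saturates: that queue depth, not the 70% figure, is what your users feel as latency.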
Forecasting Methods¶
Linear Extrapolation¶
The simplest method: draw a line through historical data and extend it.
Usage
^
│ ╱ (projected)
│ ╱ ╱
│ ╱ ╱
│ ╱ ╱
│ ╱ ╱
│ ╱
├─────────────────────────── Capacity limit
│
└──────────────────────────────> Time
Now Exhaustion
# Prometheus: predict_linear projects a gauge forward via linear regression
# "Will the root filesystem fill within 30 days, based on the last 7 days?"
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30*24*3600) < 0
# Matches filesystems whose projected free space 30 days out is negative,
# i.e. disks on track to fill within 30 days
When it works: Steady, consistent growth (e.g., database size on a mature product).
When it lies: Anything with seasonal patterns, step-function growth (marketing campaigns), or logarithmic saturation.
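Outside Prometheus, the same idea is an ordinary least-squares fit solved for the crossing point. A minimal sketch (the disk sizes and growth rate are hypothetical):

```python
def days_until_exhaustion(samples, capacity):
    """samples: list of (day, usage) points. Returns the day the fitted
    line crosses capacity, or None if usage is flat or shrinking."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    # Standard least-squares slope and intercept
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no exhaustion date under linear extrapolation
    return (capacity - intercept) / slope

# 7 days of history: 100 GB used, growing 10 GB/day, 500 GB capacity
history = [(day, 100 + 10 * day) for day in range(7)]
print(days_until_exhaustion(history, 500))  # -> 40.0 (days from day 0)
```

This is exactly what `predict_linear` does internally, just inverted: instead of asking "what is the value at time T?", we solve for "at what T does the value hit the limit?"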
Seasonal Decomposition¶
Most traffic has patterns: daily peaks, weekly cycles, monthly billing runs.
Requests/sec
^
│ ╱╲ ╱╲ ╱╲
│ ╱ ╲ ╱ ╲ ╱ ╲ ← daily peak (2 PM)
│ ╱ ╲ ╱ ╲ ╱ ╲
│ ╱ ╲╱ ╲╱ ╲
│ ← baseline grows over weeks
└──────────────────────────────> Time
Mon Tue Wed Thu
Decompose your metric into:

1. Trend — the long-term direction (growing, flat, declining)
2. Seasonal — the repeating pattern (daily, weekly, monthly)
3. Residual — the noise
Plan capacity for: trend + seasonal_peak + safety_margin
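A toy additive decomposition shows the idea; real forecasting pipelines typically use STL or Holt-Winters, and the series, period, and margin below are invented for illustration:

```python
def plan_capacity(series, period, safety_margin=0.3):
    """Additive model: capacity = latest trend + peak seasonal + margin."""
    # Split the series into full cycles of length `period`
    cycles = [series[i:i + period]
              for i in range(0, len(series) - period + 1, period)]
    cycle_means = [sum(c) / period for c in cycles]
    trend = cycle_means[-1]  # most recent cycle's mean as the trend level
    # Seasonal component: average deviation from the cycle mean at each phase
    seasonal = []
    for phase in range(period):
        devs = [c[phase] - m for c, m in zip(cycles, cycle_means)]
        seasonal.append(sum(devs) / len(devs))
    return (trend + max(seasonal)) * (1 + safety_margin)

# Two 4-tick cycles with a midday peak; baseline grows 4 units per cycle
traffic = [10, 20, 30, 20,  14, 24, 34, 24]
print(round(plan_capacity(traffic, period=4), 1))  # -> 44.2
```

Trend (24) plus peak seasonal deviation (+10) plus a 30% margin lands at 44.2, noticeably above both the average and the raw observed peak.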
Growth Rate Modeling¶
Current usage: 800 req/s peak
Monthly growth: 12%
Planning horizon: 6 months
Month 1: 800 * 1.12 = 896 req/s
Month 2: 896 * 1.12 = 1,003 req/s
Month 3: 1003 * 1.12 = 1,124 req/s
Month 4: 1124 * 1.12 = 1,259 req/s
Month 5: 1259 * 1.12 = 1,410 req/s
Month 6: 1410 * 1.12 = 1,579 req/s
Compound growth is deceptive. 12% monthly = 2x in 6 months = 4x in a year.
Remember the Rule of 72: divide 72 by the percentage growth rate to get the doubling time. 12% monthly growth: 72/12 = 6 months to double. 10% monthly: 72/10 = 7.2 months. This mental-math shortcut lets you estimate exhaustion dates in your head during meetings.
Formula: future = current * (1 + growth_rate) ^ months
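The formula and the Rule of 72 as code, reproducing the 800 req/s table above:

```python
def project(current, growth_rate, periods):
    """future = current * (1 + growth_rate) ** periods"""
    return current * (1 + growth_rate) ** periods

def doubling_time(growth_pct):
    """Rule of 72 approximation: periods until the value doubles."""
    return 72 / growth_pct

print(round(project(800, 0.12, 6)))  # -> 1579 req/s after 6 months
print(doubling_time(12))             # -> 6.0 months to double
```

Note the compounding: month 6 is 1,579 req/s, not the 800 + 6 × 96 = 1,376 a linear mental model would give you.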
Headroom Planning¶
Headroom is the buffer between your current peak usage and your capacity limit. You need it for:
- Organic growth between capacity reviews
- Traffic spikes (flash sales, news coverage, DDoS)
- Failure scenarios (lose one node, traffic redistributes to survivors)
- Operational overhead (deployments, compactions, migrations)
The N+1 / N+2 Model¶
┌──────────────────────────────────────────┐
│ Cluster: 4 nodes, each handles 1000 rps │
│ Total capacity: 4000 rps │
│ Current peak: 2800 rps │
│ │
│ N+1: Can survive 1 node failure │
│ → 3 nodes must handle 2800 rps │
│ → Each node: 933 rps (93% util) ← tight! │
│ │
│ N+2: Can survive 2 node failures │
│ → 2 nodes must handle 2800 rps │
│ → Each node: 1400 rps ← over capacity! │
│ │
│ Decision: Add nodes so N+1 is safe │
│ → 5 nodes: each handles 560 rps at peak  │
│ → Lose 1: 4 nodes at 700 rps = fine │
└──────────────────────────────────────────┘
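The N+k check in the box is simple enough to script against every cluster. A sketch using the same 4-node numbers:

```python
def per_node_load(total_nodes, failed, peak_rps, node_capacity_rps):
    """Load per surviving node after `failed` nodes die, and whether
    the survivors can absorb it without exceeding capacity."""
    survivors = total_nodes - failed
    load = peak_rps / survivors
    return load, load <= node_capacity_rps

# The diagram's cluster: 4 nodes x 1000 rps capacity, 2800 rps peak
print(per_node_load(4, 1, 2800, 1000))  # survives, but ~933 rps is tight
print(per_node_load(4, 2, 2800, 1000))  # 1400 rps: over capacity, fails
print(per_node_load(5, 1, 2800, 1000))  # with a 5th node: 700 rps, fine
```

Running this for every service, with `failed` set to your redundancy target, turns "are we N+1 safe?" from a guess into a one-liner.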
Headroom Rules of Thumb¶
| Resource | Target Max Utilization | Why |
|---|---|---|
| CPU | 60-70% | Burst headroom, GC pauses, deployment |
| Memory | 70-80% | Page cache, fork overhead, safety |
| Disk | 70-75% | Compaction, log spikes, recovery space |
| Network | 50-60% | Retransmissions, burst absorption |
These are starting points. Tune based on your workload's burstiness.
Burst Capacity¶
Sustained throughput and burst throughput are different numbers:
# Example: A 4-core system
# Sustained: 2000 req/s (50% CPU, steady state)
# Burst (10s): 3500 req/s (90% CPU, queues build)
# Burst (60s): 2800 req/s (70% CPU, some queueing)
# Burst (5min): 2400 req/s (caches warm, 60% CPU)
Design for:

- Sustained capacity > normal peak traffic
- Burst capacity > 2x normal peak (for flash crowds)
- Drain rate > arrival rate (queues must eventually empty)
If your burst capacity equals your sustained capacity, you have no shock absorber.
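The drain-rate rule has a direct arithmetic consequence: after a burst, recovery time depends on the gap between service rate and arrival rate. A sketch using the example system's hypothetical numbers:

```python
def drain_time(backlog, arrival_rate, service_rate):
    """Seconds to empty a backlog; None if the queue never drains
    (service rate must strictly exceed arrival rate)."""
    if service_rate <= arrival_rate:
        return None
    return backlog / (service_rate - arrival_rate)

# A 10 s burst at 3500 req/s against 2000 req/s sustained capacity
# leaves a backlog of (3500 - 2000) * 10 = 15,000 queued requests.
backlog = (3500 - 2000) * 10
# Traffic drops back to 1500 req/s after the burst:
print(drain_time(backlog, 1500, 2000))  # -> 30.0 seconds to recover
# Traffic stays at sustained capacity: the queue never empties
print(drain_time(backlog, 2000, 2000))  # -> None
```

The second case is the "no shock absorber" failure mode: the burst ends, but the queue (and the latency it causes) is permanent.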
The Capacity Planning Process¶
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. Measure │ ──▶ │ 2. Model │ ──▶ │ 3. Predict │
│ Current │ │ Workload │ │ Future │
│ Usage │ │ Drivers │ │ Demand │
└─────────────┘ └──────────────┘ └──────────────┘
│ │
│ ▼
┌──────▼──────┐ ┌──────────────┐
│ 6. Iterate │ ◀────────────────────── │ 4. Plan │
│ (quarterly) │ │ Supply │
└─────────────┘ └──────┬───────┘
│
┌──────▼───────┐
│ 5. Execute │
│ (procure, │
│ scale) │
└──────────────┘
Step 1: Measure¶
Instrument everything. At minimum:
- CPU utilization (per-core and aggregate)
- Memory used/available (not just "free")
- Disk IOPS + throughput + latency + space
- Network bandwidth + packet rate + errors
- Application-level: requests/sec, latency p50/p95/p99, error rate
- Queue depths: connection pool, message queues, thread pools
Step 2: Model Workload Drivers¶
Find what drives your resource consumption:
"1 active user = 3 req/s = 0.002 CPU cores = 50MB RAM"
"1 message published = 0.5ms CPU + 4KB disk write + 1KB network"
Step 3: Predict Future Demand¶
Current: 10,000 active users → 30,000 req/s → 20 CPU cores
Growth: +15% users/month
In 6 months: 23,000 users → 69,000 req/s → 46 CPU cores
Add headroom (30%): 60 CPU cores needed
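Steps 2 and 3 chain together as one calculation. A sketch using the driver model above ("1 active user = 3 req/s = 0.002 CPU cores"):

```python
def forecast_cores(users, monthly_growth, months,
                   cores_per_user=0.002, headroom=0.3):
    """Project user growth, convert to CPU via the per-user driver model,
    then add a headroom margin."""
    future_users = users * (1 + monthly_growth) ** months
    cores = future_users * cores_per_user
    return cores * (1 + headroom)

# 10,000 users today, +15%/month, 6-month horizon, 30% headroom
print(round(forecast_cores(10_000, 0.15, 6)))  # -> 60 cores
```

The value of doing it this way is that the driver coefficient (`cores_per_user`) is measurable, so the forecast updates automatically when the product gets more or less efficient.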
Step 4: Plan Supply¶
Match infrastructure to predicted demand. Factor in:

- Lead time for procurement (cloud: minutes; on-prem: weeks to months)
- Cost optimization (reserved instances, committed use discounts)
- Step-function scaling (you can't buy half a server)
Right-Sizing Containers¶
Containers make capacity planning both easier (flexible) and harder (death by a thousand paper cuts).
# Prometheus: Find containers that request 2 CPU but use 0.3
avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/
avg(kube_pod_container_resource_requests{resource="cpu"}) by (pod)
# If this ratio is < 0.3, the container is massively over-provisioned
# Memory: actual vs requested
avg(container_memory_working_set_bytes) by (pod)
/
avg(kube_pod_container_resource_requests{resource="memory"}) by (pod)
Right-sizing rules:

- Requests = p99 of actual usage + 20% headroom
- Limits = requests * 1.5 (or 2x for bursty workloads)
- Review monthly. Usage patterns shift.
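The rules above reduce to a few lines of arithmetic over your usage samples. A sketch (the sample values are invented; in practice you'd feed this the Prometheus query results from above):

```python
def right_size(usage_samples, headroom=0.2, limit_factor=1.5):
    """Return (request, limit): request = p99 of observed usage + 20%,
    limit = request * 1.5, per the rules of thumb above."""
    ordered = sorted(usage_samples)
    # Nearest-rank p99 (simplistic; a real tool would interpolate)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    p99 = ordered[idx]
    request = p99 * (1 + headroom)
    return request, request * limit_factor

# 100 CPU-usage samples (cores): mostly 0.3, with occasional 0.5 spikes
usage = [0.3] * 98 + [0.5, 0.5]
request, limit = right_size(usage)
print(request, limit)  # request ~0.6 cores, limit ~0.9 cores
```

Sizing off the p99 rather than the mean is the point: the mean here is ~0.3 cores, and a request of 0.3 would throttle the pod during every spike.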
The Capacity Planning Spreadsheet¶
Even with fancy tools, a spreadsheet model often communicates best to leadership:
| Resource | Current | Peak | Capacity | Headroom | Exhaust Date |
|---------------|---------|-------|----------|----------|--------------|
| API CPU       | 42 cores | 58 cores | 80 cores | 27% | Aug 2026 |
| DB Memory     | 180 GB   | 210 GB   | 256 GB   | 18% | Jun 2026 |
| Disk (data)   | 3.2 TB   | -        | 5 TB     | 36% | Nov 2026 |
| Disk IOPS     | 12,000   | 18,000   | 25,000   | 28% | Sep 2026 |
| Network (ext) | 2.4 Gbps | 4.1 Gbps | 10 Gbps  | 59% | 2027+    |
Update this quarterly. Present it to leadership. The resource with the earliest exhaust date is your priority.
Interview tip: When asked about capacity planning, lead with the exhaust-date table. Interviewers want to see that you can translate technical metrics into business-relevant timelines. "We'll run out of database memory in June" is actionable; "memory is at 70%" is not.
Key Takeaways¶
- Measure all four dimensions: CPU, memory, disk, network. Your bottleneck will be the one you forgot.
- Plan on peaks and saturation, never averages. Averages hide the pain.
- Compound growth is deceptive — 10% monthly is 3x annually.
- Headroom is not waste. It's your buffer for spikes, failures, and operations.
- N+1 minimum for any service that matters. Prove it by testing a node failure.
- Right-size containers by measuring actual usage, not guessing.
- A simple spreadsheet with exhaust dates communicates better than a Grafana dashboard to the people who approve budgets.
Fun fact: The "USE method" (Utilization, Saturation, Errors) for resource analysis was formalized by performance engineer Brendan Gregg. It provides a systematic checklist: for every resource, check utilization, saturation, and errors. This prevents the common mistake of only checking the resource you suspect while the real bottleneck hides in a dimension you forgot.
Wiki Navigation¶
Prerequisites¶
- SRE Practices (Topic Pack, L2)
- Observability Deep Dive (Topic Pack, L2)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Prometheus
- Alerting Rules (Topic Pack, L2) — Prometheus
- Alerting Rules Drills (Drill, L2) — Prometheus
- Capacity Planning Flashcards (CLI) (flashcard_deck, L1) — Capacity Planning
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Prometheus
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Prometheus
- Change Management (Topic Pack, L1) — SRE Practices
- Datadog Flashcards (CLI) (flashcard_deck, L1) — Prometheus
- Incident Simulator (18 scenarios) (CLI) (Exercise Set, L2) — Prometheus
- Interview: Prometheus Target Down (Scenario, L2) — Prometheus
Pages that link here¶
- Adversarial Interview Gauntlet
- Alerting Rules
- Alerting Rules Drills
- Anti-Primer: Capacity Planning
- Capacity Planning
- Change Management
- Incident Replay: Power Supply Redundancy Lost
- Incident Replay: Rack PDU Overload Alert
- Level 7: SRE & Cloud Operations
- Master Curriculum: 40 Weeks
- Observability Deep Dive
- SRE Practices
- Scenario: Prometheus Says Target Down
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy