
Capacity Planning: Math Before Midnight


Topics: capacity planning, queueing theory, USE method, Linux performance, cloud cost optimization, Prometheus, Kubernetes resource management, disaster capacity
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: None (everything explained from scratch)


The Mission

It's October 14th. Black Friday is six weeks away. Your product manager just forwarded a message from the VP of Sales: "We're projecting 5x normal traffic this year. Marketing is buying Super Bowl pre-game ads and we've partnered with three influencers."

Your engineering manager's Slack message to you: "Can our infra handle 5x? I need a confidence level by Friday."

You don't have a capacity model. You have dashboards, gut feelings, and a vague memory of someone saying "we're fine until next year" six months ago. You have five days to answer the question with math, not hope.

By the end of this lesson, you'll know how to:

  • Measure what you actually have (the four resources, the USE method)
  • Calculate what you'll need (Little's Law, napkin math, growth modeling)
  • Estimate what it'll cost (cloud right-sizing, reserved vs on-demand)
  • Plan for what goes wrong (N+1, N+2, disaster capacity)
  • Set up monitoring that warns you before the math stops working


Part 1: Know What You Have

Before you can predict the future, you need to measure the present. Every system bottleneck lives in one of four dimensions.

The four resources

Resource What to measure Tool Bottleneck feels like
CPU Utilization %, steal %, load average mpstat -P ALL 1 Slow responses, high latency
Memory Used/available, swap usage, OOM kills free -h, /proc/meminfo OOM kills, swapping
Disk IOPS, throughput (MB/s), latency, queue depth iostat -xz 1 I/O wait, slow writes
Network Bandwidth (Mbps), packets/sec, errors sar -n DEV 1 Timeouts, dropped connections

Run all four right now on one of your production nodes. Takes 30 seconds:

# The 30-second capacity snapshot
mpstat -P ALL 1 3          # CPU per core, 3 samples
free -h                     # Memory at a glance
iostat -xz 1 3             # Disk IOPS, throughput, queue depth
sar -n DEV 1 3             # Network throughput per interface

The sampled commands each give you 3 one-second readings (free -h is a single snapshot). Not enough for capacity planning, but enough to know which dimension to worry about first.

Mental Model: Think of your server as a four-lane highway. CPU is lane 1, memory is lane 2, disk is lane 3, network is lane 4. Traffic flows through all four lanes simultaneously. A traffic jam in any single lane slows down the whole road. Capacity planning means measuring all four lanes, not just the one you think is busy.

The USE method: utilization, saturation, errors

Brendan Gregg formalized the USE method while at Joyent in 2012 and published it in Systems Performance: Enterprise and the Cloud (Prentice Hall, 2013). For every resource, check three things:

Check Question Why it matters
Utilization How busy is it? (% time doing work) High util means less room for spikes
Saturation Is work queueing? Queued work = users waiting
Errors Are things failing? Errors waste capacity and mask real limits

Here's the critical distinction that trips people up:

Server A:  CPU utilization 70%,  run queue depth: 0
           → Working hard, nobody waiting. Fine.

Server B:  CPU utilization 70%,  run queue depth: 14 (on a 4-core box)
           → Same utilization, but 14 processes are waiting.
           → Users feel this as latency spikes.

Same utilization number. Completely different user experience. A highway at 70% average utilization with evenly spaced cars flows smoothly. A highway at 70% average utilization with rush-hour bursts has bumper-to-bumper traffic. Same average, different reality.

Always check saturation alongside utilization. Here's the USE checklist for Linux:

# CPU
mpstat -P ALL 1 5         # Utilization: %usr + %sys per core
vmstat 1 5                # Saturation: 'r' column (run queue depth)
dmesg | grep -i "cpu"     # Errors: throttling, MCE events

# Memory
free -h                   # Utilization: used / total
vmstat 1 5                # Saturation: 'si'/'so' columns (swap in/out)
dmesg | grep -i "oom"     # Errors: OOM killer events

# Disk
iostat -xz 1 5            # Utilization: %util column
iostat -xz 1 5            # Saturation: avgqu-sz (avg queue size)
dmesg | grep -i "error"   # Errors: I/O errors, sector failures

# Network
sar -n DEV 1 5            # Utilization: rxkB/s, txkB/s vs link speed
ss -s                     # Saturation: socket backlog, overflows
ip -s link show eth0      # Errors: RX/TX errors, drops

Trivia: Google's 2015 Borg paper revealed they run clusters at approximately 60% average CPU utilization. The industry average is 15-25%. That gap represents billions of dollars of wasted hardware across the industry. Most companies over-provision by 3-5x because the cost of an outage is perceived as far worse than the cost of waste.


Flashcard check #1

Question Answer
Name the four resource dimensions in capacity planning CPU, memory, disk, network
What does the "S" in USE stand for? Saturation — how much extra work is queued
A server shows 65% CPU utilization and a run queue of 0. Is it saturated? No — work is flowing, nothing is queued
A server shows 65% CPU utilization and a run queue of 18. Is it saturated? Yes — 18 processes are waiting despite the same utilization
Which Linux command shows disk queue depth? iostat -xz (the avgqu-sz column)

Part 2: The Math That Predicts the Future

You've measured the present. Now you need to calculate whether 5x traffic will fit.

Little's Law: the one equation you need

In 1961, John D.C. Little proved a relationship so elegant it works for any stable system — a grocery store checkout line, a web server connection pool, a Kafka consumer group:

L = λ × W

L = average number of items in the system (concurrency)
λ = arrival rate (requests per second)
W = average time each item spends in the system (latency)

That's it. Three variables. If you know two, you can calculate the third.

Name Origin: John Dutton Conant Little was an MIT professor who published his proof in the journal Operations Research in 1961. The theorem is remarkable because it requires almost no assumptions about the distribution of arrival times or service times; it holds for virtually any queueing system in steady state.

Applied to your web service:

Your API currently handles 2,000 requests per second with an average response time of 50ms. How many concurrent connections do you need?

L = λ × W
L = 2,000 req/s × 0.050 s
L = 100 concurrent connections

Your connection pool has 200 connections. So you have 2x headroom at current traffic. What happens at 5x traffic?

L = 10,000 req/s × 0.050 s
L = 500 concurrent connections

Your 200-connection pool is blown. You either need 500 connections — or you need to reduce W (make responses faster), or you need more instances to spread the lambda.

But wait. Under load, latency doesn't stay constant. At 5x traffic, your average response time might climb from 50ms to 200ms because the database is saturated:

L = 10,000 req/s × 0.200 s
L = 2,000 concurrent connections

Now you need 2,000 connections. This is how capacity problems cascade — one dimension (database latency) multiplies the pressure on another (connection pool).

Interview Bridge: Little's Law is a classic systems design interview question. "How would you size a connection pool?" or "How many workers do you need?" always comes back to L = lambda * W. If an interviewer asks about capacity, start here.
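The arithmetic above is trivial to script. A quick Python sketch using this lesson's numbers:

```python
def concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of requests in flight."""
    return arrival_rate_rps * avg_latency_s

print(concurrency(2_000, 0.050))   # 100.0  — current traffic
print(concurrency(10_000, 0.050))  # 500.0  — 5x traffic, latency held constant
print(concurrency(10_000, 0.200))  # 2000.0 — 5x traffic with latency degraded to 200ms
```

The third call is the cascade case: a 4x latency increase multiplies the 5x arrival rate, so concurrency grows 20x.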

The napkin math approach

You don't always need Prometheus. Sometimes you need a number in the next five minutes, during a meeting, on the back of a napkin. Here's how to think in orders of magnitude.

Useful reference numbers to memorize:

Network:
  1 Gbps  ≈ 125 MB/s ≈ 120,000 req/s at 1 KB each

Disk:
  HDD:    100-200 random IOPS,   100-200 MB/s sequential
  SSD:    10,000-100,000 IOPS,   500-3,500 MB/s sequential
  NVMe:   100,000-1,000,000 IOPS, 3,000-7,000 MB/s sequential

Memory:
  DDR4:   ~25 GB/s per channel, 2-4 channels = 50-100 GB/s

Connections:
  1 TCP connection ≈ 3.5 KB kernel memory (established state)
  1M connections ≈ 3.5 GB kernel memory (sockets alone)
  Real per-connection overhead with app state: 10-50 KB each

Kubernetes:
  30-110 pods per node (depends on CNI and IP range)
  Each pod: ~256 KB-1 MB kubelet overhead

Remember: The Rule of 72 — divide 72 by your growth rate percentage to get the doubling time. Growing at 12% per month? 72/12 = 6 months to double. 10% monthly? 72/10 = 7.2 months. This shortcut lets you estimate exhaustion dates in your head during meetings without a calculator.
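The Rule of 72 as a one-liner, with the two examples from the paragraph above (a quick Python sketch):

```python
def doubling_time(growth_rate_pct: float) -> float:
    """Rule of 72: approximate number of periods until a quantity doubles."""
    return 72 / growth_rate_pct

print(doubling_time(12))  # 6.0 — months to double at 12% monthly growth
print(doubling_time(10))  # 7.2 — months to double at 10% monthly growth
```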

Napkin math walkthrough for the Black Friday problem:

Current state:
  Peak traffic: 2,000 req/s
  API servers: 5 instances, each handles 800 req/s before degradation
  Total capacity: 5 × 800 = 4,000 req/s
  Current headroom: 4,000 / 2,000 = 2x

Black Friday projection (5x):
  Expected traffic: 10,000 req/s
  Current capacity: 4,000 req/s
  Shortfall: 10,000 - 4,000 = 6,000 req/s
  Additional instances needed: 6,000 / 800 = 7.5 → round up to 8
  Total instances needed: 5 + 8 = 13

But wait — what about N+1?
  With 13 instances, losing 1 leaves 12:
  12 × 800 = 9,600 req/s < 10,000 req/s  ← not safe
  Need 14 instances for N+1 safety at 5x traffic.

That's your napkin answer: 14 instances. Did it in 90 seconds. Now you can have an informed conversation about cost and timeline before running a formal model.
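The napkin walkthrough reduces to a ceiling division plus spares. A quick Python sketch with the walkthrough's numbers:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float, spares: int = 1) -> int:
    """Smallest fleet that still covers peak after losing `spares` instances."""
    return math.ceil(peak_rps / per_instance_rps) + spares

# Black Friday: 10,000 req/s peak, 800 req/s per instance
print(instances_needed(10_000, 800))            # 14 — with N+1 safety
print(instances_needed(10_000, 800, spares=2))  # 15 — with N+2 safety
```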


Queue theory: why things get worse fast

There's a reason systems feel fine at 60% and terrible at 85%. Queueing theory explains why.

For a system with random arrivals (like web traffic), the average queue length follows this curve:

Average queue length vs utilization (M/M/1 queue):

Queue     ┃
Length    ┃                                    ╱
          ┃                                  ╱
  20      ┃                                ╱
          ┃                              ╱
  15      ┃                            ╱
          ┃                          ╱
  10      ┃                        ╱
          ┃                     ╱
   5      ┃                  ╱
          ┃             ╱╱
   2      ┃         ╱╱
   1      ┃    ╱╱╱
          ┃╱╱╱
          ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
          0%  20%  40%  60%  80%  90% 100%
                     Utilization

Formula:  queue_length = utilization / (1 - utilization)

Utilization Average queue length Relative wait
50% 1.0 baseline
70% 2.3 2.3x
80% 4.0 4.0x
85% 5.7 5.7x
90% 9.0 9.0x
95% 19.0 19.0x

This is why the "80% rule" exists for capacity planning. The jump from 80% to 90% utilization doesn't add 12% more queue — it adds 125% more queue (from 4.0 to 9.0). The system doesn't degrade linearly. It hits a cliff.

Gotcha: This model (M/M/1) assumes random arrivals and a single server. Real web traffic is burstier than random, and real systems have multiple servers. Bursty traffic makes the curve worse (queue builds faster). Multiple servers help (each one operates at lower utilization). The shape of the curve is universal — the exact numbers depend on your workload.
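The curve's numbers are a one-line formula; a quick Python sketch that reproduces the table above:

```python
def mm1_avg_queue(utilization: float) -> float:
    """Average number queued in the M/M/1 model: u / (1 - u)."""
    return utilization / (1.0 - utilization)

for u in (0.50, 0.70, 0.80, 0.85, 0.90, 0.95):
    print(f"{u:.0%} utilized -> average queue length {mm1_avg_queue(u):.1f}")
```

Note how the function blows up as utilization approaches 1.0: that asymptote is the cliff.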

Growth modeling: linear, exponential, step functions

Not all growth looks the same, and using the wrong model will wreck your forecast.

LINEAR:        Usage grows by a fixed amount each period.
               "We add 500 GB of data per month."
               Predictable, easy to plan. predict_linear() works well.

EXPONENTIAL:   Usage grows by a percentage each period.
               "User base grows 12% monthly."
               Deceptive: 12% monthly = 2x in 6 months = 4x in a year.
               predict_linear() will be too optimistic.

STEP FUNCTION: Usage jumps when events happen.
               "New customer onboards 50,000 users on day one."
               Can't predict with time-series analysis.
               Requires business intel (sales pipeline, launch dates).

Compound growth math for your Black Friday planning:

Current peak:     2,000 req/s
Monthly growth:   10%
Months to Black Friday: 1.5

Month 1:   2,000 × 1.10 = 2,200 req/s (organic growth)
Month 1.5: 2,200 × 1.05 = 2,310 req/s (organic at Black Friday)
5x spike:  2,310 × 5    = 11,550 req/s (Black Friday peak)

Notice organic growth added 310 req/s to your baseline. At 5x, that's an extra 1,550 req/s you'd miss if you planned 5x on today's numbers instead of Black Friday's.
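The compound-growth projection as a quick Python sketch. The walkthrough above approximates the half month as a flat 5%, so exact compounding lands slightly lower (~11,537 vs 11,550):

```python
def projected_peak(current_rps: float, monthly_growth: float, months: float,
                   event_multiplier: float = 1.0) -> float:
    """Compound organic growth for `months`, then apply the event spike on top."""
    return current_rps * (1 + monthly_growth) ** months * event_multiplier

# 2,000 req/s, 10%/month organic growth, 1.5 months out, 5x Black Friday spike
print(round(projected_peak(2_000, 0.10, 1.5, event_multiplier=5)))  # ~11,537
```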


Flashcard check #2

Question Answer
State Little's Law L = lambda * W (concurrency = arrival rate * average time in system)
Your API handles 5,000 req/s at 80ms avg latency. How many concurrent connections? L = 5,000 * 0.08 = 400 concurrent connections
Using the Rule of 72: how long to double at 8% monthly growth? 72 / 8 = 9 months
At 90% utilization with random arrivals, what's the average queue length? 9.0 (from utilization / (1 - utilization))
Why does predict_linear() lie for exponential growth? It fits a straight line to a curve — it will underestimate future usage, predicting you have more time than you do

Part 3: What Will It Cost?

You've done the math. You need 14 instances instead of 5. Your manager's next question: "How much?"

Cloud right-sizing: the request/limit gap

In Kubernetes, you pay for what you request, not what you use. This is the single biggest cost optimization opportunity.

[Actually Used: 100m CPU]  [Requested: 500m CPU]  [Limit: 1000m CPU]
     |----- waste ------|

You're paying for 500m. You're using 100m. That's 80% waste.

Before adding 9 new instances, check if your existing ones are over-provisioned:

# Prometheus: Find the request/usage ratio
avg(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
avg(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

# If this ratio is < 0.3, containers are massively over-provisioned.
# Right-size first, THEN add instances.

Right-sizing formula:

CPU request  = p99 actual usage × 1.2  (20% headroom)
CPU limit    = CPU request × 1.5       (burst allowance)
Mem request  = p99 actual usage × 1.2
Mem limit    = Mem request × 1.3       (less burst — OOM is a hard kill)

Gotcha: Kubernetes CPU throttling is often worse operationally than OOM kills. Exceeding a CPU limit causes throttling — your process runs but unpredictably slowly, causing latency spikes that are hard to diagnose. Exceeding a memory limit causes an OOM kill — loud, obvious, restarts cleanly. Many SRE teams now set CPU requests but remove CPU limits entirely. This is a deliberate tradeoff, not an oversight.
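The right-sizing rules above, applied to a hypothetical pod (the p99 inputs are illustrative, not from the lesson). A quick Python sketch:

```python
def right_size(p99_cpu_m: float, p99_mem_mib: float) -> dict:
    """Lesson's rules: request = p99 * 1.2; CPU limit *1.5, memory limit *1.3."""
    cpu_req = p99_cpu_m * 1.2
    mem_req = p99_mem_mib * 1.2
    return {
        "cpu_request_m": round(cpu_req),
        "cpu_limit_m": round(cpu_req * 1.5),    # generous burst allowance
        "mem_request_mib": round(mem_req),
        "mem_limit_mib": round(mem_req * 1.3),  # tighter: OOM is a hard kill
    }

# Hypothetical pod: p99 usage of 100m CPU and 400 MiB memory
print(right_size(100, 400))  # cpu 120m request / 180m limit, mem 480 / 624 MiB
```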

Worked cost example

Let's price out the Black Friday capacity on AWS:

Current:  5 × m6i.xlarge (4 vCPU, 16 GB) = $0.192/hr × 5 = $0.96/hr
          Monthly: $0.96 × 730 = $700/month

Black Friday target: 14 × m6i.xlarge
  Option A: All on-demand
    $0.192 × 14 = $2.69/hr → $1,963/month
    But you only need 14 for ~2 weeks around Black Friday.
    Extra 9 instances for 14 days: $0.192 × 9 × 24 × 14 = $580

  Option B: Right-size existing + burst with spot
    Right-size reveals you can handle 1,000 req/s per instance (not 800)
    Now need 11 instances, not 14
    Run 5 on-demand (base) + 6 spot instances for burst
    Spot price for m6i.xlarge: ~$0.058/hr (70% discount)
    Extra 6 spot for 14 days: $0.058 × 6 × 24 × 14 = $117

  Option C: Reserved instances for base + on-demand for burst
    5 reserved (1-year): $0.121/hr × 5 = $441/month (saves $259/month base)
    Extra 9 on-demand for 14 days: $580

Option Monthly base Black Friday extra Annual total
A: All on-demand $700 $580 $8,980
B: Right-size + spot $700 $117 $8,517
C: Reserved + on-demand burst $441 $580 $5,872
D: Reserved + right-size + spot $441 $117 $5,409

Option D saves $3,571/year over option A — and it starts with right-sizing, which is free.
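The cost model is simple enough to script so you can replay it with your own rates. A quick Python sketch; exact compounding differs slightly from the table, which rounds monthly figures ($700 and $441) before multiplying:

```python
HOURS_PER_MONTH = 730

def annual_cost(base_rate: float, base_n: int,
                burst_rate: float, burst_n: int, burst_days: int) -> float:
    """Year-round base fleet plus a short burst fleet around the event."""
    return (base_rate * base_n * HOURS_PER_MONTH * 12
            + burst_rate * burst_n * 24 * burst_days)

# Option A: all on-demand at $0.192/hr, 9 extra instances for 14 days
print(round(annual_cost(0.192, 5, 0.192, 9, 14)))  # 8990 (table rounds to $8,980)
# Option D: reserved base at $0.121/hr + 6 spot at $0.058/hr for the burst
print(round(annual_cost(0.121, 5, 0.058, 6, 14)))  # 5417 (table rounds to $5,409)
```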

Trivia: Netflix practices "right-sizing" — running services as close to actual resource needs as possible. Combined with Chaos Engineering (randomly killing instances), they ensure systems handle failure without maintaining large idle buffers. Most companies don't have Netflix's confidence, which is why they over-provision by 3-5x.


Part 4: What About Data Growth?

Here's where capacity planning gets people. You planned for traffic. You forgot about data.

War Story: A mid-size e-commerce platform sized their infrastructure perfectly for Black Friday 2023 traffic. 5x the web servers, 3x the cache layer, load-tested everything. Black Friday came, traffic surged, the app tier handled it beautifully. Then the order processing pipeline ground to a halt. The issue: every order generated 47 database writes across 8 tables (order, line items, inventory, payment, shipping, audit log, analytics events, search index updates). At 5x orders, the database write throughput hit the IOPS ceiling of their EBS volumes. They had planned for 5x reads (which the cache handled) but not for 5x writes (which bypassed the cache entirely). The fix was emergency IOPS provisioning at 3 AM — at premium pricing. The lesson: traffic capacity and data capacity are different problems. Plan both.

Disk: the dimension everyone forgets

Teams obsessively monitor CPU and memory. Disk I/O is the most commonly overlooked capacity bottleneck. A 2022 survey of production incidents found that 23% of performance-related outages were caused by disk I/O saturation — from logging, temp files, or database WAL.

# The disk capacity quad: space, IOPS, throughput, latency
# All four matter. Teams usually only track space.

iostat -xz 1 5
#
# Key columns:
# r/s, w/s     → read/write IOPS
# rMB/s, wMB/s → read/write throughput
# await        → average I/O latency (ms)
# avgqu-sz     → average queue depth (saturation!)
# %util        → utilization

Gotcha: AWS EBS gp3 volumes provide 3,000 baseline IOPS regardless of size. But gp2 volumes provide 3 IOPS per GB — a 100 GB gp2 volume only gets 300 IOPS. If you migrated from gp2 to gp3 without checking provisioned IOPS, you might have accidentally improved things (3,000 vs 300) or lost burst credits that gp2 provided. Always verify with aws ec2 describe-volumes.

Data growth forecasting

# Prometheus: disk growth rate in bytes per day
# (deriv() fits a gauge's slope; rate() is for counters, and
#  node_filesystem_size_bytes is the constant total, not usage)
-deriv(node_filesystem_avail_bytes{mountpoint="/data"}[7d]) * 86400

# predict_linear: when will this disk be full?
predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[7d], 30*24*3600) < 0
# Returns negative if the disk fills within 30 days

But predict_linear assumes linear growth. E-commerce data growth is bursty — flat weekdays, spikes on weekends, massive during promotions. For Black Friday specifically:

Current data growth:  5 GB/day (normal)
Black Friday week:    5 GB/day × 5 (traffic multiplier) = 25 GB/day × 7 = 175 GB
Pre-BF sales week:    5 GB/day × 2 = 10 GB/day × 7 = 70 GB

Total extra disk needed for the 2-week window: ~245 GB
Current free space: 800 GB

Looks fine... until you account for the database:
  WAL (write-ahead log) during peak: 3x normal size
  Temporary sort/join files during analytics queries: up to 50 GB
  Vacuum/compaction needing 20% free space to operate

Actual free space needed: 245 GB (data) + 150 GB (WAL) + 50 GB (temp) + 200 GB (vacuum)
                        = 645 GB of the 800 GB free
                        = 80% utilization → entering the danger zone

That "800 GB free" felt comfortable until you modeled the actual workload. This is why napkin math matters — it surfaces problems you'd never find by staring at a dashboard.
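The disk budget above is a sum of line items; a quick Python sketch so you can swap in your own numbers:

```python
def disk_budget_gb(daily_gb: float, event_mult: float, event_days: int,
                   ramp_mult: float, ramp_days: int,
                   wal_gb: float, temp_gb: float, vacuum_gb: float) -> float:
    """Sum the worked example's line items for the two-week event window."""
    data = daily_gb * event_mult * event_days + daily_gb * ramp_mult * ramp_days
    return data + wal_gb + temp_gb + vacuum_gb

# 5 GB/day baseline, 5x for BF week, 2x for the ramp week, plus DB overheads
needed = disk_budget_gb(5, 5, 7, 2, 7, wal_gb=150, temp_gb=50, vacuum_gb=200)
print(f"{needed:.0f} GB needed = {needed / 800:.0%} of the 800 GB free")  # 645 GB, ~81%
```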


Flashcard check #3

Question Answer
In Kubernetes, do you pay for resource requests or resource usage? Requests — over-requesting means paying for idle resources
What's the right-sizing formula for CPU requests? p99 actual usage * 1.2 (20% headroom)
Why is CPU throttling in Kubernetes sometimes worse than OOM kills? Throttling causes unpredictable latency (hard to diagnose). OOM is loud, obvious, and restarts cleanly.
Name the four disk metrics (not just space) Space, IOPS, throughput (MB/s), latency
Why does predict_linear underestimate Black Friday data growth? It assumes linear growth but event-driven data is bursty — flat normally, massive during promotions

Part 5: Planning for Failure

You've sized for 5x traffic. You've accounted for data growth. Now: what happens when something breaks during the 5x surge?

N+1 and N+2: disaster capacity

N+1: The system survives the loss of one component at peak load.
N+2: The system survives the loss of two components at peak load.

Applied to your Black Friday cluster of 14 instances (each handling 800 req/s):

Total capacity: 14 × 800 = 11,200 req/s
Black Friday peak: 11,550 req/s

Wait. 11,200 < 11,550. You're ALREADY under capacity before anything breaks.

With N+1 (lose 1 instance):
  13 × 800 = 10,400 req/s → 1,150 req/s shortfall → degraded

With N+2 (lose 2 instances):
  12 × 800 = 9,600 req/s → 1,950 req/s shortfall → outage

Fix: 15 instances gives 12,000 req/s total, but N+1 leaves
  14 × 800 = 11,200 req/s < 11,550 req/s → still short.
Better: 16 instances.
  N+1: 15 × 800 = 12,000 req/s → covers the peak.
  N+2: 14 × 800 = 11,200 req/s → degraded; go to 17 if N+2 matters.

This is the headroom calculation that turns "we should be fine" into "we will be fine."
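The whole N+1/N+2 check is one inequality. A quick Python sketch with this walkthrough's numbers:

```python
def survives(n: int, per_instance_rps: float, peak_rps: float, failures: int) -> bool:
    """True if the fleet still covers peak after losing `failures` instances."""
    return (n - failures) * per_instance_rps >= peak_rps

peak = 11_550
print(survives(14, 800, peak, 1))  # False — short even before any failure
print(survives(16, 800, peak, 1))  # True  — 15 x 800 = 12,000 covers the peak
print(survives(16, 800, peak, 2))  # False — 14 x 800 = 11,200 falls short
```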

Headroom targets by resource

Resource Target max utilization at peak Why that number
CPU 60-70% Burst headroom, GC pauses, deployment overhead
Memory 70-80% Page cache, fork overhead, safety margin
Disk 70-75% Compaction, log spikes, recovery space
Network 50-60% Retransmissions, burst absorption

These are starting points. Bursty workloads need more headroom. Steady workloads can run tighter. The queueing theory curve from Part 2 explains why: above 80%, queue length explodes.

Gotcha: Your cluster runs at 85% capacity during normal peak. A node fails, traffic redistributes to survivors, and they hit 100%. Now you have a cascading failure — high load causes health check timeouts, which causes more evictions, which causes more redistribution. This is how one failed node takes down an entire cluster. N+1 planning prevents this cascade.
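The redistribution math behind that cascade, as a quick Python sketch (the 10-node cluster is a hypothetical, not from the lesson's scenario):

```python
def post_failure_util(nodes: int, utilization: float, failed: int) -> float:
    """Utilization on the survivors after load redistributes evenly."""
    return utilization * nodes / (nodes - failed)

# A 10-node cluster at 85%: one failure pushes survivors to ~94%,
# deep into the region where the queueing curve explodes.
print(f"{post_failure_util(10, 0.85, 1):.0%}")  # 94%
```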

Auto-scaling: not a capacity plan

"We're on AWS, we can just scale." This is the most dangerous sentence in capacity planning.

Auto-scaling limitations:

  • Lag: 3-7 minutes from metric trigger to running instance. At 5x traffic, that's 3-7 minutes of degradation.
  • Service quotas: AWS has account-level limits on EC2 instances, vCPUs, and IPs. If you've never requested an increase, you might hit a hard ceiling at 2x.
  • Databases can't auto-scale: your app tier scales, traffic hits the database at 5x, and the database becomes the bottleneck no amount of app tier scaling can fix.
  • Cold caches: new instances start with empty caches. Their first requests hit the database directly, making the database problem worse.
  • Step-function pricing: you can't spin up half a node. You either have 10 instances or 11.

Mental Model: Auto-scaling is a seatbelt, not a steering wheel. It protects you from unexpected bumps, but it can't substitute for knowing where you're going. Use it for organic variance (+/- 20%). For known events (Black Friday, product launches), pre-scale.

# Prometheus alert: auto-scaling approaching service quota limits
- alert: EC2InstanceQuotaApproaching
  expr: |
    aws_servicequotas_value{quota_name="Running On-Demand Standard instances"}
    * 0.8
    <
    count(up{job="ec2-node-exporter"})
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "EC2 instance count approaching service quota limit"

Part 6: Monitoring That Warns You First

The math is done. Now wire it into monitoring so you don't have to redo it by hand every week.

Prometheus recording rules for capacity

Recording rules pre-compute expensive queries so your dashboards stay fast and your alerts stay reliable.

groups:
  - name: capacity_planning
    interval: 5m
    rules:
      # CPU headroom: how much room before the 80% utilization budget?
      # (utilization = 1 - idle fraction; averaging non-idle modes undercounts)
      - record: capacity:cpu_headroom_ratio
        expr: |
          1 - (
            1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
          ) / 0.80

      # Memory headroom
      - record: capacity:memory_headroom_ratio
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          / 0.20

      # Disk exhaustion prediction (days until full)
      # deriv() fits a gauge's slope; rate() is for counters and misreads a shrinking gauge
      - record: capacity:disk_days_remaining
        expr: |
          node_filesystem_avail_bytes{fstype=~"ext4|xfs"}
          / (
            -deriv(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d])
            * 86400
          )

      # Requests per CPU core (efficiency metric)
      - record: capacity:requests_per_cpu_core
        expr: |
          sum(rate(http_requests_total[5m]))
          /
          sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))

Predictive alerts

# Alert: disk will fill within 7 days
- alert: DiskSpaceExhaustionPredicted
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d],
      7 * 24 * 3600
    ) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill in ~7 days"

# Alert: memory exhaustion within 3 days
- alert: MemoryExhaustionPredicted
  expr: |
    predict_linear(
      node_memory_MemAvailable_bytes[3d],
      3 * 24 * 3600
    ) < 0
  for: 1h
  labels:
    severity: warning

Gotcha: predict_linear assumes linear growth. If your data grows exponentially, the prediction will be too optimistic — it'll tell you that you have more time than you actually do. For exponential growth, use the Rule of 72 as a sanity check against what predict_linear reports.

The capacity review cadence

Weekly:   Glance at headroom dashboards. Any resource below 30% headroom?
Monthly:  Compare actual growth to last month's projection. Recalibrate.
Quarterly: Full capacity review with leadership. Present the exhaust-date table.
Pre-event: 6-8 weeks before any known traffic event. Run load tests.
Post-event: Within 1 week. What was actual vs projected? Update the model.

The quarterly review output should be a simple table that anyone can read:

| Resource      | Current Peak | Capacity | Headroom | Exhaust Date |
|---------------|-------------|----------|----------|--------------|
| API CPU       | 58 cores    | 80 cores | 27%      | Aug 2026     |
| DB Memory     | 210 GB      | 256 GB   | 18%      | Jun 2026     |
| Disk (data)   | 3.2 TB      | 5 TB     | 36%      | Nov 2026     |
| Disk IOPS     | 18,000      | 25,000   | 28%      | Sep 2026     |
| Network (ext) | 4.1 Gbps    | 10 Gbps  | 59%      | 2027+        |

The resource with the earliest exhaust date is your priority. In this table: DB memory in June. That's your next action item.

Interview Bridge: When asked about capacity planning in an interview, lead with the exhaust-date table. Interviewers want to see that you translate technical metrics into business-relevant timelines. "We'll run out of database memory in June" is actionable; "memory is at 70%" is not.
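The exhaust-date column falls out of the headroom and growth numbers; a quick Python sketch (the ~6 GB/month growth figure is an assumed illustration, not taken from the table):

```python
def months_to_exhaustion(capacity: float, current_peak: float,
                         monthly_growth: float) -> float:
    """Linear model: months until current_peak grows to capacity."""
    return (capacity - current_peak) / monthly_growth

# DB memory row from the table: 256 GB capacity, 210 GB current peak
print(round(months_to_exhaustion(256, 210, 6), 1))  # 7.7 months
```

Run it per resource, sort ascending, and the top row is your next action item.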


Flashcard check #4

Question Answer
What does N+1 mean in disaster capacity? The system survives the loss of one component at peak load
Why is auto-scaling not a substitute for capacity planning? It has 3-7 minute lag, hits service quotas, can't scale databases, and starts with cold caches
What Prometheus function predicts when a metric will cross a threshold? predict_linear()
In a quarterly capacity review, what's the most important number? The exhaust date — which resource runs out first and when
What's the recommended max CPU utilization for capacity headroom? 60-70% (to absorb bursts, GC pauses, and deployment overhead)

Exercises

Exercise 1: The napkin (2 minutes)

Your service handles 3,000 req/s at 40ms average latency. Each instance handles 600 req/s. You have 8 instances.

Calculate:

  1. Current concurrent connections (Little's Law)
  2. Total capacity
  3. Current headroom (multiple)
  4. Instances needed for 4x traffic with N+1 safety

Solution
1. L = 3,000 × 0.040 = 120 concurrent connections
2. Total capacity = 8 × 600 = 4,800 req/s
3. Headroom = 4,800 / 3,000 = 1.6x
4. 4x traffic = 12,000 req/s
   Instances needed = 12,000 / 600 = 20
   N+1 = 21 instances (losing 1 leaves 20 × 600 = 12,000 — just barely enough)
   Safer: 22 instances (losing 1 leaves 21 × 600 = 12,600 — comfortable)

Exercise 2: The queueing cliff (5 minutes)

Using the formula queue_length = utilization / (1 - utilization):

  1. Calculate queue length at 50%, 75%, 85%, 90%, and 95% utilization
  2. Plot them (mentally or on paper)
  3. At what utilization does queue length exceed 10?
  4. If your SLA requires average queue length under 5, what's your maximum utilization?
Solution
50%: 0.50 / 0.50 = 1.0
75%: 0.75 / 0.25 = 3.0
85%: 0.85 / 0.15 = 5.7
90%: 0.90 / 0.10 = 9.0
95%: 0.95 / 0.05 = 19.0

Queue exceeds 10 between 90% and 91% utilization.
  91%: 0.91 / 0.09 = 10.1 ← there it is.

For queue < 5: solve util / (1 - util) < 5
  util < 5 - 5*util → 6*util < 5 → util < 0.833
  Maximum utilization: ~83%

Exercise 3: The cost comparison (10 minutes)

Your team runs 20 m6i.2xlarge instances (8 vCPU, 32 GB) on AWS on-demand at $0.384/hr. Prometheus shows average CPU usage is 25% and average memory usage is 40%.

  1. What are you spending monthly?
  2. If you right-size to actual usage (p99 + 20% headroom), what instance type fits?
  3. What's the savings from right-sizing alone?
  4. What additional savings come from 1-year reserved pricing?
Solution
1. $0.384 × 20 × 730 hours = $5,606/month

2. Actual usage:
   CPU: 25% of 8 vCPU = 2 vCPU → with 20% headroom = 2.4 vCPU → 4 vCPU instance
   Memory: 40% of 32 GB = 12.8 GB → with 20% headroom = 15.4 GB → 16 GB instance
   Fits: m6i.xlarge (4 vCPU, 16 GB) at $0.192/hr

3. Right-sized: $0.192 × 20 × 730 = $2,803/month
   Savings: $5,606 - $2,803 = $2,803/month (50% reduction)

4. m6i.xlarge 1-year reserved: ~$0.121/hr
   Reserved: $0.121 × 20 × 730 = $1,767/month
   Total savings vs original: $5,606 - $1,767 = $3,839/month = $46,068/year

Exercise 4: The Black Friday plan (15 minutes)

Write a one-page capacity plan for the following system:

  • API tier: 10 pods, 500m CPU request, 1Gi memory request
  • Database: RDS db.r6g.xlarge (4 vCPU, 32 GB), 3000 provisioned IOPS
  • Cache: ElastiCache r6g.large (2 vCPU, 13 GB)
  • Current peak: 5,000 req/s at 60ms average latency
  • Projected Black Friday: 8x traffic
  • Monthly growth: 6%

Your plan should include: instances needed (with N+1), connection pool sizing (Little's Law), database IOPS needed, estimated cost, and the single biggest risk.

Hints:

  • Start with Little's Law for connection pool sizing at 8x
  • Don't forget that latency increases under load (estimate 2x-3x)
  • Database IOPS scale roughly linearly with write traffic
  • The biggest risk is almost always the database tier
  • Use the Rule of 72 for organic growth by Black Friday

Cheat Sheet

Concept Formula / Rule Quick reference
Little's Law L = lambda * W Concurrency = arrival rate * avg latency
Rule of 72 72 / growth_rate% = doubling time 12% monthly → doubles in 6 months
Queue length util / (1 - util) At 90% util → queue of 9
N+1 capacity (N-1) * per_instance_cap > peak Losing 1 instance must still serve peak
Right-sizing p99_usage * 1.2 = request 20% headroom on 99th percentile
Exhaustion date (capacity - peak) / monthly_growth Months until resource runs out
Headroom CPU 60-70%, Mem 70-80%, Disk 70-75%, Net 50-60% Max utilization targets at peak
USE method Utilization, Saturation, Errors Check all three for every resource

Key Prometheus queries:

Need Query
Disk fill prediction predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
CPU headroom avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) — idle fraction averaged across cores
Container right-sizing ratio actual_cpu / requested_cpu — below 0.3 means over-provisioned
Request rate rate(http_requests_total[5m])

Napkin reference numbers:

Resource Ballpark
1 Gbps ~125 MB/s, ~120K req/s at 1 KB
SSD 10K-100K random IOPS
NVMe 100K-1M random IOPS
1 TCP connection ~3.5 KB kernel memory
1M connections ~3.5 GB kernel + 10-50 GB app memory

Takeaways

  1. Measure all four dimensions. CPU, memory, disk, network. Your bottleneck will be the one you forgot to check — and it's probably disk IOPS.

  2. Plan on peaks and saturation, never averages. A system at 40% average utilization can be saturated during bursts. Use percentiles, not means.

  3. Little's Law is the Swiss army knife. L = lambda * W connects concurrency, throughput, and latency. Use it for connection pools, thread pools, queue sizing — anything with arrivals and departures.

  4. The queueing cliff is real. Systems don't degrade linearly. Going from 80% to 90% utilization more than doubles your queue length. The 80% rule exists for a reason.

  5. Traffic capacity and data capacity are different problems. You can scale your app tier to 10x and still hit the IOPS ceiling on your database volumes. Plan both.

  6. Auto-scaling is a seatbelt, not a steering wheel. Use it for organic variance. For known events, pre-scale. Know your service quotas.