
Capacity Planning - Street-Level Ops

Quick Diagnosis Commands

# ── CPU ──
mpstat -P ALL 1 5                       # Per-core CPU utilization (5 samples)
uptime                                  # Load averages (1, 5, 15 min)
sar -u 1 10                             # CPU utilization over 10 seconds
pidstat -u 1 5                          # Per-process CPU usage

# ── Memory ──
free -h                                 # Memory summary (human-readable)
vmstat 1 5                              # Memory, swap, I/O, CPU in 1-sec intervals
head -20 /proc/meminfo                  # Detailed memory breakdown
slabtop -o                              # Kernel slab cache usage

# ── Disk ──
iostat -xz 1 5                          # Disk IOPS, throughput, latency, queue depth
df -h                                   # Filesystem space usage
df -i                                   # Inode usage (can exhaust before space)
lsblk                                   # Block device layout

# ── Network ──
sar -n DEV 1 5                          # Network throughput per interface
ss -s                                   # Socket statistics summary
nstat                                   # Kernel network counters
ip -s link show eth0                    # Interface stats (errors, drops)

# ── Application Level ──
# Prometheus instant queries (via curl or UI)
# Request rate: rate(http_requests_total[5m])
# Latency p99: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
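The rate() expressions above reduce to simple counter arithmetic: per-second increase over the window. A minimal Python sketch with made-up counter samples (values are illustrative, not real metrics):

```python
# Sketch of what PromQL's rate() computes: per-second increase of a
# monotonically increasing counter over a window. Sample values made up.

def counter_rate(start_value, end_value, window_seconds):
    """Per-second rate of increase of a counter between two samples."""
    return (end_value - start_value) / window_seconds

# http_requests_total sampled 5 minutes (300s) apart
total_rate = counter_rate(120_000, 150_000, 300)   # all requests
error_rate = counter_rate(600, 750, 300)           # status=~"5.." only

print(f"request rate: {total_rate:.1f} req/s")         # 100.0 req/s
print(f"error ratio:  {error_rate / total_rate:.2%}")  # 0.50%
```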

Pattern: Prometheus-Based Forecasting with predict_linear

predict_linear is your most useful capacity planning function. It uses linear regression on a time range to predict future values.

# Alert: disk will be full within 7 days
- alert: DiskSpaceExhaustionPredicted
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d],
      7 * 24 * 3600
    ) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill in ~7 days"

# Alert: memory exhaustion within 3 days
- alert: MemoryExhaustionPredicted
  expr: |
    predict_linear(
      node_memory_MemAvailable_bytes[3d],
      3 * 24 * 3600
    ) < 0
  for: 1h
  labels:
    severity: warning

# Dashboard query: when will CPU hit 80%?
# "Days until 80% CPU utilization"
(0.80 - avg(rate(node_cpu_seconds_total{mode!="idle"}[1h])))        # headroom left to 80%
/
deriv(avg(rate(node_cpu_seconds_total{mode!="idle"}[1h]))[7d:1h])   # growth per second
/ 86400                                                             # seconds -> days

Gotcha: predict_linear assumes linear growth. If your growth is exponential, the prediction will be too optimistic (it'll say you have more time than you do).

Debug clue: If predict_linear gives wildly wrong results, check your input range. A 7-day window with a one-time spike (like a backup job) will skew the linear regression. Use a longer window (14-30 days) to smooth out periodic spikes, or pre-filter with avg_over_time before feeding into predict_linear.
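Under the hood, predict_linear is an ordinary least-squares fit over the samples in the window, extrapolated forward. A minimal Python sketch with synthetic disk samples (values are illustrative):

```python
# Minimal sketch of predict_linear: least-squares fit over (timestamp, value)
# samples, then extrapolate. Synthetic data, not real metrics.

def predict_linear(samples, horizon_seconds):
    """samples: list of (t_seconds, value). Returns the predicted value
    at last timestamp + horizon_seconds via ordinary least squares."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_seconds) + intercept

# Available disk bytes shrinking ~10 GB/day over 7 daily samples
day = 86_400
samples = [(i * day, 100e9 - 10e9 * i) for i in range(7)]
predicted = predict_linear(samples, 7 * day)
print(predicted)  # negative -> exhaustion within the horizon, alert fires
```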


Pattern: Load Testing to Establish Capacity Baseline

You can't plan capacity if you don't know your system's limits:

# Step 1: Establish baseline with hey (HTTP load generator)
hey -z 60s -c 50 -q 2 https://api.example.com/health
# 50 concurrent workers at 2 req/s each = ~100 req/s total for 60 seconds
# (note: hey's -q is QPS *per worker*, not total)

# Step 2: Ramp up until latency degrades
for rate in 200 400 600 800 1000 1200; do
  echo "=== Testing $rate req/s ==="
  # -q is per-worker QPS, so divide the target rate across the 100 workers
  hey -z 30s -c 100 -q $((rate / 100)) https://api.example.com/endpoint 2>&1 | \
    grep -E '(Requests/sec|Average|99%|Status)'
  sleep 10  # cool down between runs
done

# Step 3: Record the inflection point
# Example results:
#  400 rps → p99: 45ms  (healthy)
#  600 rps → p99: 52ms  (healthy)
#  800 rps → p99: 120ms (latency climbing)
# 1000 rps → p99: 450ms (saturated)
# 1200 rps → p99: 2.1s  (degraded)
#
# Conclusion: Capacity = ~800 req/s per instance before degradation

Document results in a table:

| Load (rps) | p50 (ms) | p99 (ms) | CPU % | Error Rate |
|------------|----------|----------|-------|------------|
| 200        | 12       | 28       | 15%   | 0%         |
| 400        | 14       | 45       | 30%   | 0%         |
| 600        | 18       | 52       | 48%   | 0%         |
| 800        | 25       | 120      | 65%   | 0.1%       |
| 1000       | 85       | 450      | 82%   | 1.2%       |
| 1200       | 310      | 2100     | 95%   | 8.5%       |
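Picking the inflection point can be scripted. A sketch that selects the highest rate still meeting an assumed SLO (p99 <= 200ms, error rate <= 0.5%; thresholds are illustrative, tune to your own SLO) from the table above:

```python
# Find the capacity knee from load-test results: the highest load that
# still meets an assumed SLO (thresholds below are illustrative).

# (rps, p50_ms, p99_ms, cpu_pct, error_pct) from the table above
results = [
    (200, 12, 28, 15, 0.0),
    (400, 14, 45, 30, 0.0),
    (600, 18, 52, 48, 0.0),
    (800, 25, 120, 65, 0.1),
    (1000, 85, 450, 82, 1.2),
    (1200, 310, 2100, 95, 8.5),
]

P99_SLO_MS = 200      # assumed latency SLO
ERROR_SLO_PCT = 0.5   # assumed error-rate SLO

capacity = max(rps for rps, _, p99, _, err in results
               if p99 <= P99_SLO_MS and err <= ERROR_SLO_PCT)
print(f"capacity per instance: ~{capacity} req/s")  # ~800 req/s
```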


Pattern: Right-Sizing Kubernetes Containers

# Step 1: Get actual CPU usage (last 7 days, 99th percentile)
# PromQL:
max by (pod, container) (
  quantile_over_time(0.99,
    rate(container_cpu_usage_seconds_total{
      namespace="production",
      container!="POD"
    }[5m])[7d:5m]
  )
)

# Step 2: Compare to requests
kube_pod_container_resource_requests{resource="cpu", namespace="production"}

# Step 3: Kubectl for quick spot-check
kubectl top pods -n production --sort-by=cpu
kubectl top pods -n production --sort-by=memory

# Step 4: VPA recommendations (if VPA is installed)
kubectl get vpa -n production -o yaml | grep -A 10 recommendation

Right-sizing formula:

CPU request  = p99 actual usage * 1.2 (20% headroom)
CPU limit    = CPU request * 1.5 (burst allowance)
Mem request  = p99 actual usage * 1.2
Mem limit    = Mem request * 1.3 (less burst needed; OOM is hard failure)
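Applying the formula to a container whose p99 usage is, say, 0.5 cores and 800 MiB (made-up numbers):

```python
# Right-sizing formula from above, applied to made-up p99 usage values.

def right_size(p99_cpu_cores, p99_mem_mib):
    cpu_request = p99_cpu_cores * 1.2   # 20% headroom
    cpu_limit = cpu_request * 1.5       # burst allowance
    mem_request = p99_mem_mib * 1.2
    mem_limit = mem_request * 1.3       # less burst; OOM is a hard failure
    return cpu_request, cpu_limit, mem_request, mem_limit

cpu_req, cpu_lim, mem_req, mem_lim = right_size(0.5, 800)
print(f"cpu: request={cpu_req:.2f} cores, limit={cpu_lim:.2f} cores")
print(f"mem: request={mem_req:.0f} MiB, limit={mem_lim:.0f} MiB")
# cpu: request=0.60, limit=0.90; mem: request=960 MiB, limit=1248 MiB
```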


Gotcha: Monitoring Averages Instead of Percentiles

Your monitoring shows "average response time: 45ms." Looks great. But 5% of requests take 3 seconds, and those users are churning.

# Bad: average
avg(rate(http_request_duration_seconds_sum[5m]))
/
avg(rate(http_request_duration_seconds_count[5m]))

# Good: percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))  # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # p99

Capacity plan on p99, not p50. If p99 is high, you need more capacity even if the average looks fine.

Remember: Averages hide outliers. A service with 45ms average and 3s p99 means 1 in 100 users waits 60x longer than the median user. If each user session makes 20 API calls, the probability of hitting at least one slow request per session is 1 - 0.99^20 = 18%. Nearly one in five users experiences degraded performance that the average never shows.
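The arithmetic behind that claim, as a quick Python check:

```python
# Probability that a 20-call session hits at least one slow request,
# given each request independently has a 1% chance of being slow (p99).

p_slow = 0.01
calls_per_session = 20

p_at_least_one = 1 - (1 - p_slow) ** calls_per_session
print(f"{p_at_least_one:.1%}")  # 18.2%
```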


Pattern: Correlating Business Metrics to Infrastructure

The most useful capacity model links business metrics to resources:

# Step 1: Find the correlation
# Plot "active users" against "CPU cores used" over 30 days
# Usually linear within a range

# Step 2: Establish the ratio
# Example: 1000 active users = 2.3 CPU cores = 4.1 GB RAM

# Step 3: Build the model
# Current: 50,000 users → 115 CPU cores → 205 GB RAM
# Growth: +8% users/month
# In 6 months: 79,000 users → 182 CPU cores → 324 GB RAM

# Step 4: Prometheus recording rules for the ratio
# record: capacity:cpu_per_1k_users
# expr: |
#   sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))
#   /
#   (sum(active_users_gauge) / 1000)
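Steps 2 and 3 as arithmetic, using the example's ratios (illustrative numbers, not universal constants):

```python
# Growth projection using the example's ratios: 1000 active users
# ~ 2.3 CPU cores and 4.1 GB RAM, with 8% user growth per month.

CORES_PER_1K_USERS = 2.3    # from the measured correlation above
RAM_GB_PER_1K_USERS = 4.1

def project(users_now, monthly_growth, months):
    users = users_now * (1 + monthly_growth) ** months
    cores = users / 1000 * CORES_PER_1K_USERS
    ram_gb = users / 1000 * RAM_GB_PER_1K_USERS
    return round(users), round(cores), round(ram_gb)

users, cores, ram = project(50_000, 0.08, 6)
print(f"in 6 months: {users} users -> {cores} cores, {ram} GB RAM")
# in 6 months: 79344 users -> 182 cores, 325 GB RAM
```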

Pattern: Disk Growth Tracking

#!/bin/bash
# /usr/local/bin/disk-tracker.sh, run daily from cron to track disk usage
# (the shebang must be the first line of the file)
DATE=$(date +%Y-%m-%d)
for MOUNT in / /var /data; do
  USED=$(df --output=used -B1 "$MOUNT" | tail -1)
  echo "$DATE,$MOUNT,$USED" >> /var/log/disk-growth.csv
done

# Analyze: calculate daily growth rate
# awk -F',' '$2=="/data" {if(prev) print $1, ($3-prev)/1073741824, "GB/day"; prev=$3}' \
#   /var/log/disk-growth.csv

# Prometheus approach (better):
# -deriv(node_filesystem_avail_bytes{mountpoint="/data"}[7d]) * 86400
# Gives bytes/day growth. Use deriv (not rate): filesystem bytes are a
# gauge, not a counter. Avail shrinks as usage grows, hence the negation.

Gotcha: Ignoring Inode Exhaustion

Disk has 500GB free but you can't create files. Inodes are exhausted — common with millions of small files (mail queues, session files, container layers).

# Check inode usage
df -i

# Find directories with massive file counts
find /var -xdev -type f -printf '%h\n' | sort | uniq -c | sort -rn | head -20

# Prevention: monitor inode usage alongside space
# Prometheus: node_filesystem_files_free / node_filesystem_files * 100

Pattern: Quick Capacity Math

Useful back-of-envelope calculations:

# Network bandwidth
1 Gbps = ~125 MB/s = ~120,000 req/s at 1KB each

# Disk IOPS
HDD: 100-200 random IOPS, 100-200 MB/s sequential
SSD: 10,000-100,000 random IOPS, 500-3,500 MB/s sequential
NVMe: 100,000-1,000,000 IOPS, 3,000-7,000 MB/s sequential

# Memory bandwidth
DDR4: ~25 GB/s per channel, 2-4 channels = 50-100 GB/s

# Connection limits
Linux default: ~28,000 ephemeral ports (tune net.ipv4.ip_local_port_range)
1 TCP connection = ~3.5KB kernel memory (established state)
1M connections = ~3.5 GB kernel memory for sockets alone

# Pod density
Rule of thumb: 30-110 pods per Kubernetes node (depends on CNI + IP range)
Each pod: ~256KB-1MB kubelet overhead

Scale note: The "1M connections = 3.5 GB kernel memory" calculation only covers socket buffers. Each connection also consumes application-level memory (goroutine stacks, request buffers, connection pool entries). Real-world per-connection overhead is typically 10-50 KB depending on the application, making 1M connections cost 10-50 GB of application memory.
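That footprint math as a quick Python check (the 10-50 KB per-connection application overhead is the assumption stated above):

```python
# Memory footprint at 1M connections: kernel socket state plus assumed
# application-level overhead (10-50 KB/conn, workload-dependent).

connections = 1_000_000
kernel_kb_per_conn = 3.5          # established TCP state
app_kb_per_conn_range = (10, 50)  # assumed: stacks, buffers, pool entries

kernel_gb = connections * kernel_kb_per_conn / 1e6   # KB -> GB (decimal)
app_gb_low, app_gb_high = (connections * kb / 1e6
                           for kb in app_kb_per_conn_range)

print(f"kernel sockets: ~{kernel_gb:.1f} GB")                     # ~3.5 GB
print(f"application:    ~{app_gb_low:.0f}-{app_gb_high:.0f} GB")  # ~10-50 GB
```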


Pattern: The Quarterly Capacity Review

Agenda:
1. Current state   — resource utilization summary, hotspots
2. Growth review   — actual growth vs last quarter's prediction
3. Forecast update — next 3/6/12 month projections
4. Bottleneck ID   — which resource hits the wall first?
5. Action items    — procure, scale, optimize, or defer

Inputs needed:
- Prometheus/Grafana dashboards for the quarter
- Business metrics (user growth, transaction volume)
- Cost reports (cloud spend by service)
- Upcoming product launches or campaigns

Output:
- Updated capacity spreadsheet with exhaust dates
- Purchase/scaling requests with justification
- Optimization opportunities (right-sizing, cleanup)