
Decision Tree: Scale Up or Optimize First?

Category: Operational Decisions
Starting Question: "The system is struggling under load — should I scale up or optimize?"
Estimated traversal: 3-5 minutes
Domains: performance, capacity-planning, SRE, infrastructure, profiling


The Tree

The system is struggling under load — scale up or optimize first?
├── [Check 1] Is there active production impact RIGHT NOW?
│   │         (users experiencing errors, latency > SLO, revenue path degraded)
│   │
│   ├── YES — production is impacted
│   │   ├── → ✅ SCALE IMMEDIATELY — do not wait to optimize
│   │   │   Scaling is reversible; user impact is not
│   │   │
│   │   └── [Check 2] CAN you scale? (autoscaler headroom, node capacity, quota)
│   │       ├── YES → scale now, profile and optimize in parallel (see terminal action)
│   │       └── NO (quota exhausted, no nodes available, DB can't scale horizontally)
│   │           ├── [Check 3] What is the bottleneck?
│   │           │   ├── CPU → can you shed load? (circuit breaker, rate limit, queue)
│   │           │   ├── Memory → can you reduce object retention? (GC tuning, buffer sizes)
│   │           │   ├── DB connections → can you add a connection pooler (PgBouncer)?
│   │           │   ├── I/O → can you add a read replica or cache layer?
│   │           │   └── Network → check for N+1 queries or chatty protocols
│   │           └── → ✅ EMERGENCY OPTIMIZATION (targeted, fastest path to relief)
│   │
│   └── NO — system is struggling but not yet impacted
│       (latency trending up, error budget burn accelerating, capacity alert)
│       │
│       ├── [Check 4] What is the bottleneck resource?
│       │   │
│       │   ├── CPU bound (CPU > 70% sustained, CPU throttling in cgroups)
│       │   │   ├── [Check 5] Is this application code or infrastructure?
│       │   │   │   ├── Application code (profiler shows hot path, inefficient algorithm)
│       │   │   │   │   ├── [Check 6] Is the fix scoped and estimated < 3 days of work?
│       │   │   │   │   │   ├── YES → ✅ PROFILE AND FIX (optimize first)
│       │   │   │   │   │   └── NO (requires architectural change, months of work)
│       │   │   │   │   │       └── → ✅ SCALE NOW, put optimization on roadmap
│       │   │   │   └── Infrastructure (serialization overhead, GC pressure, syscall volume)
│       │   │   │       └── → ✅ SCALE + INVESTIGATE (infrastructure tuning takes time)
│       │   │
│       │   ├── Memory bound (OOM kills, swap usage, GC frequency high)
│       │   │   ├── [Check 7] Is there a memory leak? (memory grows unbounded over time)
│       │   │   │   ├── YES (heap dumps show growing retention) → ✅ FIX THE LEAK first
│       │   │   │   │   Scaling just delays the OOM; the leak will fill any amount of RAM
│       │   │   │   └── NO (steady-state usage is just high)
│       │   │   │       └── → ✅ SCALE MEMORY + set resource limits + alert on trend
│       │   │
│       │   ├── I/O bound (disk or network saturated, high I/O wait, slow DB queries)
│       │   │   ├── [Check 8] Is there a specific query or code path consuming > 80% of I/O?
│       │   │   │   ├── YES (single query, hot table, missing index identified)
│       │   │   │   │   └── → ✅ OPTIMIZE FIRST (targeted fix, high ROI, faster than scaling DB)
│       │   │   │   └── NO (I/O spread across many operations)
│       │   │   │       ├── [Check 9] Is load temporary or permanent?
│       │   │   │       │   ├── Temporary spike (event, batch, seasonal)
│       │   │   │       │   │   └── → ✅ SCALE for spike + optimize during quiet period
│       │   │   │       │   └── Permanent growth (organic traffic increase)
│       │   │   │       │       └── → ✅ SCALE + CACHE HOT DATA + alert on disk trend
│       │   │
│       │   └── Lock / contention bound (DB deadlocks, mutex contention, queue saturation)
│       │       ├── [Check 10] Is the contention in application code or DB schema?
│       │       │   ├── Application code (single mutex, synchronous queue)
│       │       │   │   └── → ✅ OPTIMIZE FIRST — horizontal scaling doesn't help contention
│       │       │   └── DB schema (hot row, table-level lock, missing index on join column)
│       │       │       └── → ✅ DB QUERY OPTIMIZATION before scaling DB
│       │
│       └── [Check 11] Has optimization been attempted before?
│           ├── YES (profiling done, quick wins taken, known architecture limits)
│           │   └── → ✅ SCALE — you've hit the optimization ceiling for now
│           └── NO (no profiling data, operating on assumptions)
│               └── → ✅ PROFILE FIRST (don't scale what you haven't measured)
│                   Spend 2 hours profiling before spending money on scaling

Node Details

Check 1: Active production impact

Command/method:

# Real-time error rate
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# P95 latency vs SLO
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Compare to SLO threshold (e.g., < 500ms)

# SLO burn rate — are we bleeding error budget fast?
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
  '(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / 0.001'
# (0.001 = the error budget of a 99.9% SLO)

# Check HPA — is autoscaler already at max replicas?
kubectl get hpa -n production
# Look for: REPLICAS == MAXPODS
What you're looking for: Error rate > 1% sustained, p95 latency > SLO threshold, or SLO burn rate > 14.4x = production impact, scale immediately. Common pitfall: p50 latency looks fine but p99 is terrible. Always check multiple percentiles. Users experience tail latency, not median.
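The 14.4x threshold falls out of a simple ratio: burn rate is the observed error rate divided by the error budget (1 - SLO). As a sketch, assuming a 99.9% SLO and a 30-day budget window:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = observed error rate / error budget (1 - SLO)."""
    return error_rate / (1 - slo)

def budget_exhausted_days(rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, days until the window's entire budget is spent."""
    return window_days / rate

# A sustained 1.44% error rate against a 99.9% SLO is a 14.4x burn,
# which exhausts a 30-day budget in about 2 days — page-worthy.
rate = burn_rate(0.0144, 0.999)
print(round(rate, 1))                         # 14.4
print(round(budget_exhausted_days(rate), 2))  # 2.08
```

The same arithmetic explains why multi-window alerts pair 14.4x over 1h with a longer confirmation window: one bad minute should not page anyone.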

Check 2: Scaling availability

Command/method:

# Check HPA current vs max replicas
kubectl get hpa -n production -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentReplicas}{"\t"}{.spec.maxReplicas}{"\n"}{end}'

# Check node capacity available for new pods
kubectl describe nodes | grep -A5 "Allocatable:"
kubectl describe nodes | grep -A10 "Allocated resources:"

# Check AWS/GCP/Azure quota
aws service-quotas list-service-quotas --service-code ec2 | \
  jq '.Quotas[] | select(.QuotaName | contains("Running On-Demand"))'

# Check if DB can scale (connections, read replicas)
kubectl exec -it db-pod -- psql -c "SELECT count(*) FROM pg_stat_activity;"
kubectl exec -it db-pod -- psql -c "SHOW max_connections;"
What you're looking for: HPA max replicas not yet hit, nodes have allocatable CPU/memory, cloud quota not exhausted = scaling is possible. Common pitfall: HPA can scale pods but there are no nodes to schedule them on. Pod stays Pending while production burns. Check node capacity AND cluster autoscaler status together.
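Spotting an autoscaler that is already pinned at max is easy to miss by eye across many HPAs; a short script over `kubectl get hpa -o json` output can flag them. A sketch, assuming the autoscaling/v2 field layout:

```python
import json

def maxed_out_hpas(hpa_json: str) -> list:
    """Return names of HPAs already at maxReplicas — scaling them is a no-op."""
    doc = json.loads(hpa_json)
    return [
        item["metadata"]["name"]
        for item in doc["items"]
        if item["status"]["currentReplicas"] >= item["spec"]["maxReplicas"]
    ]

# Simulated `kubectl get hpa -n production -o json` output:
sample = json.dumps({"items": [
    {"metadata": {"name": "myapp"},
     "spec": {"maxReplicas": 20}, "status": {"currentReplicas": 20}},
    {"metadata": {"name": "worker"},
     "spec": {"maxReplicas": 10}, "status": {"currentReplicas": 4}},
]})
print(maxed_out_hpas(sample))  # ['myapp']
```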

Check 3 / Check 4: Bottleneck identification

Command/method:

# CPU metrics per pod
kubectl top pods -n production --sort-by=cpu | head -10

# Memory metrics per pod
kubectl top pods -n production --sort-by=memory | head -10

# CPU throttling (cgroup throttling = CPU limit too low, not always actual bottleneck)
kubectl exec -it myapp-pod -- cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled
# (on cgroup v2 nodes the same counters live in /sys/fs/cgroup/cpu.stat)

# I/O wait (high iowait = disk or network bound)
kubectl exec -it myapp-pod -- top -bn1 | grep "Cpu(s)"
kubectl exec -it myapp-pod -- iostat -x 1 5

# DB slow queries
kubectl exec -it db-pod -- psql -c \
  "SELECT query, calls, total_time/calls as avg_ms, rows
   FROM pg_stat_statements
   ORDER BY total_time DESC LIMIT 10;"
# PostgreSQL 13+ split total_time into total_plan_time + total_exec_time

# Lock contention
kubectl exec -it db-pod -- psql -c \
  "SELECT pid, wait_event_type, wait_event, query
   FROM pg_stat_activity
   WHERE wait_event IS NOT NULL;"

# Network saturation
kubectl exec -it myapp-pod -- cat /proc/net/dev
What you're looking for: Which resource is the limiting factor — CPU near 100%, memory with OOM events, iowait > 20%, or DB query times accounting for > 50% of request latency. Common pitfall: HPA targets and most utilization dashboards express CPU as a percentage of the pod's request, not absolute usage. A pod requesting 100m but using 800m reads as 800% of its request, yet may never throttle if its limit is generous. Check absolute usage (kubectl top), requests, and limits together.
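To make the requests-vs-limits distinction concrete, here is a minimal sketch (the 800m/100m figures mirror the pitfall above):

```python
def cpu_utilization(usage_m: int, request_m: int, limit_m: int) -> dict:
    """Express one pod's CPU usage against both its request and its limit.

    Percent-of-request is what HPA and many dashboards chart;
    percent-of-limit is what predicts cgroup throttling.
    """
    return {
        "pct_of_request": 100 * usage_m / request_m,
        "pct_of_limit": 100 * usage_m / limit_m,
    }

# The pitfall case: requesting 100m, using 800m, with a 1000m limit.
u = cpu_utilization(usage_m=800, request_m=100, limit_m=1000)
print(u)  # {'pct_of_request': 800.0, 'pct_of_limit': 80.0}
```

The same pod is simultaneously "at 800%" (request-based view, which drives HPA scale-up) and "at 80%" (limit-based view, which says throttling is near but not yet happening).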

Check 5: Application code vs infrastructure bottleneck

Command/method:

# Profile the application (Go example)
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/profile?seconds=30 \
  -o /tmp/cpu.prof
kubectl cp myapp-pod:/tmp/cpu.prof ./cpu.prof   # the profile is inside the pod
go tool pprof -top ./cpu.prof | head -20

# Node.js profiling
kubectl exec -it myapp-pod -- kill -USR1 1  # SIGUSR1 attaches the inspector
# Then capture a CPU profile via the inspector, or restart with node --prof

# JVM profiling (Java/Kotlin/Scala)
kubectl exec -it myapp-pod -- jcmd 1 VM.native_memory
kubectl exec -it myapp-pod -- jmap -histo 1 | head -20

# Check if CPU time is in syscalls (infrastructure) vs userspace (application)
kubectl exec -it myapp-pod -- timeout -s INT 10 strace -c -p 1
# (strace prints its syscall summary when interrupted; requires strace in the
#  image and ptrace capability)
# High % time in epoll_wait = waiting for I/O (infrastructure)
# High % time in user functions = application code
What you're looking for: Top functions in profiler output accounting for > 20% of CPU time = clear optimization target. If CPU is spread across hundreds of small functions with no obvious hot path = likely scaling is needed. Common pitfall: Profiling in staging under low load gives a different profile than production under high load. Profile in production (with appropriate safety measures) or use load testing that matches production traffic patterns.

Check 7: Memory leak vs high steady-state usage

Command/method:

# Memory growth over time — is it monotonically increasing?
kubectl exec -it prometheus-pod -- promtool query range http://localhost:9090 \
  --start=$(date -d '6 hours ago' +%s) --end=$(date +%s) --step=5m \
  'container_memory_working_set_bytes{pod=~"myapp-.*"}'

# Heap dump for JVM
kubectl exec -it myapp-pod -- jmap -dump:format=b,file=/tmp/heap.hprof 1
kubectl cp myapp-pod:/tmp/heap.hprof ./heap.hprof
# Then analyze with Eclipse MAT or similar

# Go memory profiling
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/heap -o /tmp/heap.prof
kubectl cp myapp-pod:/tmp/heap.prof ./heap.prof   # copy out before analyzing
go tool pprof ./heap.prof

# Check for open file descriptor leaks (common in connection leaks)
kubectl exec -it myapp-pod -- ls /proc/1/fd | wc -l
kubectl exec -it myapp-pod -- grep "open files" /proc/1/limits  # per-process limit
What you're looking for: Memory growing 10%+ per hour without leveling off = leak. Memory stable at a high level = sizing issue. OOM kills after running for X hours = leak at a specific rate. Common pitfall: the JVM's sawtooth heap pattern can make normal growth between collections look like a leak. Compare heap occupancy after full GCs, not snapshots taken mid-cycle.
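The "growing vs leveling off" judgment can be made mechanical by fitting a slope to the sampled working-set values from the Prometheus range query above. A sketch (the 10%/hour threshold is illustrative, not canonical):

```python
def leak_slope_pct_per_hour(samples_mb, interval_min=5):
    """Least-squares slope of memory samples, as % of starting RSS per hour.

    A sustained slope well above zero with no plateau suggests a leak;
    flat-but-high usage is a sizing problem instead.
    """
    n = len(samples_mb)
    xs = [i * interval_min / 60 for i in range(n)]  # sample times in hours
    mx = sum(xs) / n
    my = sum(samples_mb) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, samples_mb)) / \
            sum((x - mx) ** 2 for x in xs)
    return 100 * slope / samples_mb[0]

# One hour of 5-minute samples: 500 MB growing 60 MB/hour = 12%/hour → likely leak
leaking = [500 + 5 * i for i in range(12)]
print(round(leak_slope_pct_per_hour(leaking), 1))  # 12.0
```

For noisy production data, fit over several hours and check that the slope does not shrink in later windows (a shrinking slope means the usage is leveling off).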

Check 8: Hot query / code path identification

Command/method:

# Find the most expensive queries (PostgreSQL)
kubectl exec -it db-pod -- psql -c \
  "SELECT LEFT(query, 80) as query,
          calls,
          round(total_time::numeric, 2) as total_ms,
          round((total_time/calls)::numeric, 2) as avg_ms,
          round(100.0 * total_time / nullif(sum(total_time) OVER (), 0), 2) as pct
   FROM pg_stat_statements
   WHERE calls > 100
   ORDER BY total_time DESC
   LIMIT 15;"
# PostgreSQL 13+: total_time was split into total_plan_time + total_exec_time

# Check for missing indexes
kubectl exec -it db-pod -- psql -c \
  "SELECT schemaname, tablename, attname, n_distinct, correlation
   FROM pg_stats
   WHERE tablename = 'orders'
   ORDER BY n_distinct;"

# Check query plans
kubectl exec -it db-pod -- psql -c \
  "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 12345;"
# Look for: Seq Scan on large tables = missing index

# APM trace data — which endpoint / operation is slowest?
# (Datadog, Jaeger, Tempo, etc. — check your APM tool)
What you're looking for: A single query accounting for > 20% of total DB time, or a Seq Scan on a table with > 100K rows = high ROI optimization target. Common pitfall: Query time in the DB looks fast (< 10ms) but the service is slow — the bottleneck is query volume (N+1 problem). Check queries-per-second and the call stack that generates each query.
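The N+1 pitfall is easiest to see in code: each individual query is fast, but the round-trip count scales with N. A sketch with a stand-in driver (`fake_run_query` is a hypothetical placeholder for a real DB client):

```python
def fetch_orders_n_plus_1(user_ids, run_query):
    """One query per user: each is fast, the aggregate is slow."""
    return [run_query("SELECT * FROM orders WHERE user_id = %s", (uid,))
            for uid in user_ids]

def fetch_orders_batched(user_ids, run_query):
    """One query total: the usual fix for the N+1 pattern."""
    return run_query("SELECT * FROM orders WHERE user_id = ANY(%s)", (user_ids,))

queries = []
def fake_run_query(sql, params):  # stand-in that only counts round trips
    queries.append(sql)
    return []

fetch_orders_n_plus_1(range(100), fake_run_query)
n_plus_1_count = len(queries)

queries.clear()
fetch_orders_batched(list(range(100)), fake_run_query)
print(n_plus_1_count, len(queries))  # 100 1
```

This is why pg_stat_statements shows the N+1 query as "fast": its avg_ms is tiny while its calls column is enormous. Sort by calls as well as by total time.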

Check 11: Prior optimization history

Command/method:

# Search git history for performance optimization work
git log --all --oneline -i --grep=perf --grep=optimi --grep=profil | head -20
# (multiple --grep patterns are OR'd; -i makes the match case-insensitive)

# Search JIRA/Linear/GitHub issues for past performance work
gh issue list --label "performance" --state closed --limit 50

# Check if there's a capacity planning doc
ls /workspace/runbooks/ | grep -i "capacity\|scaling\|performance"

# Check APM for historical baselines (has p95 always been this high?)
# Look at 90-day trend in your APM tool
What you're looking for: Recent performance work with documented findings = you've likely hit architecture limits, scale. No profiling data or profiling was done > 6 months ago = profile again before scaling. Common pitfall: "We already optimized this 2 years ago" — the optimization was for a different load profile. Profiling is not a one-time activity; re-profile as load characteristics change.


Terminal Actions

✅ Action: Scale Now + Profile and Optimize in Parallel (Production Impact)

Do:

# STEP 1: Scale immediately to stop the bleeding
# Kubernetes HPA — increase max replicas
kubectl patch hpa myapp -n production -p \
  '{"spec":{"maxReplicas":20}}'

# Or manual scale for immediate relief
kubectl scale deployment myapp -n production --replicas=12

# STEP 2: Verify pods are coming up
kubectl get pods -n production -l app=myapp -w
kubectl rollout status deployment/myapp -n production --timeout=5m

# STEP 3: Confirm error rate dropping
watch -n10 'kubectl exec prometheus-pod -- promtool query instant http://localhost:9090 \
  "sum(rate(http_requests_total{status=~\"5..\"}[2m]))"'

# STEP 4: Open a profiling task while scaled (do not defer indefinitely)
gh issue create --repo org/myapp \
  --title "Profile myapp after scaling event on $(date +%Y-%m-%d)" \
  --label "performance,post-incident" \
  --body "Scaled to 12 replicas during production impact. Must profile and optimize to avoid recurring scaling events. Assign this sprint."

# STEP 5: Begin profiling in background (on a non-prod replica if possible)
POD=$(kubectl get pods -l app=myapp -o name | tail -1 | cut -d/ -f2)
kubectl exec -it "$POD" -- \
  curl -s localhost:6060/debug/pprof/profile?seconds=60 -o /tmp/cpu.prof
kubectl cp "$POD":/tmp/cpu.prof ./cpu.prof
Verify: Error rate returns to baseline, p95 latency within SLO, pods in Ready state. Confirm a profiling ticket is created before closing the incident. Runbook: scaling-emergency.md

✅ Action: Profile and Fix Specific Bottleneck (Non-Urgent)

Do:

# 1. Capture baseline metrics
kubectl exec prometheus-pod -- promtool query instant http://localhost:9090 \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))' \
  | tee /tmp/baseline-p95.txt

# 2. Run CPU profile under production-like load
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/profile?seconds=60 \
  -o /tmp/cpu.prof
kubectl cp myapp-pod:/tmp/cpu.prof ./cpu.prof   # profile is written inside the pod
go tool pprof -svg ./cpu.prof > /tmp/flamegraph.svg

# 3. Implement the targeted fix (guided by profiler, not guesswork)
# 4. Deploy to staging and measure
# 5. Run load test to validate improvement
k6 run --vus 100 --duration 5m load-test.js | tee /tmp/post-fix-results.txt

# 6. Compare: did the fix actually help? (the two files have different formats,
#    so compare the p95 numbers, not the raw text)
grep "p(95)" /tmp/post-fix-results.txt
cat /tmp/baseline-p95.txt

# 7. Deploy to production and monitor
kubectl rollout status deployment/myapp -n production
Verify: p95 latency improved by at least 20% after fix. If improvement is < 5%, the hot path was misidentified — profile again. Runbook: performance-profiling.md
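The 20% / 5% verification thresholds are simple to apply once both p95 values are extracted; a tiny helper (the sample values are illustrative):

```python
def latency_improvement_pct(baseline_ms: float, after_ms: float) -> float:
    """Percent reduction in p95 latency; negative means a regression."""
    return 100 * (baseline_ms - after_ms) / baseline_ms

print(latency_improvement_pct(480, 360))  # 25.0 → fix validated, ship it
print(latency_improvement_pct(480, 468))  # 2.5  → hot path misidentified, re-profile
```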

✅ Action: Horizontal Pod Autoscaler Tuning

Do:

# Review current HPA configuration
kubectl get hpa myapp -n production -o yaml

# Adjust scale-up/scale-down behavior
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # Scale up before hitting 70%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Respond quickly to load spikes
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale down slowly to avoid thrashing
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
EOF

# Verify HPA is scaling correctly
kubectl describe hpa myapp -n production | grep -A20 "Events:"
Verify: HPA scales up proactively at 60% CPU (before the SLO threshold is hit), and scales down slowly over 5 minutes to prevent oscillation.
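The HPA controller's scaling rule is desired = ceil(currentReplicas * currentMetric / targetMetric); sketching it shows why the 60% target in the manifest above buys headroom before the 70% danger zone:

```python
from math import ceil

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Kubernetes HPA scaling formula:
    desired = ceil(current * currentMetric / targetMetric)."""
    return ceil(current_replicas * current_utilization / target_utilization)

# With a 60% target, a CPU spike to 90% on 6 replicas scales to 9 in one step:
print(hpa_desired_replicas(6, 90, 60))  # 9
# Even a small overshoot (61% on 4 replicas) rounds up, so scale-up is eager:
print(hpa_desired_replicas(4, 61, 60))  # 5
```

The ceil() is also why the scaleDown stabilization window matters: without it, the same formula would shed replicas the moment utilization dips, causing oscillation.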

✅ Action: Database Query Optimization Before Scaling DB

Do:

# 1. Identify the worst query
kubectl exec -it db-pod -- psql -c \
  "SELECT LEFT(query, 100), calls, round((total_time/calls)::numeric, 2) as avg_ms
   FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"

# 2. Capture EXPLAIN ANALYZE for the worst query
kubectl exec -it db-pod -- psql -c \
  "EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';"

# 3. Create index if Seq Scan on high-cardinality column
kubectl exec -it db-pod -- psql -c \
  "CREATE INDEX CONCURRENTLY idx_orders_user_status ON orders(user_id, status) WHERE status = 'pending';"
# CONCURRENTLY = does not lock the table during creation

# 4. Verify index is used
kubectl exec -it db-pod -- psql -c \
  "EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';"
# Should now show: Index Scan using idx_orders_user_status

# 5. Check query time after index creation
kubectl exec -it db-pod -- psql -c \
  "SELECT round((total_time/calls)::numeric, 2) as avg_ms FROM pg_stat_statements
   WHERE query LIKE '%orders%user_id%' ORDER BY total_time DESC LIMIT 1;"
Verify: Query average execution time drops by at least 50%. EXPLAIN shows Index Scan instead of Seq Scan.

✅ Action: Cache Hot Data to Reduce Compute

Do:

# 1. Identify what's being computed repeatedly (same inputs, same outputs)
# Look for: high hit rate on specific endpoints, repeated DB queries with same params

# 2. Add Redis caching layer
kubectl apply -f kubernetes/production/redis-cache.yaml

# 3. Implement cache-aside pattern in application
# Example (pseudo-code):
# cache_key = f"user:{user_id}:profile"
# result = redis.get(cache_key)
# if result is None:
#     result = db.query("SELECT * FROM users WHERE id = %s", user_id)
#     redis.setex(cache_key, 300, serialize(result))  # 5-minute TTL

# 4. Monitor cache hit rate
kubectl exec -it redis-pod -- redis-cli INFO stats | grep "keyspace_hits\|keyspace_misses"
# Target: hit rate > 80% for effective caching

# 5. Set resource limits on Redis
kubectl set resources deployment/redis --limits=cpu=500m,memory=1Gi
Verify: Cache hit rate > 80%, DB query volume drops by proportional amount, application latency decreases.
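The cache-aside pseudo-code in step 3 can be made runnable. This sketch keeps the same get/setex shape as a Redis client but injects the cache and DB so it runs anywhere; FakeCache and FakeDB are stand-ins, not real drivers:

```python
import json

def get_user_profile(user_id, cache, db, ttl=300):
    """Cache-aside: read from cache, fall back to the DB, populate on miss."""
    key = f"user:{user_id}:profile"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    row = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    cache.setex(key, ttl, json.dumps(row))  # TTL bounds staleness (5 min here)
    return row

class FakeCache(dict):
    """Dict-backed stand-in exposing the get/setex subset used above."""
    def setex(self, key, ttl, value):
        self[key] = value  # TTL ignored in the fake

class FakeDB:
    """Stand-in driver that counts round trips."""
    calls = 0
    def query(self, sql, params):
        self.calls += 1
        return {"id": params[0], "name": "demo"}

cache, db = FakeCache(), FakeDB()
get_user_profile(42, cache, db)
get_user_profile(42, cache, db)  # second call served from cache
print(db.calls)  # 1
```

The TTL is the key design choice: too short and the DB sees little relief, too long and users see stale profiles. Start with the read-to-write ratio of the data, not a guess.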

⚠️ Warning: Scaling Does Not Fix Memory Leaks

When: Memory is growing monotonically over time (not leveling off), OOM kills happen on a predictable schedule, or pods need regular restarts to recover. Risk: Scaling horizontally adds more pods that will all eventually OOM. You have more capacity temporarily, but the leak multiplies across all pods and the failure recurs. Mitigation: Scale as a temporary measure to reduce immediate pressure, but fix the leak immediately. A memory leak in production is a time-limited emergency, not a capacity problem.

⚠️ Warning: Contention Does Not Scale Horizontally

When: The bottleneck is a single resource that all replicas compete for — a database lock, a single-threaded queue, a global mutex, or a singleton external API rate limit. Risk: Adding more pods increases contention on the shared resource, making throughput worse or causing deadlocks and cascade failures. Mitigation: Identify the lock/contention point. Options: increase parallelism of the contended resource, use optimistic locking, add sharding, or use a non-blocking queue. Horizontal scaling without addressing contention is counterproductive.
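Why more replicas can make contention worse is captured by Gunther's Universal Scalability Law: throughput(n) = n / (1 + sigma*(n-1) + kappa*n*(n-1)), where sigma models contention on the shared resource and kappa the coherency/crosstalk cost. A sketch (the coefficients are illustrative, fit yours from load-test data):

```python
def usl_throughput(n, sigma=0.05, kappa=0.02):
    """Universal Scalability Law: relative throughput of n workers given
    contention (sigma) and coherency/crosstalk (kappa) penalties."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With even a modest coherency cost, throughput peaks and then DECLINES:
for n in (1, 4, 8, 16):
    print(n, round(usl_throughput(n), 2))
# 16 replicas deliver less than 8 here — exactly the counterproductive
# scaling this warning describes.
```

When kappa is zero the curve reduces to Amdahl's law and merely flattens; a nonzero kappa (cross-replica coordination, lock convoys, cache invalidation) is what makes it bend downward.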


Edge Cases

  • Event-driven spike vs organic growth: A spike from a marketing campaign or news event is temporary — scale to handle it, then scale back down. A sustained trend from organic growth needs a capacity planning conversation and potentially architectural changes, not just more pods.
  • Scaling a stateful service: Scaling a stateful service (Kafka consumer group, Elasticsearch data node, database) is not as simple as adjusting replicas. Each stateful component has specific scale-out procedures involving data redistribution, rebalancing, and potential downtime.
  • "Scaling the database" means different things: Adding read replicas helps read-heavy workloads. Vertical scaling helps CPU/memory-bound workloads. Sharding helps write-heavy workloads. These require different architectural decisions — clarify which type of DB scaling is needed before deciding to optimize vs scale.
  • Cloud cost cliff: Some cloud services have non-linear cost curves. Scaling from 10 to 20 pods may cost 2x, but scaling from 20 to 30 may trigger a different tier at 5x the cost. Check pricing curves before scaling blindly.

Cross-References