Decision Tree: Scale Up or Optimize First?¶
Category: Operational Decisions
Starting Question: "The system is struggling under load — should I scale up or optimize?"
Estimated traversal: 3-5 minutes
Domains: performance, capacity-planning, SRE, infrastructure, profiling
The Tree¶
The system is struggling under load — scale up or optimize first?
│
├── [Check 1] Is there active production impact RIGHT NOW?
│ │ (users experiencing errors, latency > SLO, revenue path degraded)
│ │
│ ├── YES — production is impacted
│ │ ├── → ✅ SCALE IMMEDIATELY — do not wait to optimize
│ │ │ Scaling is reversible; user impact is not
│ │ │
│ │ └── [Check 2] CAN you scale? (autoscaler headroom, node capacity, quota)
│ │ ├── YES → scale now, profile and optimize in parallel (see terminal action)
│ │ └── NO (quota exhausted, no nodes available, DB can't scale horizontally)
│ │ ├── [Check 3] What is the bottleneck?
│ │ │ ├── CPU → can you shed load? (circuit breaker, rate limit, queue)
│ │ │ ├── Memory → can you reduce object retention? (GC tuning, buffer sizes)
│ │ │ ├── DB connections → can you add a connection pooler (PgBouncer)?
│ │ │ ├── I/O → can you add a read replica or cache layer?
│ │ │ └── Network → check for N+1 queries or chatty protocols
│ │ └── → ✅ EMERGENCY OPTIMIZATION (targeted, fastest path to relief)
│ │
│ └── NO — system is struggling but not yet impacted
│ (latency trending up, error budget burn accelerating, capacity alert)
│ │
│ ├── [Check 4] What is the bottleneck resource?
│ │ │
│ │ ├── CPU bound (CPU > 70% sustained, CPU throttling in cgroups)
│ │ │ ├── [Check 5] Is this application code or infrastructure?
│ │ │ │ ├── Application code (profiler shows hot path, inefficient algorithm)
│ │ │ │ │ ├── [Check 6] Is the fix scoped and estimated < 3 days of work?
│ │ │ │ │ │ ├── YES → ✅ PROFILE AND FIX (optimize first)
│ │ │ │ │ │ └── NO (requires architectural change, months of work)
│ │ │ │ │ │ └── → ✅ SCALE NOW, put optimization on roadmap
│ │ │ │ └── Infrastructure (serialization overhead, GC pressure, syscall volume)
│ │ │ │ └── → ✅ SCALE + INVESTIGATE (infrastructure tuning takes time)
│ │ │
│ │ ├── Memory bound (OOM kills, swap usage, GC frequency high)
│ │ │ ├── [Check 7] Is there a memory leak? (memory grows unbounded over time)
│ │ │ │ ├── YES (heap dumps show growing retention) → ✅ FIX THE LEAK first
│ │ │ │ │ Scaling just delays the OOM; the leak will fill any amount of RAM
│ │ │ │ └── NO (steady-state usage is just high)
│ │ │ │ └── → ✅ SCALE MEMORY + set resource limits + alert on trend
│ │ │
│ │ ├── I/O bound (disk or network saturated, high I/O wait, slow DB queries)
│ │ │ ├── [Check 8] Is there a specific query or code path consuming > 80% of I/O?
│ │ │ │ ├── YES (single query, hot table, missing index identified)
│ │ │ │ │ └── → ✅ OPTIMIZE FIRST (targeted fix, high ROI, faster than scaling DB)
│ │ │ │ └── NO (I/O spread across many operations)
│ │ │ │ ├── [Check 9] Is load temporary or permanent?
│ │ │ │ │ ├── Temporary spike (event, batch, seasonal)
│ │ │ │ │ │ └── → ✅ SCALE for spike + optimize during quiet period
│ │ │ │ │ └── Permanent growth (organic traffic increase)
│ │ │ │ │ └── → ✅ SCALE + CACHE HOT DATA + alert on disk trend
│ │ │
│ │ └── Lock / contention bound (DB deadlocks, mutex contention, queue saturation)
│ │ ├── [Check 10] Is the contention in application code or DB schema?
│ │ │ ├── Application code (single mutex, synchronous queue)
│ │ │ │ └── → ✅ OPTIMIZE FIRST — horizontal scaling doesn't help contention
│ │ │ └── DB schema (hot row, table-level lock, missing index on join column)
│ │ │ └── → ✅ DB QUERY OPTIMIZATION before scaling DB
│ │
│ └── [Check 11] Has optimization been attempted before?
│ ├── YES (profiling done, quick wins taken, known architecture limits)
│ │ └── → ✅ SCALE — you've hit the optimization ceiling for now
│ └── NO (no profiling data, operating on assumptions)
│ └── → ✅ PROFILE FIRST (don't scale what you haven't measured)
│ Spend 2 hours profiling before spending money on scaling
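The top two checks of the tree can be sketched as a tiny function. This is illustrative only; triage and its yes/no inputs are invented for the sketch, not a real tool:

```shell
# Checks 1-2 of the tree as a function: production impact decides urgency,
# scaling headroom decides the action. Inputs are "yes"/"no" strings.
triage() {
  local prod_impact="$1"   # Check 1: users impacted right now?
  local can_scale="$2"     # Check 2: autoscaler headroom / quota left?
  if [ "$prod_impact" = "yes" ]; then
    if [ "$can_scale" = "yes" ]; then
      echo "SCALE NOW, profile and optimize in parallel"
    else
      echo "EMERGENCY OPTIMIZATION: shed load, pool connections, cache"
    fi
  else
    echo "Find the bottleneck (Check 4), profile before scaling"
  fi
}

triage yes yes   # production impact + headroom
triage yes no    # production impact, cannot scale
triage no  -     # no impact yet
```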
Node Details¶
Check 1: Active production impact¶
Command/method:
# Real-time error rate (promtool query takes the server URL as its first argument)
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# P95 latency vs SLO
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Compare to SLO threshold (e.g., < 500ms)
# SLO burn rate — are we bleeding error budget fast?
kubectl exec -it prometheus-pod -- promtool query instant http://localhost:9090 \
'(1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / 0.001'
# Check HPA — is autoscaler already at max replicas?
kubectl get hpa -n production
# Look for: REPLICAS == MAXPODS
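The burn-rate query divides the observed error fraction by the error budget (the 0.001 divisor corresponds to a 99.9% SLO). The same arithmetic standalone, with made-up counts:

```shell
# Burn rate = observed error fraction / allowed error fraction.
# 1.0 means the budget exhausts exactly at the end of the SLO window;
# >1 means faster. Counts below are fabricated for illustration.
errors=50; total=100000; slo_target=0.999
burn_rate=$(awk -v e="$errors" -v t="$total" -v s="$slo_target" \
  'BEGIN { printf "%.2f", (e / t) / (1 - s) }')
echo "$burn_rate"   # 0.05% errors against a 0.1% budget -> 0.50
```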
Check 2: Scaling availability¶
Command/method:
# Check HPA current vs max replicas
kubectl get hpa -n production -o \
jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentReplicas}{"\t"}{.spec.maxReplicas}{"\n"}{end}'
# Check node capacity available for new pods
kubectl describe nodes | grep -A5 "Allocatable:"
kubectl describe nodes | grep -A10 "Allocated resources:"
# Check AWS/GCP/Azure quota
aws service-quotas list-service-quotas --service-code ec2 | \
jq '.Quotas[] | select(.QuotaName | contains("Running On-Demand"))'
# Check if DB can scale (connections, read replicas)
kubectl exec -it db-pod -- psql -c "SELECT count(*) FROM pg_stat_activity;"
kubectl exec -it db-pod -- psql -c "SHOW max_connections;"
Gotcha: if the cluster autoscaler cannot provision nodes, newly scheduled pods sit Pending while production burns. Check node capacity AND cluster autoscaler status together.
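The current/max replica numbers from the jsonpath query can be turned into a go/no-go. The headroom helper below is illustrative, not a real command:

```shell
# Given current replicas and maxReplicas (e.g. from the HPA jsonpath
# query above), report remaining scaling headroom.
headroom() {
  local current="$1" max="$2"
  if [ "$current" -ge "$max" ]; then
    echo "NO HEADROOM: at maxReplicas ($max), raise maxReplicas or optimize"
  else
    echo "OK: $((max - current)) replicas of headroom"
  fi
}

headroom 12 20
headroom 20 20
```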
Check 3 / Check 4: Bottleneck identification¶
Command/method:
# CPU metrics per pod
kubectl top pods -n production --sort-by=cpu | head -10
# Memory metrics per pod
kubectl top pods -n production --sort-by=memory | head -10
# CPU throttling (cgroup throttling = CPU limit too low, not always actual bottleneck)
# cgroup v1 path shown; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat
kubectl exec -it myapp-pod -- cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled
# I/O wait (high iowait = disk or network bound)
kubectl exec -it myapp-pod -- top -bn1 | grep "Cpu(s)"
kubectl exec -it myapp-pod -- iostat -x 1 5
# DB slow queries (requires pg_stat_statements; on PostgreSQL 13+ the
# column is total_exec_time rather than total_time)
kubectl exec -it db-pod -- psql -c \
"SELECT query, calls, total_time/calls as avg_ms, rows
FROM pg_stat_statements
ORDER BY total_time DESC LIMIT 10;"
# Lock contention
kubectl exec -it db-pod -- psql -c \
"SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;"
# Network saturation
kubectl exec -it myapp-pod -- cat /proc/net/dev
Gotcha: kubectl top shows actual usage, which can look healthy even when limits are set high. Check both requests and actual utilization.
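The cpu.stat counters are easier to read as a throttle percentage. A sketch fed a fabricated sample; in a pod you would pipe the cat output in instead:

```shell
# Fraction of scheduler periods in which the container was throttled,
# from the nr_periods / nr_throttled fields of cgroup cpu.stat.
throttle_pct() {
  awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2}
       END { if (p > 0) printf "%.1f\n", 100 * t / p; else print "0.0" }'
}

# Sample cpu.stat content (made up): 250 of 1000 periods throttled
printf 'nr_periods 1000\nnr_throttled 250\nthrottled_time 12000000\n' | throttle_pct
# -> 25.0  (sustained >25% throttling usually means the CPU limit is too low)
```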
Check 5: Application code vs infrastructure bottleneck¶
Command/method:
# Profile the application (Go example; assumes net/http/pprof is serving on :6060)
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/profile?seconds=30 \
-o /tmp/cpu.prof
# Copy the profile out of the pod before analyzing it locally
kubectl cp myapp-pod:/tmp/cpu.prof ./cpu.prof
go tool pprof -top ./cpu.prof | head -20
# Node.js profiling: SIGUSR1 starts the inspector (not a CPU profiler by itself)
kubectl exec -it myapp-pod -- kill -USR1 1
# Then attach Chrome DevTools (or use clinic.js / 0x) to collect a flamegraph
# JVM profiling (Java/Kotlin/Scala)
kubectl exec -it myapp-pod -- jcmd 1 VM.native_memory   # needs -XX:NativeMemoryTracking=summary
kubectl exec -it myapp-pod -- jmap -histo 1 | head -20
# Check if CPU time is in syscalls (infrastructure) vs userspace (application)
# (strace needs the SYS_PTRACE capability; timeout ends the trace and prints the summary)
kubectl exec -it myapp-pod -- timeout 10 strace -c -p 1
# High % time in epoll_wait = waiting for I/O (infrastructure)
# Low overall syscall time = CPU is being burned in application code
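To quantify the split, you can sum the %-time column of the strace -c summary for wait-style syscalls. Illustrative sketch with an inlined, fabricated sample summary:

```shell
# Sum the %-time column (field 1) for syscalls that mean "waiting",
# reading strace -c summary rows from stdin.
wait_share() {
  awk '$NF ~ /^(epoll_wait|poll|select|futex|nanosleep)$/ { w += $1 }
       END { printf "%.1f%% of traced time in wait syscalls\n", w }'
}

# Sample strace -c rows (invented): %time seconds usecs/call calls syscall
printf ' 62.50 0.005000 50 100 epoll_wait\n 20.00 0.002000 10 200 read\n' | wait_share
# -> 62.5% of traced time in wait syscalls
```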
Check 7: Memory leak vs high steady-state usage¶
Command/method:
# Memory growth over time — is it monotonically increasing?
kubectl exec -it prometheus-pod -- promtool query range http://localhost:9090 \
--start=$(date -d '6 hours ago' +%s) --end=$(date +%s) --step=300 \
'container_memory_working_set_bytes{pod=~"myapp-.*"}'
# Heap dump for JVM
kubectl exec -it myapp-pod -- jmap -dump:format=b,file=/tmp/heap.hprof 1
kubectl cp myapp-pod:/tmp/heap.hprof ./heap.hprof
# Then analyze with Eclipse MAT or similar
# Go memory profiling
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/heap -o /tmp/heap.prof
kubectl cp myapp-pod:/tmp/heap.prof ./heap.prof
go tool pprof ./heap.prof
# Check for open file descriptor leaks (common in connection leaks)
kubectl exec -it myapp-pod -- ls /proc/1/fd | wc -l
# Compare against the per-process limit, not the system-wide file-max
kubectl exec -it myapp-pod -- grep "open files" /proc/1/limits
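A crude leak heuristic over the sampled memory values (oldest first): a series that never dips is suspicious, one that rises and falls is likely steady-state load. The trend helper and its numbers are invented; real samples would come from the range query above:

```shell
# Read one memory sample per line (oldest first) and flag whether the
# series ever decreases.
trend() {
  awk 'NR > 1 && $1 < prev { dip = 1 } { prev = $1 }
       END { if (dip) print "dips observed, likely steady-state load";
             else print "monotonic growth, investigate a leak" }'
}

printf '100\n140\n180\n230\n' | trend   # always rising
printf '100\n180\n120\n175\n' | trend   # rises and falls
```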
Check 8: Hot query / code path identification¶
Command/method:
# Find the most expensive queries (PostgreSQL)
kubectl exec -it db-pod -- psql -c \
"SELECT LEFT(query, 80) as query,
calls,
round(total_time::numeric, 2) as total_ms,
round((total_time/calls)::numeric, 2) as avg_ms,
round(100.0 * total_time / nullif(sum(total_time) OVER (), 0), 2) as pct
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_time DESC
LIMIT 15;"
# Check for missing indexes
kubectl exec -it db-pod -- psql -c \
"SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY n_distinct;"
# Check query plans
kubectl exec -it db-pod -- psql -c \
"EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 12345;"
# Look for: Seq Scan on large tables = missing index
# APM trace data — which endpoint / operation is slowest?
# (Datadog, Jaeger, Tempo, etc. — check your APM tool)
Check 11: Prior optimization history¶
Command/method:
# Search git history for performance optimization work
git log --all --oneline --grep="perf" --grep="optimize" --grep="profil" | head -20
# Search JIRA/Linear/GitHub issues for past performance work
gh issue list --label "performance" --state closed --limit 50
# Check if there's a capacity planning doc
ls /workspace/runbooks/ | grep -i "capacity\|scaling\|performance"
# Check APM for historical baselines (has p95 always been this high?)
# Look at 90-day trend in your APM tool
Terminal Actions¶
✅ Action: Scale Now + Profile and Optimize in Parallel (Production Impact)¶
Do:
# STEP 1: Scale immediately to stop the bleeding
# Kubernetes HPA — increase max replicas
kubectl patch hpa myapp -n production -p \
'{"spec":{"maxReplicas":20}}'
# Or manual scale for immediate relief
kubectl scale deployment myapp -n production --replicas=12
# STEP 2: Verify pods are coming up
kubectl get pods -n production -l app=myapp -w
kubectl rollout status deployment/myapp -n production --timeout=5m
# STEP 3: Confirm error rate dropping
watch -n10 'kubectl exec prometheus-pod -- promtool query instant http://localhost:9090 \
"sum(rate(http_requests_total{status=~\"5..\"}[2m]))"'
# STEP 4: Open a profiling task while scaled (do not defer indefinitely)
gh issue create --repo org/myapp \
--title "Profile myapp after scaling event on $(date +%Y-%m-%d)" \
--label "performance,post-incident" \
--body "Scaled to 12 replicas during production impact. Must profile and optimize to avoid recurring scaling events. Assign this sprint."
# STEP 5: Begin profiling in background (on a non-prod replica if possible)
kubectl exec -it $(kubectl get pods -l app=myapp -o name | tail -1) -- \
curl localhost:6060/debug/pprof/profile?seconds=60 -o /tmp/cpu.prof
✅ Action: Profile and Fix Specific Bottleneck (Non-Urgent)¶
Do:
# 1. Capture baseline metrics
kubectl exec prometheus-pod -- promtool query instant http://localhost:9090 \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))' \
| tee /tmp/baseline-p95.txt
# 2. Run CPU profile under production-like load, then copy it out of the pod
kubectl exec -it myapp-pod -- curl -s localhost:6060/debug/pprof/profile?seconds=60 \
-o /tmp/cpu.prof
kubectl cp myapp-pod:/tmp/cpu.prof ./cpu.prof
go tool pprof -svg ./cpu.prof > /tmp/flamegraph.svg
# 3. Implement the targeted fix (guided by profiler, not guesswork)
# 4. Deploy to staging and measure
# 5. Run load test to validate improvement
k6 run --vus 100 --duration 5m load-test.js | tee /tmp/post-fix-p95.txt
# 6. Compare: did the fix actually help?
diff /tmp/baseline-p95.txt /tmp/post-fix-p95.txt
# 7. Deploy to production and monitor
kubectl rollout status deployment/myapp -n production
✅ Action: Horizontal Pod Autoscaler Tuning¶
Do:
# Review current HPA configuration
kubectl get hpa myapp -n production -o yaml
# Adjust scale-up/scale-down behavior
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60 # Scale up before hitting 70%
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Respond quickly to load spikes
policies:
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Scale down slowly to avoid thrashing
policies:
- type: Percent
value: 25
periodSeconds: 60
EOF
# Verify HPA is scaling correctly
kubectl describe hpa myapp -n production | grep -A20 "Events:"
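For sizing intuition: the HPA's core formula is desired = ceil(currentReplicas * currentUtilization / targetUtilization). A quick calculator (desired_replicas is an illustrative helper, not a real command):

```shell
# ceil(current * currentUtil / targetUtil), the HPA scaling formula.
desired_replicas() {
  awk -v c="$1" -v u="$2" -v t="$3" \
    'BEGIN { d = c * u / t; r = int(d); if (d > r) r++; print r }'
}

desired_replicas 10 90 60   # 10 replicas at 90% CPU, 60% target -> 15
desired_replicas 3 45 60    # already under target -> 3 (no change)
```

This is why lowering averageUtilization from 70 to 60 makes the HPA scale out earlier: the denominator shrinks, so the same observed load yields a larger desired count.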
✅ Action: Database Query Optimization Before Scaling DB¶
Do:
# 1. Identify the worst query
kubectl exec -it db-pod -- psql -c \
"SELECT LEFT(query, 100), calls, round((total_time/calls)::numeric, 2) as avg_ms
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
# 2. Capture EXPLAIN ANALYZE for the worst query
kubectl exec -it db-pod -- psql -c \
"EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';"
# 3. Create index if Seq Scan on high-cardinality column
kubectl exec -it db-pod -- psql -c \
"CREATE INDEX CONCURRENTLY idx_orders_user_status ON orders(user_id, status) WHERE status = 'pending';"
# CONCURRENTLY = does not lock the table during creation
# 4. Verify index is used
kubectl exec -it db-pod -- psql -c \
"EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';"
# Should now show: Index Scan using idx_orders_user_status
# 5. Check query time after index creation
kubectl exec -it db-pod -- psql -c \
"SELECT round((total_time/calls)::numeric, 2) as avg_ms FROM pg_stat_statements
WHERE query LIKE '%orders%user_id%' ORDER BY total_time DESC LIMIT 1;"
Success criterion: EXPLAIN shows Index Scan instead of Seq Scan.
✅ Action: Cache Hot Data to Reduce Compute¶
Do:
# 1. Identify what's being computed repeatedly (same inputs, same outputs)
# Look for: high hit rate on specific endpoints, repeated DB queries with same params
# 2. Add Redis caching layer
kubectl apply -f kubernetes/production/redis-cache.yaml
# 3. Implement cache-aside pattern in application
# Example (pseudo-code):
# cache_key = f"user:{user_id}:profile"
# result = redis.get(cache_key)
# if result is None:
# result = db.query("SELECT * FROM users WHERE id = %s", user_id)
# redis.setex(cache_key, 300, serialize(result)) # 5-minute TTL
# 4. Monitor cache hit rate
kubectl exec -it redis-pod -- redis-cli INFO stats | grep "keyspace_hits\|keyspace_misses"
# Target: hit rate > 80% for effective caching
# 5. Set resource limits on Redis
kubectl set resources deployment/redis --limits=cpu=500m,memory=1Gi
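The hit rate follows from the two counters above. Sketch with sample INFO lines inlined (in a pod, pipe the real redis-cli INFO stats output in instead):

```shell
# hit rate = keyspace_hits / (keyspace_hits + keyspace_misses),
# parsed from Redis INFO stats "key:value" lines on stdin.
hit_rate() {
  awk -F: '/^keyspace_hits/ {h=$2} /^keyspace_misses/ {m=$2}
           END { if (h + m > 0) printf "%.1f%%\n", 100 * h / (h + m) }'
}

# Sample counters (invented): 9000 hits, 1000 misses
printf 'keyspace_hits:9000\nkeyspace_misses:1000\n' | hit_rate   # -> 90.0%
```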
⚠️ Warning: Scaling Does Not Fix Memory Leaks¶
When: Memory is growing monotonically over time (not leveling off), OOM kills happen on a predictable schedule, or pods need regular restarts to recover.
Risk: Scaling horizontally adds more pods that will all eventually OOM. You have more capacity temporarily, but the leak multiplies across all pods and the failure recurs.
Mitigation: Scale as a temporary measure to reduce immediate pressure, but fix the leak immediately. A memory leak in production is a time-limited emergency, not a capacity problem.
⚠️ Warning: Contention Does Not Scale Horizontally¶
When: The bottleneck is a single resource that all replicas compete for — a database lock, a single-threaded queue, a global mutex, or a singleton external API rate limit.
Risk: Adding more pods increases contention on the shared resource, making throughput worse or causing deadlocks and cascade failures.
Mitigation: Identify the lock/contention point. Options: increase parallelism of the contended resource, use optimistic locking, add sharding, or use a non-blocking queue. Horizontal scaling without addressing contention is counterproductive.
Edge Cases¶
- Event-driven spike vs organic growth: A spike from a marketing campaign or news event is temporary — scale to handle it, then scale back down. A sustained trend from organic growth needs a capacity planning conversation and potentially architectural changes, not just more pods.
- Scaling a stateful service: Scaling a stateful service (Kafka consumer group, Elasticsearch data node, database) is not as simple as adjusting replicas. Each stateful component has specific scale-out procedures involving data redistribution, rebalancing, and potential downtime.
- "Scaling the database" means different things: Adding read replicas helps read-heavy workloads. Vertical scaling helps CPU/memory-bound workloads. Sharding helps write-heavy workloads. These require different architectural decisions — clarify which type of DB scaling is needed before deciding to optimize vs scale.
- Cloud cost cliff: Some cloud services have non-linear cost curves. Scaling from 10 to 20 pods may cost 2x, but scaling from 20 to 30 may trigger a different tier at 5x the cost. Check pricing curves before scaling blindly.
Cross-References¶
- Topic Packs: Performance, capacity-planning, kubernetes
- Runbooks: scaling-emergency.md, performance-profiling.md
- Related trees: should-i-page.md, rollback-or-fix-forward.md