Thinking Out Loud: Load Balancing

A senior SRE's internal monologue while working through a real load balancing issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

Users report that the checkout page is intermittently slow. About 1 in 4 requests takes 10+ seconds, while the rest are under 200ms. The backend is a fleet of 4 app servers behind an AWS ALB. Monitoring shows all four servers are healthy.

The Monologue

1 in 4 requests slow, 3 in 4 fast, behind a 4-server load balancer. My first thought: one of the four servers is sick but still passing health checks. Let me verify.

for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s %{remote_ip}\n" https://checkout.example.com/api/cart
done
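Eyeballing 20 timing lines works, but a quick awk pass over the loop's output makes the split explicit. A minimal sketch, using sample lines in the format `%{http_code} %{time_total}s` and an assumed 1-second threshold:

```shell
# Count fast vs slow responses from curl timing lines.
# Input format: "<http_code> <time_total>s ..."; 1s threshold is an assumption.
tally() {
  awk '{ if ($2+0 >= 1.0) slow++; else fast++ }
       END { printf "fast=%d slow=%d\n", fast+0, slow+0 }'
}

# Sample data standing in for the loop's output:
printf '200 0.182s\n200 10.412s\n200 0.143s\n200 0.151s\n' | tally
# prints: fast=3 slow=1
```

In practice you'd pipe the curl loop straight into `tally` instead of the sample `printf`.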

Interesting — the slow requests aren't pinned to one IP. But the IPs curl reports here are the ALB nodes, not the backends, so this view can't isolate a bad server anyway. Let me check backend health from the ALB's perspective.

aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/checkout-tg/abc123

All four targets: healthy. Health check interval is 30 seconds, path is /health. But /health is a lightweight endpoint — it returns 200 immediately. It doesn't test the actual checkout flow. A target can be "healthy" for the health check but "sick" for real requests.

Mental Model: Health Checks Test the Thermometer, Not the Patient

A health check that returns 200 from a static endpoint only proves the web server is running. It doesn't test database connectivity, cache availability, downstream dependencies, or resource contention. When health checks pass but users see errors, the health check isn't checking what matters. Deep health checks that test critical dependencies are worth the overhead.
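To make this concrete, here's a minimal sketch of the decision logic a deep health check would apply. The inputs (pool counters, a DB ping result) and the 90% saturation threshold are assumptions for illustration — in a real endpoint they'd come from the app's metrics and an actual dependency check:

```shell
# Sketch of deep health logic: unhealthy if the DB is unreachable or the
# connection pool is nearly saturated. Inputs are stand-ins for values a
# real /health/deep endpoint would gather itself.
deep_health() {
  active=$1 max=$2 db_ok=$3
  [ "$db_ok" = "ok" ] || { echo "unhealthy: db unreachable"; return 1; }
  # Flag saturation before it hits 100%: >90% utilization is already degraded.
  if [ $((active * 100 / max)) -gt 90 ]; then
    echo "unhealthy: pool ${active}/${max}"; return 1
  fi
  echo "healthy"
}

deep_health 8 20 ok    # prints: healthy
deep_health 20 20 ok   # prints: unhealthy: pool 20/20
```

The point isn't the thresholds — it's that the check fails for the same reasons real requests fail.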

Let me check each backend server individually. I need to bypass the ALB and hit each server directly.

for ip in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo -n "$ip: "
  curl -s -o /dev/null -w "%{time_total}s" -H "Host: checkout.example.com" http://$ip:8080/api/cart
  echo
done

10.0.1.10: 0.15s, 10.0.1.11: 0.12s, 10.0.1.12: 10.3s, 10.0.1.13: 0.14s. There it is. Server 10.0.1.12 is slow. The health check passes because /health never exercises the cart endpoint's code path or its database dependency. Let me SSH into that server.

ssh 10.0.1.12 'top -bn1 | head -20'

CPU: 12% user, 2% system, 86% idle. That's not CPU. Let me check the app-level metrics.

ssh 10.0.1.12 'curl -s localhost:8080/metrics | grep -E "http_request_duration|connection_pool"'

connection_pool_active: 20, connection_pool_max: 20, connection_pool_wait: 45. The database connection pool is completely saturated. 20/20 active connections and 45 requests waiting for a connection. This server's DB connections are all tied up.

But why only this server? All four should have the same pool config.

for ip in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo -n "$ip pool: "
  ssh $ip 'curl -s localhost:8080/metrics | grep connection_pool_active'
done

10: 8 active, 11: 6 active, 12: 20 active, 13: 7 active. Only server 12 is saturated. Something is consuming all its connections.
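This per-server comparison is worth automating: flag any backend whose pool is at its limit. A small sketch, assuming lines of `ip active max` (in practice assembled from each server's /metrics output):

```shell
# Flag backends whose connection pool is fully utilized.
# Input format "ip active max" is an assumption for this sketch.
flag_saturated() {
  awk '$2 >= $3 { print $1, "SATURATED (" $2 "/" $3 ")" }'
}

printf '10.0.1.10 8 20\n10.0.1.11 6 20\n10.0.1.12 20 20\n10.0.1.13 7 20\n' | flag_saturated
# prints: 10.0.1.12 SATURATED (20/20)
```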

ssh 10.0.1.12 'sudo ss -tnp | grep 5432 | wc -l'
ssh 10.0.1.12 'sudo ss -tnp | grep 5432 | head -5'

22 connections to the database. The pool max is 20. Two extra connections from... a cron job. Someone has a reporting cron on this one server that opens its own database connections, competing with the application's pool.

Mental Model: Connection Pool Starvation

When a connection pool is 100% utilized, every new request has to wait for a connection to be released. Response time becomes: normal processing time + wait time for a connection. A single slow query or a competing process can hold connections long enough to starve the pool. The load balancer doesn't know about this — it sees the server as healthy and keeps sending traffic.
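The queueing math here is simple and worth sanity-checking against the symptoms. Connections free up at roughly pool_size/hold_time per second, so a request behind a queue waits about queued × hold / pool seconds. With assumed numbers (4s average hold time, which isn't from the metrics above), the estimate lines up with the observed 10s latencies:

```shell
# Back-of-envelope wait estimate under pool saturation.
# queued * hold / pool ~= seconds a new request spends waiting for a connection.
est_wait() {
  queued=$1 hold=$2 pool=$3
  echo $((queued * hold / pool))
}

est_wait 45 4 20   # prints: 9  (~9s of queueing alone, before any processing)
```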

ssh 10.0.1.12 'ps aux | grep -iE "report|cron|etl" | grep -v grep'

There it is — a generate_monthly_report.py process running since an hour ago, holding 2 long-running database connections with heavy queries. This is a leftover from before the team moved reports to a dedicated server. This one still runs via cron on a production web server.

Let me stop the report and remove the cron entry.

ssh 10.0.1.12 'sudo kill $(pgrep -f generate_monthly_report) && sudo crontab -l -u www-data | grep -v monthly_report | sudo crontab -u www-data -'

Now let me verify the pool recovered.

ssh 10.0.1.12 'curl -s localhost:8080/metrics | grep connection_pool'

Active: 8, waiting: 0. Pool recovered. Let me test the endpoint.

curl -s -o /dev/null -w "%{time_total}s" -H "Host: checkout.example.com" http://10.0.1.12:8080/api/cart

0.14 seconds. Back to normal.

Now, the bigger issues I need to address:

1. The health check needs to test actual functionality, not just "is the process alive." I'll propose a /health/deep endpoint that checks DB connectivity and pool saturation.

2. The ALB should use least-connections routing instead of round-robin. With round-robin, the ALB doesn't know that server 12's requests are taking 10x longer — it keeps sending 25% of traffic there. Least-connections would naturally shift traffic to the faster servers.

aws elbv2 describe-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/checkout-tg/abc123 --query 'Attributes[?Key==`load_balancing.algorithm.type`]'

Round robin. Let me note the recommendation to switch to least outstanding requests (LOR). That's the ALB equivalent of least-connections and would have naturally mitigated this issue by routing fewer requests to the slow server.
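The switch itself is a one-line target group attribute change via `aws elbv2 modify-target-group-attributes`. To illustrate why it helps, here's a toy simulation of LOR's core behavior — route to the backend with the fewest outstanding requests (the counts are made-up sample data):

```shell
# The actual config change (arn elided):
#   aws elbv2 modify-target-group-attributes \
#     --target-group-arn <arn> \
#     --attributes Key=load_balancing.algorithm.type,Value=least_outstanding_requests

# Toy illustration: given "ip outstanding" lines, pick the least-loaded backend.
pick_backend() {
  sort -k2 -n | head -1 | cut -d' ' -f1
}

printf '10.0.1.10 3\n10.0.1.11 2\n10.0.1.12 45\n10.0.1.13 4\n' | pick_backend
# prints: 10.0.1.11
```

With round-robin, server 12 at 45 outstanding requests still gets every fourth request; LOR would starve it of new traffic until it drains.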

3. The cron job on a production server is a process hygiene issue. I'll audit all four servers for unexpected crons.

for ip in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "=== $ip ==="
  ssh $ip 'sudo crontab -l -u www-data 2>/dev/null; for u in root $(ls /home); do sudo crontab -l -u $u 2>/dev/null; done'
done

Checking... no other rogue crons on the other servers. Good. This was a one-off.

What Made This Senior-Level

  • Junior: Debug the application code trying to find the slow endpoint. Senior: Bypass the LB and test each backend individually to isolate the slow server. Why: Intermittent slowness behind an LB usually means one unhealthy backend, not a code bug.
  • Junior: Accept "all targets healthy" at face value. Senior: Question what the health check actually tests and find the gap. Why: A health check on /health doesn't test what matters if the issue is DB connection exhaustion.
  • Junior: Fix the immediate issue (kill the cron) and stop there. Senior: Also address the LB algorithm, health check depth, and audit for other rogue crons. Why: The immediate fix solves today; the systemic fixes prevent the next occurrence.
  • Junior: Not think about the LB algorithm. Senior: Recommend least-outstanding-requests to naturally mitigate single-server degradation. Why: Smart routing algorithms compensate for problems that round-robin ignores.

Key Heuristics Used

  1. Bypass the LB to Isolate: When intermittent issues exist behind a load balancer, test each backend directly to find the degraded server.
  2. Health Checks Must Test What Matters: A health check that doesn't test database connectivity, dependency health, and resource pool saturation is a false positive machine.
  3. Connection Pool Saturation: When a pool is 100% utilized, every new request queues. Look for competing processes, slow queries, or leaked connections.

Cross-References

  • Primer — Load balancing algorithms, health check design, and connection management
  • Street Ops — ALB debugging, target group inspection, and backend isolation techniques
  • Footguns — Shallow health checks, round-robin ignoring backend degradation, and cron jobs on production servers