The Load Balancer Lied

Tags: lesson, health-checks, connection-draining, l4-vs-l7, sticky-sessions, graceful-shutdown, l2
Topics: health checks, connection draining, L4 vs L7, sticky sessions, graceful shutdown
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic networking (see "What Happens When You Click a Link")
The Mission¶
The dashboard shows all backends healthy. Green checkmarks everywhere. But users are seeing 500 errors. You curl the health endpoint — 200 OK. You curl the actual API — 500 Internal Server Error.
The load balancer is lying. Not maliciously — it's checking the wrong thing. The health check passes (the web framework responds) while the application is broken (the database connection is dead). This is one of the most common and most subtle failure modes in production.
How Health Checks Actually Work¶
A load balancer periodically sends requests to each backend. If the backend responds correctly, it's "healthy." If it doesn't, the backend is removed from the pool.
Load Balancer → GET /health → Backend A: 200 OK ✓ (in pool)
Load Balancer → GET /health → Backend B: 200 OK ✓ (in pool)
Load Balancer → GET /health → Backend C: no response ✗ (removed)
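The loop above can be sketched in a few lines. This is an illustrative model, not any real load balancer's code: backends are modeled as functions that return an HTTP status code or raise when there is no response.

```python
def check_backends(backends, healthy_status=200):
    """Return the set of backend names that stay in the pool."""
    pool = set()
    for name, probe in backends.items():
        try:
            status = probe()          # e.g. GET /health
        except ConnectionError:
            continue                  # no response -> removed from pool
        if status == healthy_status:
            pool.add(name)            # correct response -> stays in pool
    return pool

def probe_ok():
    return 200

def probe_down():
    raise ConnectionError("no response")

backends = {"A": probe_ok, "B": probe_ok, "C": probe_down}
print(sorted(check_backends(backends)))  # ['A', 'B'] — C was removed
```

Real load balancers add intervals and failure thresholds on top of this loop, which matter later in this lesson.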
The shallow health check trap¶
# BAD — always returns 200 if the web framework is running
@app.get("/health")
def health():
    return {"status": "ok"}

# This checks: "is Python running?" Not: "can the app serve requests?"
The framework is running. Python is alive. But the database connection pool is exhausted, Redis is unreachable, and the disk is full. The health check returns 200 while every real request returns 500.
# GOOD — checks actual dependencies
from fastapi.responses import JSONResponse

@app.get("/health")
def health():
    checks = {}

    # Check database
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = str(e)
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)

    # Check Redis
    try:
        redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = str(e)
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)

    return {"status": "healthy", "checks": checks}
Gotcha: Deep health checks can cause cascading failures. If the database is slow (not down), the health check times out. The LB removes the backend. Now all traffic goes to remaining backends, which also time out on the health check. All backends removed. Health checks killed your entire service because the database was slow.
Fix: Separate liveness (shallow: "is the process alive?") from readiness (deep: "can it serve traffic?"). Liveness should never check dependencies. Readiness should.
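The liveness/readiness split can be sketched as two separate handlers. Framework plumbing is stripped out so the logic stands alone: `db_ok` and `redis_ok` are stand-ins for real dependency probes, and each handler returns a `(status_code, body)` pair instead of a framework response object.

```python
def liveness():
    # Shallow: never touches dependencies. Only answers "is the process alive?"
    # A slow database must NOT make this fail.
    return 200, {"status": "alive"}

def readiness(db_ok, redis_ok):
    # Deep: checks the dependencies a real request would need.
    checks = {"database": "ok" if db_ok else "down",
              "redis": "ok" if redis_ok else "down"}
    if db_ok and redis_ok:
        return 200, {"status": "ready", "checks": checks}
    return 503, {"status": "unhealthy", "checks": checks}

print(liveness()[0])                             # 200 even when deps are down
print(readiness(db_ok=True, redis_ok=False)[0])  # 503 -> drained from the pool
```

The LB (or Kubernetes) routes traffic based on readiness, and only restarts the process when liveness fails, so a slow database drains traffic without triggering restarts.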
L4 vs L7: What the Load Balancer Sees¶
Layer 4 (TCP)¶
The LB works with TCP connections. It can see: source IP, destination IP, ports. It CANNOT see: HTTP headers, URLs, cookies, response bodies.
L4 health check: "Can I open a TCP connection to port 8080?"
→ Connection succeeds = healthy
→ Connection refused = unhealthy
→ Timeout = unhealthy
An L4 health check only proves the port is open. The app could be returning 500 on every request, and the L4 check still shows green.
Layer 7 (HTTP)¶
The LB terminates the HTTP connection and inspects the request/response. It can see everything: headers, status codes, URLs, cookies.
L7 health check: "Does GET /health return 200?"
→ 200 OK = healthy
→ 503 = unhealthy
→ Timeout = unhealthy
L7 is more accurate but more expensive (must parse HTTP, terminate TLS).
| | L4 (TCP) | L7 (HTTP) |
|---|---|---|
| What it checks | TCP connection succeeds | HTTP status code |
| Can see content | No | Yes (headers, body, URL) |
| Can route by path | No | Yes (/api → backend A, /static → backend B) |
| TLS termination | No (passthrough) or yes | Yes (always) |
| Performance | Faster (less processing) | Slower (HTTP parsing) |
| False positives | High (port open ≠ app works) | Lower (can check response) |
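The "false positives" row can be demonstrated with the standard library alone. The sketch below stands up a local HTTP server that answers every request with 500, then probes it both ways: the L4 check (TCP connect) passes, the L7 check (HTTP status) fails. The server and probe functions are illustrative, not production health-check code.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Broken(BaseHTTPRequestHandler):
    def do_GET(self):                  # every request fails at the app layer
        self.send_response(500)
        self.end_headers()
    def log_message(self, *args):      # silence per-request logging
        pass

srv = HTTPServer(("127.0.0.1", 0), Broken)   # port 0 = pick a free port
threading.Thread(target=srv.serve_forever, daemon=True).start()
port = srv.server_address[1]

def l4_healthy():
    # L4: "can I open a TCP connection to the port?"
    try:
        socket.create_connection(("127.0.0.1", port), timeout=1).close()
        return True
    except OSError:
        return False

def l7_healthy():
    # L7: "does GET /health return 200?"
    try:
        urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=1)
        return True
    except urllib.error.HTTPError:
        return False

l4, l7 = l4_healthy(), l7_healthy()
print(l4, l7)   # True False — the port is open, but the app is broken
srv.shutdown()
```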
Connection Draining: The Deploy Disaster¶
You deploy a new version. The LB removes the old backend and adds the new one. But the old backend had 150 active connections mid-request. Without draining, those connections get RST (reset) — 150 users see "connection reset" errors.
WITHOUT draining:
11:00:00 — Deploy starts
11:00:01 — Old backend removed from LB pool
11:00:01 — 150 active connections killed (RST)
11:00:01 — 150 users see errors
WITH draining:
11:00:00 — Deploy starts
11:00:01 — Old backend marked "draining" (no NEW connections)
11:00:01 — 150 existing connections continue normally
11:00:30 — Last connection finishes
11:00:30 — Old backend removed (0 active connections)
11:00:30 — 0 users see errors
# HAProxy: drain a server (stop new connections, finish existing)
echo "set server app_servers/app1 state drain" | socat stdio /var/run/haproxy.sock
# Kubernetes: this happens automatically during rolling updates
# Pod gets SIGTERM → preStop hook → terminationGracePeriodSeconds
# During this window, the pod is removed from Endpoints (no new traffic)
# but existing requests finish
Gotcha: If your terminationGracePeriodSeconds (default 30s) is shorter than your longest request, some requests will be killed by SIGKILL when the grace period expires. Long-running requests (file uploads, report generation) need longer grace periods.
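The gotcha can be made concrete with a toy model of draining: requests are represented only by how long they take, and anything still running when the grace period expires is killed. This is a sketch of the timing math, not a real server.

```python
def drain(in_flight, grace_period):
    """Return (completed, killed) given in-flight request durations in seconds."""
    completed = [d for d in in_flight if d <= grace_period]
    killed = [d for d in in_flight if d > grace_period]  # SIGKILL victims
    return completed, killed

# 30s grace period (the Kubernetes default) vs a 45s report-generation request:
done, killed = drain(in_flight=[2, 10, 45], grace_period=30)
print(len(done), len(killed))  # 2 1 — the 45s request is killed mid-flight
```

The fix is mechanical: set the grace period from your observed worst-case request duration (plus margin), not from the default.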
Sticky Sessions: The Hidden Coupling¶
Sticky sessions (session affinity) route a user to the same backend for the duration of their session. Sounds useful. It's usually a trap.
Request 1 from User A → Backend 1 (session created in memory)
Request 2 from User A → Backend 1 (sticky — session found)
Request 3 from User A → Backend 1 (still sticky)
Backend 1 crashes.
Request 4 from User A → Backend 2 (session is GONE — logged out, cart empty)
Sticky sessions hide a design problem: storing session state in local memory instead of an external store (Redis, database). When the backend dies, the session dies with it.
Mental Model: Sticky sessions are training wheels. They let you pretend each server is independent while actually coupling users to specific servers. Remove the training wheels: store sessions externally. Then any backend can serve any user, and backend failures are transparent.
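The external-store fix can be sketched with a shared dict standing in for Redis (an assumption — in production this would be a real Redis or database client). Because session state lives outside any backend, a crash of the backend that created the session loses nothing.

```python
session_store = {}   # stand-in for Redis/database, shared by ALL backends

class Backend:
    def __init__(self, name):
        self.name = name

    def handle(self, user, cart_item=None):
        # Look the session up in the EXTERNAL store, never in local memory.
        session = session_store.setdefault(user, {"cart": []})
        if cart_item is not None:
            session["cart"].append(cart_item)
        return self.name, session["cart"]

b1, b2 = Backend("backend-1"), Backend("backend-2")
b1.handle("alice", cart_item="book")   # session created via backend-1
del b1                                 # backend-1 "crashes"
name, cart = b2.handle("alice")        # backend-2 sees the same session
print(name, cart)                      # backend-2 ['book'] — nothing lost
```

With this design the LB needs no stickiness at all: any request can go to any backend.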
The Health Check Cascade¶
Aggressive health checks can kill a healthy system:
Normal:
Health check interval: 1s, fail threshold: 2 misses
All 5 backends healthy, serving 1000 req/s
CPU spike (batch job, GC pause):
Backend A health check takes 1.2s (just over 1s interval)
LB misses 2 consecutive checks → marks A unhealthy → removes from pool
Traffic redistributes:
4 backends now handle 1000 req/s (was 200 each, now 250 each)
CPU on remaining backends increases by 25%
Cascade:
Backend B now under more load → health check latency increases
Backend B misses checks → removed
3 backends handle 1000 req/s (333 each) → more load → more removals
Eventually: 1 backend handles all traffic → crashes → total outage
War Story: A team set health check interval to 1 second with failure threshold of 2. A brief CPU spike from a JVM full GC (300ms pause) caused one backend to miss two consecutive checks. The LB removed it. The increased load on remaining backends caused more GC pauses, more missed checks, more removals. All 5 backends were removed within 90 seconds. The original GC pause lasted 300ms. The cascading outage lasted 12 minutes.
Fix: Health check interval 5-10s, failure threshold 3-5. Slow start (gradually increase traffic to recovering backends). Don't make health checks faster just because you want "faster detection" — the cascade risk outweighs the detection speed.
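The cascade is easy to reproduce in a toy simulation. The assumptions are deliberately simple (and mine, not from any real LB): each backend's health-check latency grows linearly with its share of the load, and any backend whose latency exceeds the check interval is removed, shifting its load onto the survivors.

```python
def simulate_cascade(backends, total_load, interval, latency_per_load):
    """Return a list of (pool_size, check_latency) rounds until stable or empty."""
    pool = list(backends)
    rounds = []
    while pool:
        per_backend = total_load / len(pool)
        latency = round(per_backend * latency_per_load, 2)
        rounds.append((len(pool), latency))
        if latency <= interval:   # everyone passes; the system is stable
            break
        pool.pop()                # slowest backend removed; load redistributes
    return rounds

# 5 backends, 1000 req/s; latency crosses a 1s interval at >~167 req/s each
print(simulate_cascade(["A", "B", "C", "D", "E"], 1000, 1.0, 0.006))
# With a 5s interval the same spike is absorbed in one round:
print(simulate_cascade(["A", "B", "C", "D", "E"], 1000, 5.0, 0.006))
```

With the 1s interval every round fails and the pool empties (total outage); with the 5s interval the first round already passes, which is exactly the argument for the slower defaults above.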
Flashcard Check¶
Q1: Health check returns 200 but users see 500s. What's wrong?
Shallow health check. The check verifies the framework is running, not that dependencies (database, Redis, etc.) are working. Use deep readiness checks.
Q2: L4 vs L7 health check — which catches a broken app returning 500?
L7. It checks the HTTP status code. L4 only checks if the TCP port is open — an app returning 500 on every request still has port 8080 open.
Q3: What is connection draining?
When removing a backend, stop sending NEW connections but let existing connections finish. Without draining, active connections are killed (RST) during deploys.
Q4: Why are sticky sessions usually a bad idea?
They couple users to specific backends. When that backend dies, the session is lost. Store sessions externally (Redis) so any backend can serve any user.
Q5: How can aggressive health checks cause an outage?
Short interval + low failure threshold → brief CPU spike removes a backend → remaining backends get more load → more miss health checks → cascade removes all backends.
Cheat Sheet¶
Health Check Configuration¶
| Setting | Safe default | Too aggressive |
|---|---|---|
| Interval | 5-10s | 1s (cascade risk) |
| Timeout | 3-5s | 1s (GC pauses trigger false failures) |
| Failure threshold | 3-5 misses | 1-2 (too sensitive) |
| Success threshold | 2 passes | 1 (premature return) |
Health Check Types¶
| Type | What it checks | Use for |
|---|---|---|
| TCP (L4) | Port is open | Basic connectivity |
| HTTP (L7) shallow | Framework responds 200 | Liveness ("is it alive?") |
| HTTP (L7) deep | Dependencies work | Readiness ("can it serve?") |
HAProxy Commands¶
| Task | Command |
|---|---|
| Show server status | echo "show stat" \| socat stdio /var/run/haproxy.sock |
| Drain server | echo "set server pool/server1 state drain" \| socat ... |
| Enable server | echo "set server pool/server1 state ready" \| socat ... |
| Disable server | echo "set server pool/server1 state maint" \| socat ... |
Takeaways¶
- Shallow health checks lie. "Port is open" and "framework responds" don't mean "app works." Check actual dependencies in readiness probes.
- But deep checks cascade. If the database is slow, deep checks time out and the LB removes all backends. Separate liveness (shallow) from readiness (deep).
- Connection draining prevents deploy errors. Without it, active requests get RST during deploys. Set terminationGracePeriodSeconds longer than your longest request.
- Health check timing is a tradeoff. Fast detection (1s interval) = high cascade risk. Safe detection (10s interval) = slower failover. Default to 5-10s, fail after 3-5 misses.
- Sticky sessions are training wheels. Store sessions externally. Then any backend serves any user, and failover is transparent.
Related Lessons¶
- The Cascading Timeout — when health check cascades cause total outage
- Connection Refused — what users see when backends are removed
- Deploy a Web App From Nothing — the Nginx reverse proxy layer
- The Nginx Config That Broke Everything — proxy configuration gotchas