The Load Balancer Lied

Tags: lesson, health-checks, connection-draining, l4-vs-l7, sticky-sessions, graceful-shutdown, l2
Topics: health checks, connection draining, L4 vs L7, sticky sessions, graceful shutdown
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic networking (see "What Happens When You Click a Link")
The Mission¶
The dashboard shows all backends healthy. Green checkmarks everywhere. But users are seeing 500 errors. You curl the health endpoint — 200 OK. You curl the actual API — 500 Internal Server Error.
The load balancer is lying. Not maliciously — it's checking the wrong thing. The health check passes (the web framework responds) while the application is broken (the database connection is dead). This is one of the most common and most subtle failure modes in production.
How Health Checks Actually Work¶
A load balancer periodically sends requests to each backend. If the backend responds correctly, it's "healthy." If it doesn't, the backend is removed from the pool.
Load Balancer → GET /health → Backend A: 200 OK ✓ (in pool)
Load Balancer → GET /health → Backend B: 200 OK ✓ (in pool)
Load Balancer → GET /health → Backend C: no response ✗ (removed)
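The loop above can be sketched in a few lines. This is an illustrative model, not any real load balancer's code: backends are modeled as functions that return an HTTP status code or raise when there is no response.

```python
def check_backends(backends, healthy_status=200):
    """Return the set of backend names that stay in the pool."""
    pool = set()
    for name, probe in backends.items():
        try:
            status = probe()          # e.g. GET /health
        except ConnectionError:
            continue                  # no response -> removed from pool
        if status == healthy_status:
            pool.add(name)            # correct response -> stays in pool
    return pool

def probe_ok():
    return 200

def probe_down():
    raise ConnectionError("no response")

backends = {"A": probe_ok, "B": probe_ok, "C": probe_down}
print(sorted(check_backends(backends)))  # ['A', 'B'] — C was removed
```

Real load balancers add intervals and failure thresholds on top of this loop, which matter later in this lesson.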
The shallow health check trap¶
# BAD — always returns 200 if the web framework is running
@app.get("/health")
def health():
    return {"status": "ok"}

# This checks: "is Python running?" Not: "can the app serve requests?"
The framework is running. Python is alive. But the database connection pool is exhausted, Redis is unreachable, and the disk is full. The health check returns 200 while every real request returns 500.
# GOOD — checks actual dependencies
from fastapi.responses import JSONResponse

@app.get("/health")
def health():
    checks = {}

    # Check database
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = str(e)
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)

    # Check Redis
    try:
        redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = str(e)
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)

    return {"status": "healthy", "checks": checks}
Gotcha: Deep health checks can cause cascading failures. If the database is slow (not down), the health check times out. The LB removes the backend. Now all traffic goes to remaining backends, which also time out on the health check. All backends removed. Health checks killed your entire service because the database was slow.
Fix: Separate liveness (shallow: "is the process alive?") from readiness (deep: "can it serve traffic?"). Liveness should never check dependencies. Readiness should.
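The liveness/readiness split can be sketched as two separate handlers. Framework plumbing is stripped out so the logic stands alone: `db_ok` and `redis_ok` are stand-ins for real dependency probes, and each handler returns a `(status_code, body)` pair instead of a framework response object.

```python
def liveness():
    # Shallow: never touches dependencies. Only answers "is the process alive?"
    # A slow database must NOT make this fail.
    return 200, {"status": "alive"}

def readiness(db_ok, redis_ok):
    # Deep: checks the dependencies a real request would need.
    checks = {"database": "ok" if db_ok else "down",
              "redis": "ok" if redis_ok else "down"}
    if db_ok and redis_ok:
        return 200, {"status": "ready", "checks": checks}
    return 503, {"status": "unhealthy", "checks": checks}

print(liveness()[0])                             # 200 even when deps are down
print(readiness(db_ok=True, redis_ok=False)[0])  # 503 -> drained from the pool
```

The LB (or Kubernetes) routes traffic based on readiness, and only restarts the process when liveness fails, so a slow database drains traffic without triggering restarts.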
L4 vs L7: What the Load Balancer Sees¶
Layer 4 (TCP)¶
The LB works with TCP connections. It can see: source IP, destination IP, ports. It CANNOT see: HTTP headers, URLs, cookies, response bodies.
L4 health check: "Can I open a TCP connection to port 8080?"
→ Connection succeeds = healthy
→ Connection refused = unhealthy
→ Timeout = unhealthy
An L4 health check only proves the port is open. The app could be returning 500 on every request, and the L4 check still shows green.
Layer 7 (HTTP)¶
The LB terminates the HTTP connection and inspects the request/response. It can see everything: headers, status codes, URLs, cookies.
L7 health check: "Does GET /health return 200?"
→ 200 OK = healthy
→ 503 = unhealthy
→ Timeout = unhealthy
L7 is more accurate but more expensive (must parse HTTP, terminate TLS).
| | L4 (TCP) | L7 (HTTP) |
|---|---|---|
| What it checks | TCP connection succeeds | HTTP status code |
| Can see content | No | Yes (headers, body, URL) |
| Can route by path | No | Yes (/api → backend A, /static → backend B) |
| TLS termination | No (passthrough) or yes | Yes (always) |
| Performance | Faster (less processing) | Slower (HTTP parsing) |
| False positives | High (port open ≠ app works) | Lower (can check response) |
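The "false positives" row can be demonstrated with the standard library alone. The sketch below stands up a local HTTP server that answers every request with 500, then probes it both ways: the L4 check (TCP connect) passes, the L7 check (HTTP status) fails. The server and probe functions are illustrative, not production health-check code.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Broken(BaseHTTPRequestHandler):
    def do_GET(self):                  # every request fails at the app layer
        self.send_response(500)
        self.end_headers()
    def log_message(self, *args):      # silence per-request logging
        pass

srv = HTTPServer(("127.0.0.1", 0), Broken)   # port 0 = pick a free port
threading.Thread(target=srv.serve_forever, daemon=True).start()
port = srv.server_address[1]

def l4_healthy():
    # L4: "can I open a TCP connection to the port?"
    try:
        socket.create_connection(("127.0.0.1", port), timeout=1).close()
        return True
    except OSError:
        return False

def l7_healthy():
    # L7: "does GET /health return 200?"
    try:
        urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=1)
        return True
    except urllib.error.HTTPError:
        return False

l4, l7 = l4_healthy(), l7_healthy()
print(l4, l7)   # True False — the port is open, but the app is broken
srv.shutdown()
```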
Connection Draining: The Deploy Disaster¶
You deploy a new version. The LB removes the old backend and adds the new one. But the old backend had 150 active connections mid-request. Without draining, those connections get RST (reset) — 150 users see "connection reset" errors.
WITHOUT draining:
11:00:00 — Deploy starts
11:00:01 — Old backend removed from LB pool
11:00:01 — 150 active connections killed (RST)
11:00:01 — 150 users see errors
WITH draining:
11:00:00 — Deploy starts
11:00:01 — Old backend marked "draining" (no NEW connections)
11:00:01 — 150 existing connections continue normally
11:00:30 — Last connection finishes
11:00:30 — Old backend removed (0 active connections)
11:00:30 — 0 users see errors
# HAProxy: drain a server (stop new connections, finish existing)
echo "set server app_servers/app1 state drain" | socat stdio /var/run/haproxy.sock
# Kubernetes: this happens automatically during rolling updates
# Pod gets SIGTERM → preStop hook → terminationGracePeriodSeconds
# During this window, the pod is removed from Endpoints (no new traffic)
# but existing requests finish
Gotcha: If your terminationGracePeriodSeconds (default 30s) is shorter than your longest request, some requests will be killed by SIGKILL when the grace period expires. Long-running requests (file uploads, report generation) need longer grace periods.
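The gotcha can be made concrete with a toy model of draining: requests are represented only by how long they take, and anything still running when the grace period expires is killed. This is a sketch of the timing math, not a real server.

```python
def drain(in_flight, grace_period):
    """Return (completed, killed) given in-flight request durations in seconds."""
    completed = [d for d in in_flight if d <= grace_period]
    killed = [d for d in in_flight if d > grace_period]  # SIGKILL victims
    return completed, killed

# 30s grace period (the Kubernetes default) vs a 45s report-generation request:
done, killed = drain(in_flight=[2, 10, 45], grace_period=30)
print(len(done), len(killed))  # 2 1 — the 45s request is killed mid-flight
```

The fix is mechanical: set the grace period from your observed worst-case request duration (plus margin), not from the default.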
Sticky Sessions: The Hidden Coupling¶
Sticky sessions (session affinity) route a user to the same backend for the duration of their session. Sounds useful. It's usually a trap.
Request 1 from User A → Backend 1 (session created in memory)
Request 2 from User A → Backend 1 (sticky — session found)
Request 3 from User A → Backend 1 (still sticky)
Backend 1 crashes.
Request 4 from User A → Backend 2 (session is GONE — logged out, cart empty)
Sticky sessions hide a design problem: storing session state in local memory instead of an external store (Redis, database). When the backend dies, the session dies with it.
Mental Model: Sticky sessions are training wheels. They let you pretend each server is independent while actually coupling users to specific servers. Remove the training wheels: store sessions externally. Then any backend can serve any user, and backend failures are transparent.
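The external-store fix can be sketched with a shared dict standing in for Redis (an assumption — in production this would be a real Redis or database client). Because session state lives outside any backend, a crash of the backend that created the session loses nothing.

```python
session_store = {}   # stand-in for Redis/database, shared by ALL backends

class Backend:
    def __init__(self, name):
        self.name = name

    def handle(self, user, cart_item=None):
        # Look the session up in the EXTERNAL store, never in local memory.
        session = session_store.setdefault(user, {"cart": []})
        if cart_item is not None:
            session["cart"].append(cart_item)
        return self.name, session["cart"]

b1, b2 = Backend("backend-1"), Backend("backend-2")
b1.handle("alice", cart_item="book")   # session created via backend-1
del b1                                 # backend-1 "crashes"
name, cart = b2.handle("alice")        # backend-2 sees the same session
print(name, cart)                      # backend-2 ['book'] — nothing lost
```

With this design the LB needs no stickiness at all: any request can go to any backend.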
The Health Check Cascade¶
Aggressive health checks can kill a healthy system:
Normal:
Health check interval: 1s, fail threshold: 2 misses
All 5 backends healthy, serving 1000 req/s
CPU spike (batch job, GC pause):
Backend A health check takes 1.2s (just over 1s interval)
LB misses 2 consecutive checks → marks A unhealthy → removes from pool
Traffic redistributes:
4 backends now handle 1000 req/s (was 200 each, now 250 each)
CPU on remaining backends increases by 25%
Cascade:
Backend B now under more load → health check latency increases
Backend B misses checks → removed
3 backends handle 1000 req/s (333 each) → more load → more removals
Eventually: 1 backend handles all traffic → crashes → total outage
War Story: A team set health check interval to 1 second with failure threshold of 2. A brief CPU spike from a JVM full GC (300ms pause) caused one backend to miss two consecutive checks. The LB removed it. The increased load on remaining backends caused more GC pauses, more missed checks, more removals. All 5 backends were removed within 90 seconds. The original GC pause lasted 300ms. The cascading outage lasted 12 minutes.
Fix: Health check interval 5-10s, failure threshold 3-5. Slow start (gradually increase traffic to recovering backends). Don't make health checks faster just because you want "faster detection" — the cascade risk outweighs the detection speed.
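The cascade is easy to reproduce in a toy simulation. The assumptions are deliberately simple (and mine, not from any real LB): each backend's health-check latency grows linearly with its share of the load, and any backend whose latency exceeds the check interval is removed, shifting its load onto the survivors.

```python
def simulate_cascade(backends, total_load, interval, latency_per_load):
    """Return a list of (pool_size, check_latency) rounds until stable or empty."""
    pool = list(backends)
    rounds = []
    while pool:
        per_backend = total_load / len(pool)
        latency = round(per_backend * latency_per_load, 2)
        rounds.append((len(pool), latency))
        if latency <= interval:   # everyone passes; the system is stable
            break
        pool.pop()                # slowest backend removed; load redistributes
    return rounds

# 5 backends, 1000 req/s; latency crosses a 1s interval at >~167 req/s each
print(simulate_cascade(["A", "B", "C", "D", "E"], 1000, 1.0, 0.006))
# With a 5s interval the same spike is absorbed in one round:
print(simulate_cascade(["A", "B", "C", "D", "E"], 1000, 5.0, 0.006))
```

With the 1s interval every round fails and the pool empties (total outage); with the 5s interval the first round already passes, which is exactly the argument for the slower defaults above.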
Flashcard Check¶
Q1: Health check returns 200 but users see 500s. What's wrong?
Shallow health check. The check verifies the framework is running, not that dependencies (database, Redis, etc.) are working. Use deep readiness checks.
Q2: L4 vs L7 health check — which catches a broken app returning 500?
L7. It checks the HTTP status code. L4 only checks if the TCP port is open — an app returning 500 on every request still has port 8080 open.
Q3: What is connection draining?
When removing a backend, stop sending NEW connections but let existing connections finish. Without draining, active connections are killed (RST) during deploys.
Q4: Why are sticky sessions usually a bad idea?
They couple users to specific backends. When that backend dies, the session is lost. Store sessions externally (Redis) so any backend can serve any user.
Q5: How can aggressive health checks cause an outage?
Short interval + low failure threshold → brief CPU spike removes a backend → remaining backends get more load → more miss health checks → cascade removes all backends.
Cheat Sheet¶
Health Check Configuration¶
| Setting | Safe default | Too aggressive |
|---|---|---|
| Interval | 5-10s | 1s (cascade risk) |
| Timeout | 3-5s | 1s (GC pauses trigger false failures) |
| Failure threshold | 3-5 misses | 1-2 (too sensitive) |
| Success threshold | 2 passes | 1 (premature return) |
Health Check Types¶
| Type | What it checks | Use for |
|---|---|---|
| TCP (L4) | Port is open | Basic connectivity |
| HTTP (L7) shallow | Framework responds 200 | Liveness ("is it alive?") |
| HTTP (L7) deep | Dependencies work | Readiness ("can it serve?") |
HAProxy Commands¶
| Task | Command |
|---|---|
| Show server status | echo "show stat" \| socat stdio /var/run/haproxy.sock |
| Drain server | echo "set server pool/server1 state drain" \| socat ... |
| Enable server | echo "set server pool/server1 state ready" \| socat ... |
| Disable server | echo "set server pool/server1 state maint" \| socat ... |
Takeaways¶
- Shallow health checks lie. "Port is open" and "framework responds" don't mean "app works." Check actual dependencies in readiness probes.
- But deep checks cascade. If the database is slow, deep checks time out and the LB removes all backends. Separate liveness (shallow) from readiness (deep).
- Connection draining prevents deploy errors. Without it, active requests get RST during deploys. Set terminationGracePeriodSeconds longer than your longest request.
- Health check timing is a tradeoff. Fast detection (1s interval) = high cascade risk. Safe detection (10s interval) = slower failover. Default to 5-10s, fail after 3-5 misses.
- Sticky sessions are training wheels. Store sessions externally. Then any backend serves any user, and failover is transparent.
Related Lessons¶
- The Cascading Timeout — when health check cascades cause total outage
- Connection Refused — what users see when backends are removed
- Deploy a Web App From Nothing — the Nginx reverse proxy layer
- The Nginx Config That Broke Everything — proxy configuration gotchas