
The Load Balancer Lie

Category: The Incident · Domains: load-balancing, monitoring · Read time: ~5 min


Setting the Scene

Series B startup, real-time analytics platform. About 60 engineers, maybe 30 services behind an AWS ALB. I was the second SRE hire, still building out our observability stack. We had Datadog for metrics, PagerDuty for alerting, and health checks on everything. Or so I thought. Our biggest customer was a Fortune 500 retail chain that used our dashboard during Black Friday. Which is when this story takes place.

What Happened

Black Friday, 7:42 AM — Our sales VP messages the SRE channel: "Customer says dashboard is showing errors. Can you check?" I look at our status page — all green. I check the ALB target group — all targets healthy. Datadog service dashboard — all green. I respond: "Everything looks healthy on our end. Can they send a screenshot?"

7:48 AM — They send a screenshot. It's a 502 Bad Gateway error. From our domain. On a page that definitely exists.

7:55 AM — I start digging. I curl our API from my machine: works fine. I curl it again: 502. Then fine. Then 502. Intermittent. I check the ALB access logs — about 30% of requests to /api/v2/analytics are returning 502. But the health check is green.
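The ~30% figure came straight from the access logs. Here's a minimal sketch of that triage, assuming the documented ALB access-log layout (the ELB status code is the 9th space-separated field and the request line is quoted — verify against your own logs before trusting the field index):

```python
import re

# Count the 502 rate for one path prefix across ALB access-log lines.
# Assumption: standard ALB log layout -- elb_status_code is field 9
# (index 8) and the request ("GET https://... HTTP/1.1") is quoted.
def error_rate(lines, path_prefix="/api/v2/analytics"):
    total = errors = 0
    for line in lines:
        m = re.search(r'"(?:GET|POST|PUT|DELETE|PATCH|HEAD) (\S+)', line)
        if not m or path_prefix not in m.group(1):
            continue  # different path, or a line we can't parse
        total += 1
        if line.split(" ")[8] == "502":
            errors += 1
    return errors / total if total else 0.0
```

Run it over an hour of logs pulled from the ALB's S3 bucket; any rate meaningfully above zero on a path that should always succeed is the smoking gun.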

8:03 AM — I look at what the health check actually hits: GET /health. I look at what /health does: it returns {"status": "ok"} with a hardcoded 200. It doesn't check the database. It doesn't check Redis. It doesn't check the analytics engine. It's a static JSON response. The app could be on fire and /health would cheerfully report everything is fine.
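In code, the gap between what we had and what we needed is stark. A sketch with hypothetical names — `check_fns` maps a dependency name to a cheap probe that raises on failure:

```python
def static_health():
    # What our /health actually did: a hardcoded success,
    # blind to every real failure mode.
    return 200, {"status": "ok"}

def deep_health(check_fns):
    # What it should have done: probe each dependency and fail
    # the endpoint (503) if any probe raises.
    results = {}
    for name, probe in check_fns.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    all_ok = all(v == "ok" for v in results.values())
    return (200 if all_ok else 503), results
```

With a probe as cheap as `SELECT 1` against ClickHouse, the disk-full node would have flipped this endpoint to 503 before the sales VP ever pinged us.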

8:08 AM — The actual problem: our analytics query engine connects to a ClickHouse cluster. One of three ClickHouse nodes had a full disk. Queries that routed to that node failed. The app returned 500 on those requests, and the ALB surfaced them as 502s. But /health never touched ClickHouse, so the health check was completely blind to this failure.

8:15 AM — We SSHed into the ClickHouse node, cleared some old data, restarted the service. Errors stopped immediately.

8:20 AM — I realized we'd been lying to ourselves. Our health check was the operational equivalent of asking "are you alive?" and accepting "I have a pulse" as proof of good health. The patient was hemorrhaging, but hey, pulse is strong.

The Moment of Truth

The ALB was doing its job perfectly. It was checking health exactly where we told it to. We told it to check a useless endpoint. The lie wasn't in the load balancer — it was in our definition of "healthy."

The Aftermath

We redesigned health checks across all services that week. Every service got three endpoints: /livez (process is running — for restart decisions), /readyz (can serve traffic — checks dependencies), and /healthz (deep check for monitoring). The ALB health check was pointed at /readyz. We also added Datadog synthetic checks that hit actual API endpoints every 30 seconds from multiple regions. The ClickHouse disk issue led to adding disk usage alerts at 70%, 80%, and 90% thresholds on every data node.
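The tiered disk alerts are simple to reproduce. A sketch using the 70/80/90% thresholds above; `shutil.disk_usage` is the stdlib way to read the numbers, and the ClickHouse data path is a hypothetical default:

```python
import shutil

# Map a disk-usage fraction onto the alert tiers we configured.
def alert_level(used_fraction):
    if used_fraction >= 0.90:
        return "critical"   # page someone now
    if used_fraction >= 0.80:
        return "warning"    # file a ticket, act this week
    if used_fraction >= 0.70:
        return "notice"     # watch the trend
    return None

def check_data_volume(path="/var/lib/clickhouse"):
    # shutil.disk_usage returns (total, used, free) in bytes.
    usage = shutil.disk_usage(path)
    return alert_level(usage.used / usage.total)
```

In practice this lives in Datadog as a metric monitor rather than a cron script, but the thresholds and the escalation shape are the same.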

The Lessons

  1. Health checks should test real functionality: A health endpoint that returns static 200 is a lie. It must verify the service can actually do its job — database connections, cache availability, downstream dependencies.
  2. Distinguish liveness from readiness: Liveness means "should I restart this?" Readiness means "should I send traffic here?" They are different questions requiring different checks.
  3. Monitor from the user's perspective: If your internal checks say "green" but users see errors, your checks are wrong. Synthetic monitoring from external locations catches what internal checks miss.
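The liveness/readiness split in lesson 2 can be made concrete. A sketch under assumed names, where each probe raises on failure:

```python
def livez():
    # Liveness answers "is the process wedged?" If this handler runs
    # at all, the process is alive. Deliberately no dependency probes:
    # restarting the app will not fix a full ClickHouse disk.
    return 200

def readyz(probes):
    # Readiness answers "should this instance get traffic?" The load
    # balancer points here; a failing dependency pulls the instance
    # out of rotation without triggering restarts.
    for probe in probes:
        try:
            probe()
        except Exception:
            return 503
    return 200
```

Wiring dependency probes into liveness is the classic mistake in the other direction: one shared dependency goes down and the orchestrator restart-loops every healthy instance.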

What I'd Do Differently

I'd implement a health check review as part of every service's production readiness checklist. Before any service goes to production, someone has to explain what the health endpoint actually tests and what failure modes it would miss. I'd also run a quarterly "health check chaos" exercise where we break one dependency at a time and verify the health check correctly reports unhealthy.
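That chaos exercise is scriptable. A sketch with hypothetical plumbing: `readyz_for` builds your service's readiness check from a dict of named dependency probes, and we break one probe at a time:

```python
def chaos_exercise(readyz_for, dependencies):
    # For each dependency, simulate its failure and confirm the
    # readiness check goes unhealthy (503). Returns the dependencies
    # whose failure the check missed -- the goal is an empty list.
    missed = []
    for broken in dependencies:
        probes = {}
        for name in dependencies:
            if name == broken:
                def probe(n=name):
                    raise RuntimeError(f"{n} is down")
            else:
                def probe():
                    pass
            probes[name] = probe
        if readyz_for(probes) != 503:
            missed.append(broken)
    return missed
```

A static `/health`-style check fails this exercise for every dependency, which is exactly the point: the script turns "our health check is honest" from an assumption into a regression test.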

The Quote

"The load balancer didn't lie to us. We lied to the load balancer."
