The Cascading Timeout¶
Category: The Incident · Domains: microservices, networking · Read time: ~5 min
Setting the Scene¶
I was an SRE at an online food delivery platform — about 400 engineers, 12 million orders per month, running roughly 45 microservices on Kubernetes behind an Istio service mesh. Peak lunch hour was our Super Bowl: 800 orders per minute, every service talking to four or five other services, latency budgets measured in milliseconds. We had monitoring, we had alerting, we had distributed tracing. What we didn't have was circuit breakers.
What Happened¶
Wednesday 11:52 AM — Eight minutes before peak lunch. A routine query in the menu-service starts running slow. The database (Aurora MySQL) has a query plan regression on a menu item lookup — the optimizer picks a full table scan instead of an index lookup after InnoDB automatically recalculates the table's statistics. Query time goes from 3ms to 2.8 seconds.
11:53 AM — The menu-service has a 5-second timeout on its HTTP handler. The slow query doesn't breach it, so the service doesn't error — it just responds slowly. Every call to GET /api/menu/{restaurant_id} now takes ~3 seconds instead of ~50ms.
11:54 AM — The order-service calls menu-service to validate menu items during checkout. It has a 10-second timeout. Each order validation now takes 3+ seconds instead of 100ms. The order-service's request handlers start piling up. Goroutine count goes from 200 to 8,000 in two minutes.
11:55 AM — The order-service runs out of available connections. New requests start queuing. The api-gateway calls order-service with a 15-second timeout. Gateway requests start queuing too. The user-facing app shows spinning loaders.
11:56 AM — The delivery-dispatch service also calls order-service for status updates. Its 30-second timeout means those connections hang for the full 30 seconds before timing out. The dispatch service's connection pool is exhausted. Active drivers stop receiving new delivery assignments.
11:57 AM — Complete platform outage. No orders can be placed, no deliveries can be dispatched, no menus can be loaded. All 45 services are affected because the timeout cascade has consumed all connection pools, goroutine pools, and thread pools across the mesh.
12:02 PM — We identify menu-service as the root cause through distributed tracing (Jaeger). One trace shows the call chain: gateway -> order -> menu -> Aurora, with menu taking 2.8 seconds. I open a MySQL session against the Aurora writer endpoint and run SHOW PROCESSLIST — 400 concurrent queries on the menu_items table, all doing full table scans.
12:05 PM — I run ANALYZE TABLE menu_items to refresh the table statistics. Aurora's optimizer picks up the correct index immediately. Query times drop back to 3ms. But the platform doesn't recover — all the connection pools are still exhausted with stuck requests.
12:08 PM — We do a rolling restart of the three most-affected services: menu-service, order-service, and api-gateway. Connection pools reset. Traffic starts flowing. Full recovery by 12:12 PM. Total outage: approximately 15 minutes during peak lunch.
12:15 PM — Our business team estimates 12,000 lost orders. At our average order value, that's about $180,000 in GMV. From one slow database query.
The Moment of Truth¶
One service got slow and every other service waited politely for it to respond, consuming resources the whole time. Nobody said "this is taking too long, I'll serve a degraded response instead." We'd built a system where every service trusted every other service to be fast, and when that trust was violated, the entire platform collapsed. Circuit breakers would have contained the blast radius to menu-service and order validation, while delivery dispatch and everything else kept working.
The Aftermath¶
We spent the next month implementing circuit breakers across the service mesh using Istio's outlier detection and application-level circuit breakers (using gobreaker in Go services). We set aggressive timeouts: 500ms for internal service calls, 2 seconds for database queries. Every service got a degraded-response mode: order-service could accept orders with deferred menu validation, delivery-dispatch could operate from its local cache for 5 minutes. We also added connection pool monitoring with alerts at 70% utilization. The database got query plan monitoring through Performance Insights, with alerts on plan regressions.
The Lessons¶
- Circuit breakers are mandatory in microservices: Without circuit breakers, a slow service becomes a down platform. Implement them at both the service mesh level (Istio outlier detection) and the application level. This is not optional.
- Set aggressive timeouts: A 30-second timeout on an internal service call is not a timeout — it's a slow death. Internal service calls should time out in hundreds of milliseconds, not tens of seconds.
- Graceful degradation over total failure: Every service should have a degraded mode: cached responses, partial results, deferred processing. Serving a slightly stale menu is infinitely better than serving nothing.
What I'd Do Differently¶
I'd implement the "bulkhead pattern" from day one: isolate connection pools per downstream dependency so one slow service can't exhaust the pool used by other services. I'd also set up load shedding at the API gateway level — when response times exceed SLO, start returning 503s to new requests immediately rather than queuing them. Fast failure is better than slow failure. And I'd run regular "slow dependency" chaos tests where we artificially add latency to one service and verify the rest of the platform degrades gracefully.
The Quote¶
"Fifteen minutes of total outage, $180,000 in lost orders, forty-five services down — all from one bad query plan on a table with an index that was right there the whole time."
Cross-References¶
- Topic Packs: Distributed Systems, Istio, Database Ops, Load Balancing
- Case Studies: Kubernetes Ops