
When the Queue Backed Up

Category: The Incident · Domains: message-queues, capacity-planning · Read time: ~5 min


Setting the Scene

I was a senior SRE at an insurance claims processing company. About 800 employees, a core platform team of five. Our architecture was classic event-driven: claims came in through a web portal, got dropped onto a RabbitMQ cluster, and workers consumed them for validation, fraud scoring, adjuster assignment, and notification. On a normal day we processed about 50,000 messages. The queue depth hovered around 200 at most. We had monitoring on queue depth, but the alert threshold was set at 10,000. Nobody thought we'd ever hit it.

What Happened

Friday 5:00 PM — A partner insurance company starts a batch upload of 180,000 historical claims they'd been holding for a data reconciliation project. Nobody told us this was happening. The messages start flooding in.

Friday 5:45 PM — Queue depth passes 10,000. The alert fires. I'm already in weekend mode. I check Grafana from my phone, see the spike, and think "probably a small burst, it'll drain." I acknowledge the alert and go back to dinner.

Saturday 8:00 AM — Queue depth is at 140,000. The workers are consuming, but they're CPU-bound on the fraud scoring model and can only process about 800 messages/minute. Even if inflow stopped right now, draining that backlog would take almost three hours. And inflow hasn't stopped: the queue is still growing.
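The back-of-envelope math behind that estimate, using the numbers from the timeline (a rough sketch that ignores the ongoing inflow):

```python
# Drain-time estimate from the Saturday 8:00 AM state.
QUEUE_DEPTH = 140_000    # messages backed up
CONSUME_RATE = 800       # messages/minute across all workers

drain_minutes = QUEUE_DEPTH / CONSUME_RATE
print(f"drain time: {drain_minutes:.0f} min (~{drain_minutes / 60:.1f} hours)")

# For contrast, the inflow the system was sized for:
NORMAL_DAILY = 50_000
avg_inflow_per_min = NORMAL_DAILY / (24 * 60)
print(f"normal average inflow: {avg_inflow_per_min:.0f}/min")
```

On an average day the workers had over 20x headroom, which is exactly why nobody had ever worried about the queue depth alert.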

Saturday 10:00 AM — RabbitMQ's memory alarm triggers at 160,000 messages. It starts applying backpressure — blocking publishers. The web portal's publish calls start timing out. New claims from regular customers can't be submitted. The customer service line starts ringing.
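RabbitMQ specifics aside, the failure mode is generic: an unbounded producer hitting a bounded buffer. A stdlib sketch of the same dynamic, using Python's bounded `queue.Queue` as a stand-in for the broker (sizes are illustrative, not the real limits):

```python
import queue

broker = queue.Queue(maxsize=3)  # stand-in for the broker's memory-alarm limit

def publish(msg, timeout=0.05):
    """Publish with a timeout; surface backpressure instead of hanging forever."""
    try:
        broker.put(msg, timeout=timeout)
        return True
    except queue.Full:
        # This is the moment the web portal hit: the publish call fails,
        # and the caller must decide what to do (retry, shed load, return 503).
        return False

results = [publish(i) for i in range(5)]
print(results)  # first 3 accepted, last 2 rejected by backpressure
```

Our portal had no such decision point: its publish calls simply blocked until the client timed out, which is how the backpressure climbed into the connection pool.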

Saturday 11:00 AM — The publisher timeouts cascade into the web app. Connection pool exhaustion in the portal's RabbitMQ client. The web portal starts returning 503s. Now two systems are down because of a queue.

Saturday 11:30 AM — I'm frantically spinning up additional consumer instances on spare EC2 capacity. But the fraud scoring model requires a GPU instance, and our AWS account has a limit of 4 g4dn.xlarge instances. We're already using all four. I request a limit increase — estimated response: 24 hours.

Saturday 12:00 PM — We make the hard call: bypass fraud scoring for the batch claims by routing them to a simplified consumer that skips the GPU model. The batch claims get flagged for manual fraud review later. Queue starts draining at 3,000 messages/minute.

Saturday 2:30 PM — Queue depth back to normal. Web portal recovered at 12:15 PM once backpressure lifted.

The Moment of Truth

We'd never asked "what happens when queue depth exceeds our consumption capacity?" We had a queue depth alert, but no automated response. No rate limiting on publishers. No priority lanes. No capacity headroom for burst. The architecture assumed the happy path forever.

The Aftermath

We implemented publisher rate limiting with a token bucket. Priority queues separated real-time customer claims from batch uploads — batch could be throttled without affecting live traffic. We added autoscaling for consumers based on queue depth, with pre-provisioned warm instances. The alert threshold got tiered: warning at 5,000, critical at 15,000, with an automated runbook that spun up extra consumers at 10,000. We also set up a "big batch" intake process where partners had to schedule bulk uploads.
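The publisher-side limiter was a standard token bucket. A minimal sketch of the idea (the production implementation's parameters and integration points aren't reproduced here, so the numbers are illustrative):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, with a sustained rate of `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)
burst = [bucket.allow() for _ in range(6)]
print(burst)  # the burst capacity absorbs 5 publishes; the 6th is throttled
```

A rejected publish returns immediately with a clear signal, so the publisher can queue locally, back off, or report an error, instead of pushing the problem onto the broker.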

The Lessons

  1. Queue depth monitoring with teeth: An alert that fires and gets acknowledged isn't protection. Queue depth alerts need automated responses — autoscaling consumers, rate limiting publishers, or paging with urgency.
  2. Backpressure handling is an architecture decision: If your queue applies backpressure to publishers and those publishers are user-facing, you've coupled your user experience to your processing capacity. Design for it.
  2. Capacity plan for burst, not average: A system that gracefully handles 50K messages/day can be crushed by an unannounced 180K batch. Know your burst tolerance and enforce it at the boundary.
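The queue-depth autoscaling from the aftermath reduces to a simple control rule: run enough consumers to drain the current depth within a target window, capped by what you can actually provision. A sketch with parameters derived from the incident (800 msg/min across 4 workers gives roughly 200 msg/min each; the drain target is invented for illustration):

```python
import math

def desired_consumers(queue_depth: int, per_consumer_rate: float,
                      target_minutes: float, floor: int, ceiling: int) -> int:
    """Consumers needed to drain `queue_depth` within `target_minutes`.

    per_consumer_rate is messages/minute per consumer; floor and ceiling
    are the provisioning limits (e.g. the GPU-instance quota that bit us).
    """
    needed = math.ceil(queue_depth / (per_consumer_rate * target_minutes))
    return max(floor, min(needed, ceiling))

# 140K backlog, 200 msg/min per consumer, 30-minute drain target:
# you'd want 24 consumers. Capped at a quota of 4, you simply cannot
# drain in time, which is the real lesson of the GPU limit.
print(desired_consumers(140_000, 200, 30, floor=2, ceiling=4))
```

The ceiling term is the important one: an autoscaler that can't reach the needed count is an early warning, and alerting on "desired > ceiling" would have flagged this Friday evening.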

What I'd Do Differently

I'd implement priority queues from day one — live traffic and batch processing should never compete for the same consumer pool. I'd also set hard limits on publisher rates with clear error responses ("batch too large, use the bulk API") instead of silently accepting unbounded input and hoping the consumers keep up.
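That hard limit lives at the intake boundary. A hypothetical sketch (the cap, field names, and response shape are invented for illustration, not our actual API):

```python
MAX_SYNC_BATCH = 1_000  # hypothetical cap for the synchronous submit endpoint

def accept_claims(claims: list) -> dict:
    """Reject oversized batches loudly instead of silently queueing them."""
    if len(claims) > MAX_SYNC_BATCH:
        return {
            "status": 413,
            "error": f"batch of {len(claims)} exceeds {MAX_SYNC_BATCH}; "
                     "use the scheduled bulk-upload API",
        }
    # ...enqueue each claim onto the broker here...
    return {"status": 202, "accepted": len(claims)}

print(accept_claims([{}] * 5)["status"])        # 202: normal submission
print(accept_claims([{}] * 180_000)["status"])  # 413: pushed to the bulk path
```

The point isn't the specific limit; it's that the partner's 180K upload would have been turned away with an actionable error at 5:00 PM Friday instead of discovered at 10:00 AM Saturday.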

The Quote

"The queue didn't fail us. We designed a system that couldn't say 'slow down' until it was already drowning."
