When the Queue Backs Up
- lesson
- kafka
- rabbitmq
- message-queues
- consumer-lag
- backpressure
- dead-letter-queues
- exactly-once
- l2

Topics: Kafka, RabbitMQ, message queues, consumer lag, backpressure, dead letter queues, exactly-once
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic understanding of producer/consumer pattern helpful
The Mission¶
Your dashboard shows Kafka consumer lag climbing: 10,000... 50,000... 200,000 messages. The producers are fine. The consumers are running. But the gap between "messages produced" and "messages consumed" is growing every second.
Somewhere between production and consumption, something broke. And unlike HTTP where a slow response is visible immediately, a queue can hide problems for hours — silently accumulating a backlog that eventually cascades into data loss, duplicate processing, or complete service failure.
Why Queues Exist (And Why They're Dangerous)¶
Queues decouple producers from consumers. The producer writes a message and moves on. The consumer reads it whenever it's ready. This solves:
- Speed mismatch: Producer writes 10,000 msg/sec, consumer handles 1,000 msg/sec → queue absorbs the burst
- Availability: Consumer crashes → messages wait in the queue until it restarts
- Fan-out: One message → multiple consumers (pub/sub)
But queues also hide problems:
Without a queue:
Producer → Consumer (slow) → Producer times out → Error visible IMMEDIATELY
With a queue:
Producer → Queue (fast) → Consumer (slow)
Producer sees: "Success! Message sent!"
Consumer lag: 10k... 50k... 200k... (problem invisible for hours)
Eventually: Queue fills → messages dropped or producer blocked
Mental Model: A queue is a buffer between two systems that run at different speeds — like a reservoir between a river and a city's water system. It absorbs surges. But if the outflow is permanently slower than the inflow, the reservoir overflows. And unlike HTTP, nobody sees the water level rising until it's too late.
Diagnosing Consumer Lag¶
Kafka¶
# Check consumer lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group my-consumer-group --describe
# → GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# → my-group orders 0 45000 52000 7000
# → my-group orders 1 43000 51500 8500
# → my-group orders 2 44500 52200 7700
# TOTAL LAG: 23200
| Column | Meaning |
|---|---|
| CURRENT-OFFSET | Last message the consumer committed |
| LOG-END-OFFSET | Last message produced to the partition |
| LAG | The difference — messages waiting to be processed |
If lag is stable, the consumer is keeping up (a constant lag is leftover from a brief blip; no problem). If lag is growing, the consumer is slower than the producer. That's the problem.
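The lag arithmetic the CLI performs is simple enough to script yourself. A sketch of the pure computation — with kafka-python, the two inputs would come from `consumer.end_offsets(...)` and `consumer.committed(...)`; here the numbers from the CLI output above stand in:

```python
def compute_lag(end_offsets, committed_offsets):
    """LAG per partition = LOG-END-OFFSET minus committed CURRENT-OFFSET.

    Both arguments map partition -> offset. A partition the group has
    never committed for counts from 0.
    """
    return {p: end - (committed_offsets.get(p) or 0)
            for p, end in end_offsets.items()}

# Numbers from the kafka-consumer-groups.sh output above:
ends = {0: 52000, 1: 51500, 2: 52200}   # LOG-END-OFFSET
done = {0: 45000, 1: 43000, 2: 44500}   # CURRENT-OFFSET
lag = compute_lag(ends, done)
print(sum(lag.values()))  # → 23200
```

Run this on a schedule and alert on the *trend* of the total, not its absolute value — a flat 23,200 is a blip; a growing 23,200 is an incident.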
RabbitMQ¶
# Check queue depth
rabbitmqctl list_queues name messages consumers
# → orders 52000 3
# Via management API
curl -s http://admin:admin@localhost:15672/api/queues | \
jq '.[] | {name, messages, consumers}'
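The same management-API JSON can feed an alerting script. A minimal sketch — `queues_over_threshold` and the 10,000-message threshold are illustrative choices, but the `name`/`messages`/`consumers` fields are the same ones the jq query above extracts:

```python
import json

def queues_over_threshold(api_json, max_depth=10_000):
    """Return (name, depth) for queues whose backlog exceeds max_depth,
    given the JSON body the RabbitMQ management API serves at /api/queues."""
    return [(q["name"], q["messages"]) for q in json.loads(api_json)
            if q.get("messages", 0) > max_depth]

# Sample payload shaped like the curl output above:
sample = '[{"name": "orders", "messages": 52000, "consumers": 3}]'
print(queues_over_threshold(sample))  # → [('orders', 52000)]
```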
Root Cause 1: Consumer Is Slow¶
The most common cause. The consumer processes each message too slowly — usually because it makes a blocking call (database query, HTTP request, file write).
Production rate: 1,000 msg/sec
Consumer rate: 200 msg/sec (each message triggers a 5ms DB query)
Lag growth: 800 msg/sec → 48,000/min → 2.88M/hour
Fixes (in order of impact):

1. Scale consumers horizontally. In Kafka, add more consumers to the group (up to the number of partitions). In RabbitMQ, add more consumers to the queue.
2. Batch processing. Instead of one DB query per message, batch 100 messages into one bulk insert.
3. Async processing. If the consumer makes HTTP calls, use async I/O to process multiple messages concurrently.
4. Optimize the slow path. Profile the consumer code. Is the DB query missing an index? Is the HTTP call timing out?
Gotcha (Kafka): You can't have more consumers than partitions in a consumer group. If your topic has 3 partitions and you start 5 consumers, 2 will be idle. You need to increase partitions and consumers to scale. But increasing partitions is irreversible in Kafka — you can add but never remove them.
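The batching fix can be sketched as a small generator that groups messages into chunks, so each chunk becomes one bulk insert instead of N single-row inserts. `drain_in_batches` is an illustrative helper, not part of any client library; with kafka-python it would wrap the records returned by `consumer.poll(...)`:

```python
def drain_in_batches(messages, batch_size=100):
    """Group an iterable of messages into lists of up to batch_size,
    so each batch can become a single bulk DB insert."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# A plain range stands in for the message stream in this dry run:
batches = list(drain_in_batches(range(250), batch_size=100))
print([len(b) for b in batches])  # → [100, 100, 50]
```

One bulk insert of 100 rows typically costs far less than 100 round-trips, which is why batching often beats adding consumers as the first fix for a DB-bound consumer.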
Root Cause 2: Consumer Is Stuck¶
The consumer process is running (healthy, passing health checks) but not making progress. Common causes:
- Poison message: A malformed message that causes the consumer to throw an exception, fail processing, and retry the same message forever.
- Rebalancing storm (Kafka): Consumer group repeatedly rebalances, and during rebalancing, no messages are processed.
- Lock contention: Consumer is waiting on a database lock or external resource.
The poison message pattern¶
Message 1: ✓ Processed
Message 2: ✓ Processed
Message 3: ✗ Exception → retry → ✗ Exception → retry → ✗ ...
Message 4: Waiting behind message 3
Message 5: Waiting
(Entire partition blocked by one bad message)
Fix: Dead Letter Queue (DLQ)
After N failed attempts, move the message to a separate "dead letter" queue for human inspection. The consumer moves on.
# Pseudocode
MAX_RETRIES = 3

def process_message(msg):
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            handle(msg)
            return  # Success
        except Exception as e:
            last_error = e  # Python 3 unbinds `e` after the except block
            log.warning(f"Attempt {attempt + 1} failed: {e}")
    # All retries exhausted — send to DLQ
    send_to_dlq(msg, error=str(last_error))
    log.error(f"Message sent to DLQ after {MAX_RETRIES} attempts: {msg.key}")
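One possible shape for `send_to_dlq`: republish the failed message to a dedicated topic, carrying the error in a header so whoever triages the DLQ sees why processing failed. The `orders.dlq` topic name is a convention, not a Kafka feature; the stub below mimics the `(topic, value=, headers=)` call shape of kafka-python's `KafkaProducer.send` so the routing logic can be exercised without a broker:

```python
DLQ_TOPIC = "orders.dlq"  # hypothetical name; "<topic>.dlq" is a common convention

def send_to_dlq(producer, msg, error):
    """Republish a failed message to the DLQ with the error attached as a header."""
    producer.send(DLQ_TOPIC,
                  value=msg["value"],
                  headers=[("x-original-topic", msg["topic"].encode()),
                           ("x-error", error.encode())])

class StubProducer:
    """Stand-in for kafka-python's KafkaProducer in this dry run."""
    def __init__(self):
        self.sent = []
    def send(self, topic, value=None, headers=None):
        self.sent.append((topic, value, headers))

p = StubProducer()
send_to_dlq(p, {"topic": "orders", "value": b"{bad json"}, error="JSONDecodeError")
print(p.sent[0][0])  # → orders.dlq
```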
Root Cause 3: Queue Disk Full¶
Kafka stores messages on disk for a configurable retention period. If producers write faster than old messages expire, the disk fills:
# Check Kafka disk usage
du -sh /var/lib/kafka/data/
# Check retention settings
kafka-configs.sh --describe --topic orders --bootstrap-server localhost:9092
# → retention.ms=604800000 (7 days)
# → retention.bytes=-1 (unlimited)
War Story: An unbounded Kafka topic retained all messages (default retention.bytes=-1). Producers wrote 50GB/day of event data. Nobody noticed because lag was zero — consumers were keeping up. After 60 days: 3TB of retained messages. Disk at 95%. Kafka brokers started rejecting new messages. Fix: set retention.bytes or retention.ms based on actual consumption needs, not "keep everything."
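Sizing retention is back-of-the-envelope arithmetic. A small helper — the 1.5x safety factor is an assumption for headroom, not a Kafka default; feed the result into the topic's retention.bytes:

```python
def retention_bytes(write_rate_gb_per_day, days, safety_factor=1.5):
    """Disk a topic needs to retain `days` of data at a given write rate,
    with headroom for bursts. Returns bytes."""
    return int(write_rate_gb_per_day * days * safety_factor * 1024**3)

# The topic from the war story: 50 GB/day, keeping 7 days instead of forever.
# Roughly 525 GB with headroom, versus 3 TB after 60 days unbounded.
print(retention_bytes(50, 7))
```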
Root Cause 4: Offset Management Gone Wrong¶
In Kafka, the consumer tracks its position in each partition using an offset — an integer that says "I've processed up to message N." If offset management goes wrong:
- Auto-commit loses messages: Consumer reads 100 messages, auto-commits the offset. Consumer crashes before processing them all. Messages 50–100 are lost (committed but not processed).
- Offset reset to earliest: Consumer restarts with auto.offset.reset=earliest. It re-reads the entire topic history — millions of messages. Processing time: hours. Duplicate processing of everything.
# Kafka consumer best practice: commit AFTER processing
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,   # Manual commit only
    auto_offset_reset='latest'
)
for msg in consumer:
    process(msg)        # Process first
    consumer.commit()   # Then commit offset
The Delivery Guarantees¶
| Guarantee | Meaning | Risk |
|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Can lose messages |
| At-least-once | Message delivered 1+ times | Can duplicate messages |
| Exactly-once | Message delivered exactly 1 time | Requires careful design |
Most systems use at-least-once (the safest default) and make consumers idempotent — processing the same message twice produces the same result. This is much simpler than achieving exactly-once.
# Idempotent consumer: use message ID to deduplicate
def process_order(msg):
    order_id = msg.value['order_id']
    if db.exists(f"processed:{order_id}"):
        log.info(f"Already processed {order_id}, skipping")
        return
    # Process the order
    db.insert_order(msg.value)
    db.mark_processed(order_id)
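Note that a check-then-insert like the one above has a race: two consumers handling the same redelivered message can both pass the `exists` check before either marks it processed. A unique constraint makes the dedup atomic in a single statement. A sketch with stdlib sqlite3 standing in for the real database (in Redis, SET with NX plays the same role):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, payload TEXT)")

def process_order(order):
    """Insert and dedup atomically: the PRIMARY KEY rejects a second insert
    of the same order_id, so a redelivered message becomes a no-op."""
    cur = db.execute(
        "INSERT OR IGNORE INTO orders (order_id, payload) VALUES (?, ?)",
        (order["order_id"], order["payload"]))
    return cur.rowcount == 1  # True = first delivery, False = duplicate

print(process_order({"order_id": "o-1", "payload": "{...}"}))  # → True
print(process_order({"order_id": "o-1", "payload": "{...}"}))  # → False
```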
Flashcard Check¶
Q1: Kafka consumer lag is 50,000 and growing. What does this mean?
Consumers are processing messages slower than producers are creating them. The gap will keep growing until you add consumers, speed up processing, or reduce production.
Q2: Why can't you have more Kafka consumers than partitions?
Each partition is assigned to exactly one consumer in a group. Extra consumers sit idle. To scale, increase both partitions and consumers. Partitions can never be decreased.
Q3: What is a dead letter queue?
A separate queue for messages that fail processing after N retries. The consumer moves on instead of blocking on a poison message. Humans review the DLQ.
Q4: At-most-once vs at-least-once — what's the default for most systems?
At-least-once. The consumer might process a message twice (crash after processing but before committing offset). Make consumers idempotent to handle duplicates.
Q5: Auto-commit with Kafka — what's the risk?
Consumer reads messages, auto-commits offset, then crashes before processing them all. Those messages are lost (offset says "done" but processing didn't happen).
Cheat Sheet¶
Kafka Quick Checks¶
| Task | Command |
|---|---|
| Consumer lag | kafka-consumer-groups.sh --describe --group GROUP |
| Topic retention | kafka-configs.sh --describe --topic TOPIC |
| Topic disk usage | du -sh /var/lib/kafka/data/TOPIC-* |
| List consumer groups | kafka-consumer-groups.sh --list |
| Reset offsets (careful!) | kafka-consumer-groups.sh --reset-offsets --to-latest |
RabbitMQ Quick Checks¶
| Task | Command |
|---|---|
| Queue depth | rabbitmqctl list_queues name messages consumers |
| Node health | rabbitmqctl status |
| Connections | rabbitmqctl list_connections |
| Management UI | http://localhost:15672 (guest/guest) |
Takeaways¶
- Queues hide problems. HTTP fails visibly. Queues fail silently — lag grows for hours before anyone notices. Monitor consumer lag, not just producer success.
- Scale consumers to match producers. In Kafka, consumers ≤ partitions. Plan partition count for your peak throughput, not your current average.
- Dead letter queues prevent poison messages. One bad message blocking an entire partition is worse than skipping it and investigating later.
- Commit after processing, not before. Auto-commit risks losing messages. Manual commit after processing risks duplicates. Choose duplicates (at-least-once) and make consumers idempotent.
- Set retention limits. Unbounded retention fills disks. Calculate how much retention you actually need and set retention.ms or retention.bytes.
Related Lessons¶
- The Cascading Timeout — when queue backpressure isn't implemented
- The Disk That Filled Up — when Kafka retention fills the disk
- Out of Memory — when consumers buffer too many messages in memory