
Portal | Level: L2 | Domain: DevOps

Message Queues Footguns

Common ways engineers shoot themselves in the foot with message queues. Each entry includes what goes wrong, why it is non-obvious, and how to fix it.


1. No Dead-Letter Queue Configured

What happens: A poison message (malformed payload, triggers a bug, or references a deleted resource) causes the consumer to throw an exception. With no DLQ, you have two bad options: nack with requeue (message loops forever, blocking the queue) or nack without requeue (message is silently dropped and lost).

Why it is non-obvious: Everything works fine in testing because test messages are well-formed. The first malformed message in production reveals the gap, usually under pressure.

Consequences:
- With requeue: the poison message cycles endlessly. Other messages pile up behind it. Consumer CPU spikes. Alert fires. It looks like a consumer bug, not a data bug.
- Without requeue (drop): data is silently lost. You find out hours later when the database is in an inconsistent state.

Fix:
- Configure a DLQ before going to production. Zero exceptions.
- RabbitMQ: declare queues with x-dead-letter-exchange (optionally combined with x-message-ttl). There is no built-in retry limit, so count delivery attempts via the x-death header before giving up on a message.
- Kafka: implement DLQ logic in the consumer: catch exceptions and produce to topic.dlq before committing the offset.
- Alert on DLQ depth > 0. A non-empty DLQ is always an event requiring investigation.
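The consumer-side DLQ pattern can be sketched broker-agnostically. This is a minimal illustration, not a client library: `process`, `publish_to_dlq`, and `commit` are hypothetical stand-ins for your handler and real client calls.

```python
def handle_with_dlq(message, process, publish_to_dlq, commit, max_attempts=3):
    # Try the business handler; after repeated failure, route the message
    # to the DLQ instead of looping forever or silently dropping it.
    for attempt in range(1, max_attempts + 1):
        try:
            process(message)
            break
        except Exception as exc:
            if attempt == max_attempts:
                # Attach failure context so the DLQ entry is debuggable.
                publish_to_dlq({"payload": message,
                                "error": repr(exc),
                                "attempts": attempt})
                break
    # Commit only after the message has a final home (processed or DLQ'd),
    # so a crash before this point causes redelivery, never loss.
    commit(message)
```

The key invariant is the commit position: acknowledging before the message is either processed or parked in the DLQ reintroduces the silent-drop failure mode.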


2. Non-Idempotent Consumers with At-Least-Once Delivery

What happens: The consumer processes a message, triggers a payment charge or sends an email, then crashes before committing the offset or sending the ack. The broker redelivers. The payment is charged twice. The customer gets two confirmation emails.

Why it is non-obvious: The duplicate only happens on crash paths, which are rare in development. At-least-once delivery is the safe default — it is the consumer's responsibility to handle duplicates, but that contract is easy to miss.

Consequences: duplicate charges, duplicate records, duplicate notifications — all visible to end users and very hard to unwind after the fact.

Fix:
- Treat idempotency as a first-class design requirement for every consumer, not an afterthought.
- Deduplication strategies:
  - Store a processed_message_id in the same DB transaction as the business operation.
  - Use conditional updates: UPDATE orders SET status='paid' WHERE id=$1 AND status='pending' is safe to run twice.
  - Redis SET NX EX for fast, short-lived deduplication windows.
- Assign stable message IDs at the producer and propagate them through retries.
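The first dedup strategy can be sketched with SQLite standing in for the real database; the table names and `process_once` helper are illustrative, not a prescribed schema.

```python
import sqlite3

def process_once(conn, message_id, apply_business_op):
    # Record the message ID and the business change in ONE transaction;
    # a redelivered message violates the PRIMARY KEY and is skipped.
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,))
            apply_business_op(conn)
        return True   # first delivery: processed
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: no-op

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending')")

mark_paid = lambda c: c.execute(
    "UPDATE orders SET status='paid' WHERE id=1 AND status='pending'")
process_once(conn, "msg-42", mark_paid)   # first delivery: applied
process_once(conn, "msg-42", mark_paid)   # crash redelivery: skipped
```

Because the dedup insert and the business update share one transaction, a crash between them rolls both back and the redelivery starts clean.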


3. Unbounded Queue Growth (No Backpressure)

What happens: Producers publish faster than consumers can process. The queue grows indefinitely. Eventually the broker runs out of disk (Kafka) or memory (RabbitMQ), crashes, or triggers flow control that blocks all producers including healthy ones.

Why it is non-obvious: During development, producers and consumers run in balance. High-traffic production load or a slow consumer (due to a downstream DB issue) reveals the imbalance hours into an incident.

Consequences:
- RabbitMQ memory alarm fires: all producers blocked, entire system stalls.
- Kafka broker fills its disk: the log directory goes offline and producers start failing with errors and timeouts.
- Even before exhaustion: hours of backlog means hours of latency spike while consumers catch up.

Fix:
- Set queue length limits on RabbitMQ queues (x-max-length, x-max-length-bytes) with an overflow policy (reject-publish or drop-head).
- Set Kafka topic retention by time and size: retention.ms, retention.bytes.
- Set producer max.block.ms to fail fast rather than block indefinitely when the broker is full.
- Monitor queue depth and consumer lag as primary SLIs. Alert early (e.g., lag > 5 minutes of expected throughput).
- Design consumers to scale horizontally in response to lag (autoscaler driven by queue metrics).
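The fail-fast idea behind bounded queues and max.block.ms can be sketched with an in-process bounded buffer. This is illustrative only, not a broker client; `BoundedPublisher` is a hypothetical name.

```python
import queue

class BoundedPublisher:
    # A bounded buffer: when full, publish fails immediately instead of
    # blocking the producer or growing without limit.
    def __init__(self, max_depth=1000):
        self._q = queue.Queue(maxsize=max_depth)

    def publish(self, message):
        try:
            self._q.put_nowait(message)
            return True
        except queue.Full:
            # Surface backpressure to the caller (shed load, retry with
            # backoff, or trip a circuit breaker) instead of stalling.
            return False

pub = BoundedPublisher(max_depth=2)
pub.publish("a")
pub.publish("b")
accepted = pub.publish("c")   # queue full: rejected, not blocked
```

The point is where the failure surfaces: a rejected publish is an explicit, handleable signal, while an unbounded queue defers the failure to the broker at the worst possible time.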


4. Consumer Group Rebalance Storms

What happens: A consumer takes too long to process a batch and exceeds max.poll.interval.ms. Kafka considers it dead and triggers a rebalance. During rebalance, all consumers in the group pause. The slow consumer catches up and rejoins, triggering another rebalance. The cycle repeats every few minutes.

Why it is non-obvious: Rebalances are a normal Kafka event (expected on startup, scale events). A rebalance storm looks superficially like "Kafka is slow" until you check the coordinator logs and see continuous group churn.

Consequences: consumer lag grows continuously because consumers spend more time paused in rebalance than actually consuming. SLA breach without an obvious single point of failure.

Fix:

# Increase poll interval to exceed worst-case batch processing time
max.poll.interval.ms=600000      # 10 minutes (default is 5); size to your slowest batch

# Reduce records per poll so processing stays under the interval
max.poll.records=50

# Use cooperative rebalancing to avoid full stop-the-world
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Static membership — don't trigger rebalance on expected restarts
group.instance.id=consumer-host-1


5. Producing Without Acks (Fire-and-Forget in Kafka)

What happens: Producer is configured with acks=0 (or acks=1) to maximize throughput. A broker leader fails between the write and replication. The producer receives no error. The message is silently lost.

Why it is non-obvious: acks=0 delivers the highest throughput and lowest latency in benchmarks. It works perfectly until it does not. Message loss during broker failure is silent — there is no error returned to the application.

Consequences: undetected data loss. Order events, payment records, audit log entries vanish permanently with no error in the producer logs.

Fix:

# Always use acks=all for important topics
acks=all

# Combined with idempotent producer to prevent duplicates during retries
enable.idempotence=true

# Retry effectively forever; delivery.timeout.ms provides the real bound
retries=2147483647
delivery.timeout.ms=120000

Reserve acks=0 only for genuinely disposable data like metrics samples or non-critical telemetry where occasional loss is explicitly acceptable and documented.


6. Single Partition Bottleneck

What happens: Topic is created with a single partition (Kafka default) or all messages are routed to one partition via a constant partition key (e.g., null key with custom partitioner, or all messages keyed on a single value). Throughput is capped at one consumer and one broker's I/O capacity.

Why it is non-obvious: Single-partition topics work fine at low volume. Scaling to high throughput reveals the bottleneck — and adding more consumers does nothing because only one consumer can be assigned to a partition at a time.

Consequences: a 12-broker, 36-consumer deployment runs at the speed of a single partition on a single broker. Horizontal scaling buys nothing.

Fix:
- Plan partition count before creating topics. Rule of thumb: start with (target_throughput_MBps / 10) * 2 partitions, rounded up to the next power of two.
- Do not use null/constant keys if you need horizontal scaling. Choose a high-cardinality key.
- Changing partition count after the fact triggers a rebalance and breaks per-key ordering for existing messages.
- For topics that genuinely require strict global ordering: a single partition is correct, but document the throughput ceiling and plan accordingly.
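The rule of thumb above works out to a small calculation. The 10 MB/s per-partition figure is an assumption baked into the rule; measure your own per-partition throughput before relying on it.

```python
import math

def suggested_partitions(target_throughput_mbps, per_partition_mbps=10):
    # Rule of thumb from the text: (throughput / 10) * 2, rounded up to
    # the next power of two. per_partition_mbps=10 is an assumption.
    raw = max(1, math.ceil(target_throughput_mbps / per_partition_mbps * 2))
    return 1 << (raw - 1).bit_length()   # next power of two >= raw

# e.g. a 100 MB/s target gives (100/10)*2 = 20, rounded up to 32 partitions
suggested_partitions(100)
```

Rounding to a power of two keeps key-to-partition distribution even when you later split consumer groups, at the cost of some over-provisioning.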


7. Not Monitoring Consumer Lag

What happens: The queue is accumulating a backlog but no alert fires. An hour into an incident, someone checks manually and discovers 500,000 unprocessed messages. The consumer silently died 45 minutes ago.

Why it is non-obvious: the application "works" — producers still succeed, broker is healthy. Only the consumer is stuck. Without active lag monitoring, this failure mode is invisible until users complain or a downstream system is noticeably stale.

Consequences: stale data delivered to users, SLA breach, and a long recovery period draining the backlog after the consumer is fixed.

Fix:
- Add consumer lag to your primary alerting. Treat growing lag as an incident.
- Prometheus + Kafka Exporter: alert on kafka_consumer_group_lag > threshold for any group for more than N minutes.
- RabbitMQ: alert on rabbitmq_queue_messages_ready > threshold AND rabbitmq_queue_messages_unacknowledged > threshold.
- Set two alert thresholds: warning (lag growing for 5 min) and critical (lag exceeds 10 minutes of expected throughput).
- Use Burrow for Kafka lag trend analysis: it distinguishes a normal processing pause from a true stall.
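The "lag in minutes of expected throughput" thresholding can be sketched as a small evaluation function; the thresholds and function name here are illustrative, not a standard API.

```python
def lag_alert(lag_messages, expected_msgs_per_sec, warn_s=300, crit_s=600):
    # Convert message-count lag into time lag, matching the rule above:
    # warning at 5 minutes of backlog, critical at 10.
    lag_s = lag_messages / expected_msgs_per_sec
    if lag_s >= crit_s:
        return "critical"
    if lag_s >= warn_s:
        return "warning"
    return "ok"

# 500,000 messages behind at 100 msg/s is ~83 minutes of backlog
lag_alert(500_000, 100)
```

Expressing lag in time rather than raw message count makes one threshold meaningful across topics with very different traffic rates.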


8. Message Ordering Assumptions Across Partitions

What happens: Engineer designs a system assuming all messages for a given entity arrive in order. Works perfectly in single-partition testing. In production with 12 partitions, messages for the same user occasionally arrive out of order, causing state machine corruption: account.closed processed before account.created.

Why it is non-obvious: Kafka guarantees ordering within a partition, not across partitions. If you use a partition key, all messages for a key land on the same partition — but only if the partition count does not change. Adding partitions re-routes some keys to new partitions, breaking ordering for in-flight messages during the transition.

Consequences: corrupted derived state, ordering-sensitive workflows failing in subtle non-deterministic ways that are very hard to reproduce.

Fix:
- Use the entity ID (user ID, order ID, account ID) as the partition key for all topics where ordering matters.
- Do NOT change partition count on ordering-sensitive topics without a coordinated migration plan.
- Add event sequence numbers to messages and validate ordering at the consumer. Reject or re-queue out-of-order messages.
- For workflows requiring strict ordering across multiple event types: use a single partition for the entity (accept the throughput limitation) or use an event-sourcing framework that enforces ordering.
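Consumer-side sequence validation can be sketched as follows; the in-memory dict and the `check_sequence` helper are illustrative (a real consumer would persist the high-water marks).

```python
def check_sequence(last_seen, entity_id, seq):
    # Accept a message only if its per-entity sequence number is the next
    # expected one; otherwise tell the caller how to handle it.
    expected = last_seen.get(entity_id, 0) + 1
    if seq == expected:
        last_seen[entity_id] = seq
        return "process"
    if seq <= last_seen.get(entity_id, 0):
        return "duplicate"      # already handled: safe to ack and skip
    return "out_of_order"       # gap: re-queue or park until the gap fills

last = {}
check_sequence(last, "user-1", 1)   # in order: process
check_sequence(last, "user-1", 3)   # seq 2 missing: out_of_order
```

Distinguishing duplicates from gaps matters: duplicates are safe to drop, while a gap means a predecessor is still in flight and the message should be retried later.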


9. TTL/Expiry Misconfiguration Dropping Important Messages

What happens: Queue or topic is configured with a short TTL for hygiene reasons, but messages pile up due to a consumer outage. When the consumer recovers, the backlog has expired. Important order events, user actions, or audit records are permanently gone.

Why it is non-obvious: The TTL was added to prevent unbounded queue growth — a reasonable concern. The conflict between "don't keep stale data" and "don't lose important data" is only visible during extended outages.

Consequences: permanent data loss that may not be discovered until reconciliation surfaces inconsistencies days later.

Fix:
- Classify messages before setting TTL. Distinguish ephemeral (heartbeats, metrics aggregates) from durable (business events, user actions, audit records).
- For durable events: set TTL only as a last resort and make it very long (7+ days). Prefer x-max-length with reject-publish overflow, rejecting new messages rather than dropping old ones.
- For Kafka: set retention.ms based on the consumer group recovery SLA, not the storage budget. If consumers can be down for 48 hours, retention must exceed 48 hours.
- Monitor topic retention vs. consumer group lag. Alert when max(consumer_group_lag_time) > 0.7 * retention_ms.
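The retention-vs-lag check in the last point is a one-line comparison; the function name and 0.7 factor follow the rule stated above.

```python
def retention_at_risk(max_lag_ms, retention_ms, threshold=0.7):
    # Alert when the oldest unconsumed message's age exceeds 70% of
    # retention: messages are approaching expiry before being consumed.
    return max_lag_ms > threshold * retention_ms

# 40h of lag against 48h retention (40/48 ~ 0.83) should fire
retention_at_risk(40 * 3_600_000, 48 * 3_600_000)
```

Firing at 70% rather than 100% leaves a window to scale consumers or extend retention before anything actually expires.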


10. Blocking Consumer Threads (Sync Processing in Async Pipeline)

What happens: Consumer thread blocks on a synchronous external call — a slow database query, an HTTP request to a third-party API, a filesystem operation. With prefetch_count=1 and a thread pool of 4, four slow calls saturate the consumer. No messages are processed while all threads wait. Consumer lag grows. RabbitMQ eventually detects inactivity and kills the connection. All unacked messages return to the queue.

Why it is non-obvious: The consumer "works" at low volume. Only sustained load or a degraded downstream service reveals that synchronous blocking in the consumer thread kills throughput.

Consequences: throughput collapses under load; consumer appears healthy (no errors) but processes almost nothing; intermittent connection drops that look like network issues.

Fix:
- Make the consumer thread non-blocking: use async I/O (asyncio in Python, CompletableFuture in Java) or offload work to a thread pool separate from the consumer poll loop.
- Set timeouts on all external calls within consumers. A call with no timeout inherits infinite blocking.
- Use circuit breakers on downstream dependencies. A circuit-open exception is handled fast; a timed-out call blocks.
- For RabbitMQ: tune the heartbeat to match your expected worst-case blocking time. If a call can block for 60s, the heartbeat timeout must exceed 60s or the connection drops.
- Profile consumer throughput under load before production. A consumer that handles 10 msg/s in isolation may handle only 1 msg/s when the downstream DB is under load.
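The timeout rule can be sketched with asyncio. This is a minimal illustration: `handle_message` and `flaky_api` are hypothetical names, and a real consumer would wire this into its poll loop.

```python
import asyncio

async def handle_message(message, downstream, timeout_s=2.0):
    # Wrap every external call in a timeout: a call without one inherits
    # infinite blocking and can stall the whole consumer under load.
    try:
        return await asyncio.wait_for(downstream(message), timeout_s)
    except asyncio.TimeoutError:
        return None   # count it, retry with backoff, or route to a DLQ

async def flaky_api(message):
    # Stand-in for a real HTTP or DB call (hypothetical).
    await asyncio.sleep(0.01)
    return f"ok:{message}"

result = asyncio.run(handle_message("m1", flaky_api))
```

A timed-out call returns control to the consumer in bounded time, so one degraded dependency slows a fraction of messages instead of freezing every worker thread.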



Primer: Message Queues Primer

Street Ops: Message Queues Street Ops

Flashcard Decks: message-queues (30 cards)