
Portal | Level: L2: Operations | Topics: Message Queues | Domain: DevOps & Tooling

Message Queues — Primer

Why Message Queues Matter

Modern distributed systems need to communicate between services without tight coupling. Message queues solve three fundamental problems:

Decoupling: Producers and consumers are independent. The order service does not need to know whether the inventory service, the email service, or the analytics pipeline is up. It posts a message and moves on.

Async processing: Long-running work (image resizing, report generation, fraud detection) is handed off to a queue. The HTTP handler returns immediately; a worker processes asynchronously.

Load leveling: A sudden spike of 10,000 orders per second hits the queue rather than the payment processor. Workers drain the queue at a sustainable rate, protecting downstream services from overload.

Name origin: The term "message queue" dates to the 1960s mainframe era. IBM's MQSeries (now IBM MQ, first released 1993) popularized the concept for distributed systems. The "queue" is literal -- FIFO (first in, first out), just like a queue of people waiting in line.

Without queuing, you are left with synchronous call chains that couple failure domains, block threads, and shed load during spikes by returning 5xx errors.
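The handoff can be sketched in-process with Python's standard `queue.Queue` standing in for the broker (a real system would use RabbitMQ, SQS, etc. — the names `handle_order` and `worker` here are illustrative):

```python
import queue
import threading

jobs = queue.Queue(maxsize=1000)   # bounded: producers block when full

def handle_order(order_id):
    # The "HTTP handler": enqueue and return immediately.
    jobs.put({"order_id": order_id})
    return "202 Accepted"

def worker():
    # Drains the queue at its own sustainable pace (load leveling).
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut down
            break
        # ... resize image / send email / score fraud ...
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
print(handle_order(42))            # handler returns at once; work happens async
jobs.put(None)
t.join()
```

The bounded `maxsize` matters: an unbounded in-memory queue just moves the overload problem into your process.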


Message Queue vs Log-Based Broker

Two distinct paradigms are routinely called "message queues." Understanding the difference prevents mis-architecting systems.

Traditional Message Queue (RabbitMQ, ActiveMQ, SQS)

  • Messages are consumed and deleted. Once a worker acks a message, it is gone.
  • Designed for task distribution: each message is processed by exactly one consumer.
  • Queue depth is the backlog; the goal is to drain it.
  • Low per-message retention overhead.
  • Consumers can selectively pull, ack/nack, and requeue.

Log-Based Broker (Apache Kafka, Pulsar, Kinesis)

  • Messages are retained as an immutable log for a configurable time or size window.
  • Designed for event streaming: many independent consumer groups can each replay the full log at their own pace.
  • Position in the log (offset) is tracked per consumer group, not per message.
  • High-throughput, sequential I/O makes it suitable for hundreds of MB/s per broker.
  • Supports replaying historical events — invaluable for rebuilding derived state (search indexes, caches, analytics).

Rule of thumb: use a traditional queue when you need work distribution with routing flexibility and you do not need replay. Use a log broker when you need high throughput, fan-out to many consumers, or event sourcing.

Remember: "Queues eat messages, logs keep them." A traditional queue deletes after ack; a log broker retains. If you need replay, you need a log.


Core Concepts

Producers, Consumers, and Brokers

  • Producer (publisher): the application that creates and sends messages.
  • Broker: the server (or cluster) that receives, stores, and routes messages.
  • Consumer (subscriber): the application that reads and processes messages.
  • Queue (traditional): a named buffer that holds messages until a consumer takes them.
  • Topic (log-based): a named log partitioned across brokers.

Exchanges and Bindings (RabbitMQ)

RabbitMQ adds a routing layer between producers and queues:

Producer → Exchange → Binding → Queue → Consumer

The exchange receives messages from producers and routes them to queues based on the routing key and exchange type. Bindings define the rules connecting an exchange to a queue.

Exchange types:

  • Direct: routes to queues whose binding key exactly matches the routing key. Used for point-to-point.
  • Fanout: routes to all bound queues, ignoring the routing key. Used for broadcast/pub-sub.
  • Topic: routes based on wildcard pattern matching on the routing key. logs.# matches logs.error, logs.warn.db, etc.
  • Headers: routes based on message header attributes instead of the routing key. Rarely used.
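AMQP topic matching treats the routing key as dot-separated words: * matches exactly one word, # matches zero or more. The broker does the real matching; this is a rough, illustrative re-implementation of the rule:

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Rough sketch of AMQP topic-exchange matching (illustrative only)."""
    def match(pat, words):
        if not pat:
            return not words
        head, rest = pat[0], pat[1:]
        if head == "#":
            # '#' matches zero or more words: try every possible split
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        if head == "*" or head == words[0]:
            return match(rest, words[1:])
        return False
    return match(binding_key.split("."), routing_key.split("."))

assert topic_matches("logs.#", "logs.warn.db")
assert topic_matches("*.error", "app.error")
assert not topic_matches("*.error", "app.db.error")   # '*' is one word only
```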

Topics and Partitions (Kafka)

A Kafka topic is divided into one or more partitions. Each partition is an ordered, immutable sequence of records. Partitions enable parallelism: different partitions can be written and read by different brokers and consumers concurrently.


Delivery Semantics

Getting delivery semantics wrong causes either data loss or duplicate processing.

At-Most-Once

Message is delivered zero or one times. The broker does not wait for consumer acknowledgment before discarding the message. If the consumer crashes after receiving but before processing, the message is lost.

  • Use case: metrics, telemetry where occasional loss is acceptable.
  • Implementation: auto-ack before processing; Kafka enable.auto.commit=true with short interval.

At-Least-Once

Message is delivered one or more times. The broker retains the message until the consumer explicitly acknowledges success. If the consumer crashes mid-processing, the message is redelivered — but duplicate processing is possible.

  • Use case: most operational systems. Default safe choice.
  • Requirement: consumers must be idempotent — processing the same message twice must produce the same result.
  • Implementation: manual ack after processing; Kafka manual offset commit after handling.
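A common way to get idempotency is a processed-IDs store keyed by a stable message ID. A minimal sketch — the in-memory set, the `id`/`amount` fields, and `apply_side_effects` are assumptions for illustration; in production the dedup record should live in the same transaction as the side effect:

```python
ledger = []

def apply_side_effects(message):
    # Stand-in for the real work (DB write, charge card, call API).
    ledger.append(message["amount"])

processed = set()  # in production: a DB table or Redis set

def handle(message):
    msg_id = message["id"]           # producer attaches a stable, unique ID
    if msg_id in processed:
        return "duplicate-skipped"   # redelivery after a crash becomes a no-op
    apply_side_effects(message)
    processed.add(msg_id)            # ideally committed atomically with the work
    return "processed"

msg = {"id": "evt-1", "amount": 50}
handle(msg)
handle(msg)                          # at-least-once redelivery: safely skipped
```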

Exactly-Once

Message is delivered and processed exactly once. Requires coordination between producer, broker, and consumer to prevent both loss and duplication.

  • Kafka: achieved via idempotent producers (enable.idempotence=true) + transactional APIs (transactional.id) + read-process-write transactions. Not free: ~20% throughput overhead, requires careful consumer configuration.
  • RabbitMQ: no native exactly-once. Achieve it at the application layer via idempotent consumers + deduplication keys.
  • Use case: financial transactions, inventory adjustments, billing events.

Gotcha: "Exactly-once" in Kafka is exactly-once within Kafka. The moment your consumer has a side effect outside Kafka (writes to a database, calls an API), you are back to at-least-once unless the consumer is idempotent. The Kafka docs themselves call this "effectively once."


Consumer Groups and Partitioning (Kafka)

A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group at a time. This is Kafka's scalability mechanism.

Topic: orders (4 partitions: P0, P1, P2, P3)

Consumer Group A (2 consumers):
  Consumer A1 → P0, P1
  Consumer A2 → P2, P3

Consumer Group B (4 consumers):
  Consumer B1 → P0
  Consumer B2 → P1
  Consumer B3 → P2
  Consumer B4 → P3

Key rules:

  • More consumers than partitions = idle consumers. You cannot scale beyond the partition count.
  • A single-partition topic = serialized processing per consumer group (safe for ordering-sensitive workflows).
  • Adding or removing consumers triggers a rebalance where partition assignments are redistributed.

Partition key selection determines message routing. Messages with the same key always land on the same partition (within a topic), preserving ordering for that key. Example: use user_id as partition key if you need all events for a user to be processed in order.
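The routing rule reduces to a stable hash of the key modulo the partition count. Kafka's default partitioner uses murmur2; the crc32 below is just a stand-in to illustrate the principle:

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: bytes) -> int:
    # Stable hash of the key, modulo partition count.
    # (Kafka's default partitioner uses murmur2, not crc32.)
    return zlib.crc32(key) % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved:
assert partition_for(b"user-42") == partition_for(b"user-42")
```

This is also why changing the partition count breaks key-to-partition stability: the modulus changes, so existing keys can land on different partitions.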


Dead-Letter Queues and Poison Message Handling

A dead-letter queue (DLQ) is a holding area for messages that cannot be processed successfully. Without a DLQ, a poison message (malformed, triggers a bug, causes the consumer to crash) can block queue processing indefinitely.

RabbitMQ DLQ Setup

# Declare main queue with dead-letter exchange
rabbitmqadmin declare queue name=orders \
  arguments='{"x-dead-letter-exchange":"orders.dlx","x-dead-letter-routing-key":"orders.dead"}'

# Declare the DLX and DLQ
rabbitmqadmin declare exchange name=orders.dlx type=direct
rabbitmqadmin declare queue name=orders.dlq
rabbitmqadmin declare binding source=orders.dlx \
  destination=orders.dlq routing_key=orders.dead

Messages land in the DLQ when:

  • The consumer nacks with requeue=false
  • The message TTL expires
  • The queue length limit is exceeded
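The nack path looks like this in a pika-style consumer callback — a sketch, with `process` as a stand-in for your handler (the client library passes `channel` and `method` in, so no broker is needed to see the shape):

```python
import json

def process(msg):
    # Stand-in handler: reject messages flagged as bad.
    if msg.get("bad"):
        raise ValueError("poison message")

def on_message(channel, method, properties, body):
    try:
        process(json.loads(body))
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False: the broker routes the message to the configured DLX
        # instead of redelivering it in an infinite loop.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
```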

Kafka DLQ Pattern

Kafka has no native DLQ concept. Implement it in the consumer:

# Sketch using a confluent-kafka-style consumer and producer
try:
    process(record)
    consumer.commit()
except ProcessingError as e:
    # Produce to a dead-letter topic, preserving provenance in headers
    producer.produce(
        topic=f"{record.topic()}.dlq",
        value=record.value(),
        headers=[
            ("error", str(e).encode()),
            ("original-offset", str(record.offset()).encode()),
        ],
    )
    producer.flush()   # ensure the DLQ write is durable...
    consumer.commit()  # ...before committing past the poison message

Retry with Exponential Backoff

For transient errors, retry before sending to DLQ. Use retry topics: orders → orders.retry.1 → orders.retry.2 → orders.dlq

Each retry topic consumer sleeps for progressively longer periods before re-processing.
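The routing and the delay schedule can be sketched as two small functions — the topic names follow the chain above; the jittered-backoff formula is a common convention, not a broker feature:

```python
import random

def topic_for_attempt(attempt: int) -> str:
    # 0 -> orders (first try), 1-2 -> retry tiers, then dead-letter.
    if attempt == 0:
        return "orders"
    if attempt <= 2:
        return f"orders.retry.{attempt}"
    return "orders.dlq"

def retry_delay(attempt: int, base_s: float = 1.0) -> float:
    # Exponential backoff with full jitter: random in [0, base * 2^attempt),
    # so retrying consumers don't hammer a recovering dependency in lockstep.
    return random.uniform(0, base_s * 2 ** attempt)
```

Full jitter (randomizing the whole interval rather than adding a small offset) is the usual choice because it spreads retries most evenly.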


Backpressure and Flow Control

Backpressure is the mechanism by which a slow consumer signals a fast producer to slow down, preventing unbounded queue growth and OOM crashes.

Kafka Backpressure

Producers have built-in flow control:

  • max.block.ms: how long the producer blocks waiting for buffer space before throwing an exception.
  • buffer.memory: total producer buffer size.
  • linger.ms + batch.size: batching amortizes per-request overhead and smooths bursts.

Consumers control their own pace: they pull records, so a slow consumer simply does not fetch more until ready.
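When processing is handed off to worker threads, consumers often add hysteresis: stop fetching near a local buffer cap, resume once it drains. A sketch of the decision logic — the watermark values are illustrative, and the `consumer.pause`/`consumer.resume` calls named in the comments refer to the Kafka client API:

```python
HIGH_WATER = 80   # pause fetching at/above this many buffered records
LOW_WATER = 20    # resume fetching at/below this many

def fetch_decision(buffered: int, paused: bool) -> str:
    # Hysteresis avoids flapping between pause and resume on every record:
    # pause near the cap, resume only once the buffer has mostly drained.
    if not paused and buffered >= HIGH_WATER:
        return "pause"    # e.g. consumer.pause(assigned_partitions)
    if paused and buffered <= LOW_WATER:
        return "resume"   # e.g. consumer.resume(assigned_partitions)
    return "no-op"
```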

RabbitMQ Prefetch (QoS)

The basic.qos prefetch count limits how many unacked messages the broker sends to a consumer at once:

channel.basic_qos(prefetch_count=10)

Without prefetch, RabbitMQ can push the entire queue to a single consumer's in-memory buffer, starving other consumers and causing that consumer to OOM.

Rule: always set prefetch. A value of 1 maximizes fairness but minimizes throughput. A value of 10-100 is a common starting point; tune based on message processing time.
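A rough sizing heuristic — an application of Little's law, not a RabbitMQ formula — is prefetch ≈ message rate × per-message processing time, with some headroom:

```python
def suggested_prefetch(msgs_per_sec: float, processing_s: float,
                       headroom: float = 2.0) -> int:
    # Little's law: messages in flight = arrival rate x time in system.
    # Headroom covers network round-trips and processing-time variance.
    return max(1, round(msgs_per_sec * processing_s * headroom))

# 100 msg/s at 50 ms each -> ~10 in flight
print(suggested_prefetch(100, 0.05))  # 10
```

Treat the result as a starting point and tune against measured throughput and fairness.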

RabbitMQ Flow Alarms

When broker memory exceeds vm_memory_high_watermark (default 40% of RAM), RabbitMQ blocks all producer connections. This is a hard backpressure signal. Monitor for this alarm — it indicates sustained consumer lag.


Outbox Pattern for Transactional Messaging

The core problem: how do you atomically update the database and publish a message? If you update the DB and then publish, a crash between the two leaves the DB updated but no message published. If you publish first, a crash leaves a published message with no DB change.

The outbox pattern solves this with a two-phase approach:

  1. Write phase: in the same DB transaction as your business logic, write the event to an outbox table.
  2. Relay phase: a separate relay process (or CDC connector) reads the outbox table and publishes events to the broker, marking rows as published.
-- Within the business transaction
BEGIN;
UPDATE orders SET status = 'confirmed' WHERE id = $1;
INSERT INTO outbox (event_type, payload, created_at)
  VALUES ('order.confirmed', $2::jsonb, NOW());
COMMIT;

The relay can use polling or Change Data Capture (Debezium) to watch the outbox table. CDC-based relay achieves near-zero latency and avoids polling overhead.
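A polling relay pass can be sketched as below. Assumptions: a `published_at` column added to the outbox table shown above, sqlite used only to make the sketch self-contained, and `publish` standing in for your broker client. Note the delivery guarantee: a crash between publish and the UPDATE re-sends the row, so this is at-least-once and downstream consumers must be idempotent.

```python
import sqlite3

def relay_once(conn, publish):
    # One polling pass: read unpublished outbox rows in insertion order,
    # publish each, then mark it so it is not sent again.
    cur = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id LIMIT 100"
    )
    for row_id, event_type, payload in cur.fetchall():
        publish(event_type, payload)   # crash here => row re-sent next pass
        conn.execute(
            "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
            (row_id,),
        )
    conn.commit()
```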

Why it matters: the outbox pattern is the correct way to integrate relational databases with event-driven systems without distributed transactions.

Under the hood: The outbox pattern sidesteps the Two-Phase Commit (2PC) problem. 2PC across a database and a message broker requires both to support XA transactions, which is fragile and slow. The outbox pattern uses the database's own ACID transaction as the single source of truth, then relays asynchronously. Debezium (Red Hat's CDC tool) is the most popular outbox relay implementation.


Kafka Specifics

Partitions, Offsets, and Rebalancing

Each message in a Kafka partition has a monotonically increasing offset. Consumer groups track their position via committed offsets, stored in the __consumer_offsets internal topic.

When the consumer group membership changes (consumer joins, leaves, or crashes), Kafka triggers a rebalance. During rebalance, all consumers in the group stop consuming (stop-the-world by default in older protocols). The group coordinator redistributes partition assignments.

Frequent rebalances (due to slow consumers triggering session timeouts, or frequent deploys) cause consumer lag spikes. Mitigate with:

  • Increase session.timeout.ms and heartbeat.interval.ms
  • Increase max.poll.interval.ms for slow processing
  • Use incremental cooperative rebalancing (Kafka 2.4+): only reassigned partitions are paused, not all.
  • Use static group membership (group.instance.id) to avoid rebalances on restart.

Log Compaction

Kafka can compact topics instead of (or in addition to) time/size-based retention. Log compaction keeps the latest value for each key in a partition, discarding older records with the same key. Tombstones (null value) mark deletions.

Compacted topics are suitable for changelog streams where only the latest state matters: configuration changes, user profile updates, inventory levels. Consumers can bootstrap state by replaying the compacted log from the beginning.

Enable compaction: cleanup.policy=compact (or compact,delete for both compaction and time-based deletion).
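The semantics can be sketched as a latest-value-per-key fold over the partition's records. This is a simplification — real compaction runs per partition segment in the background and keeps tombstones around for a retention window so consumers can observe the delete:

```python
def compact(log):
    # Keep only the latest record per key, in offset order.
    # A None value is a tombstone that deletes the key.
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)   # tombstone: drop the key
        else:
            latest[key] = value
    return latest

log = [("user-1", "v1"), ("user-2", "a"), ("user-1", "v2"), ("user-2", None)]
print(compact(log))   # {'user-1': 'v2'}
```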

Producer Tuning

# Reliability
acks=all                    # all in-sync replicas must ack
enable.idempotence=true     # dedup at producer level
retries=2147483647          # retry until delivery.timeout.ms elapses
max.in.flight.requests.per.connection=5  # safe with idempotence enabled

# Throughput
batch.size=65536            # larger batches, higher throughput
linger.ms=5                 # wait up to 5ms to fill batch
compression.type=lz4        # compress before send

RabbitMQ Specifics

Exchange Types in Detail

# Direct exchange — route by exact routing key
rabbitmqadmin declare exchange name=tasks type=direct durable=true
rabbitmqadmin declare queue name=email-tasks durable=true
rabbitmqadmin declare binding source=tasks destination=email-tasks routing_key=email

# Topic exchange — route by pattern
rabbitmqadmin declare exchange name=logs type=topic durable=true
rabbitmqadmin declare queue name=error-logs durable=true
rabbitmqadmin declare binding source=logs destination=error-logs routing_key="*.error"
rabbitmqadmin declare binding source=logs destination=error-logs routing_key="logs.#"

# Fanout exchange — broadcast to all bound queues
rabbitmqadmin declare exchange name=notifications type=fanout durable=true

Clustering and High Availability

RabbitMQ clusters share metadata (exchanges, queues, bindings, users) but queues are by default not replicated — queue data lives on the node where it was declared. Node failure = queue loss unless replicated.

Classic mirrored queues (the pre-3.8 HA mechanism, now deprecated): an ha-mode: all policy mirrors queue content to all nodes. Synchronous replication. High write amplification.

Quorum queues (3.8+, recommended): Raft-based replication. Majority quorum required for writes. Better durability guarantees, better performance under failures. Use these for any production queue.

rabbitmqadmin declare queue name=critical-tasks \
  arguments='{"x-queue-type":"quorum"}'

Who made it: RabbitMQ was created in 2007 by Rabbit Technologies, a small London startup, using Erlang -- a language originally designed by Ericsson for telephone switches. Erlang's lightweight process model and fault-tolerance features made it an ideal fit for a message broker. VMware acquired the company in 2010, and RabbitMQ later moved under Pivotal (now VMware Tanzu).

Monitoring and Observability

Critical Metrics

  • Consumer lag (Kafka): messages produced but not yet consumed. Growing lag = consumer is falling behind.
  • Queue depth (RabbitMQ): messages waiting to be consumed. Growing depth = backlog.
  • Unacked messages (RabbitMQ): messages delivered to consumers but not acknowledged. High count = slow consumers or a consumer crash.
  • Messages/sec in vs out: throughput balance. If in > out persistently, backlog grows.
  • Rebalance rate (Kafka): frequent rebalances indicate instability.
  • Broker disk/memory: brokers running out of disk stop accepting writes.
  • Error rate / DLQ depth: indicates processing failures.

Consumer Lag Monitoring (Kafka)

# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-consumer-group

# Output columns: TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, CONSUMER-ID

Lag = LOG-END-OFFSET - CURRENT-OFFSET. Alert when lag > threshold or when lag is growing (trend matters more than absolute value).
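An alerting rule on trend can be sketched like this — the thresholds and window are illustrative assumptions, not values from any monitoring tool:

```python
def lag_is_unhealthy(samples, max_lag=10_000, trend_window=3):
    # samples: consumer-lag readings, oldest first.
    # Fire if the latest lag exceeds a hard ceiling, OR if lag has grown
    # across each of the last `trend_window` intervals (sustained trend),
    # which catches a falling-behind consumer before the ceiling is hit.
    if samples[-1] > max_lag:
        return True
    recent = samples[-(trend_window + 1):]
    return (len(recent) == trend_window + 1
            and all(b > a for a, b in zip(recent, recent[1:])))

assert lag_is_unhealthy([100, 200, 300, 400])      # sustained growth
assert not lag_is_unhealthy([500, 400, 450, 430])  # bouncing, not trending
```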

Burrow (LinkedIn's open source tool) tracks lag trends rather than snapshots, reducing false alarms from normal batch processing spikes.

RabbitMQ Monitoring

# Queue depths and consumer counts
rabbitmqctl list_queues name messages consumers memory

# Node health
rabbitmqctl node_health_check

# Check for flow control (blocked producers)
rabbitmqctl list_connections state | grep blocked

Expose metrics to Prometheus via the rabbitmq_prometheus plugin (enabled by default in 3.8+). Key metrics: rabbitmq_queue_messages, rabbitmq_queue_messages_unacked, rabbitmq_queue_messages_ready.


Key Takeaways

  1. Choose the right tool: traditional queues for task distribution; log brokers for event streaming and fan-out.
  2. Default to at-least-once and make consumers idempotent. Exactly-once is expensive and rarely necessary.
  3. Always configure DLQs. Poison messages without a DLQ block or drop silently.
  4. Set prefetch on RabbitMQ consumers. Without it, one consumer can monopolize the queue.
  5. Monitor consumer lag, not just queue depth. Lag trend reveals whether you are keeping up.
  6. Use the outbox pattern to atomically combine DB writes with event publication.
  7. Partition count determines Kafka parallelism ceiling. Plan partition count ahead — it is hard to change.
  8. Rebalances are disruptive. Use incremental cooperative rebalancing and static membership for production Kafka.
  9. Use quorum queues in RabbitMQ for any queue that must survive a broker node failure.
  10. Backpressure is not optional. Unbounded queues eventually exhaust broker memory or disk.
