Comparison: Messaging¶
Category: Messaging. Last meaningful update: 2026-03. Verdict (opinionated): SQS for simple AWS queues — zero ops, just works. Kafka for event streaming and replay. RabbitMQ for complex routing patterns. NATS for lightweight, high-performance messaging.
Quick Decision Matrix¶
| Factor | RabbitMQ | Kafka | NATS | SQS |
|---|---|---|---|---|
| Learning curve | Medium | High | Low-Medium | Low |
| Operational overhead | Medium | High | Low | None (AWS-managed) |
| Cost at small scale | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) | Pay-per-request (~$0) |
| Cost at large scale | Medium | Medium-High | Low | Moderate ($0.40/M requests) |
| Community/ecosystem | Large | Massive | Growing | AWS-only |
| Hiring | Moderate | Growing (hot skill) | Growing | Easy (AWS) |
| Message model | Queue (push to consumers) | Log (pull from offset) | Pub/sub + queue (JetStream) | Queue (poll-based) |
| Ordering | Per-queue FIFO | Per-partition ordering | Per-stream (JetStream) | FIFO queues (optional) |
| Replay | No (consumed = gone) | Yes (offset-based) | Yes (JetStream) | No |
| Routing | Advanced (exchanges, bindings, headers) | Topic partitions | Subjects (hierarchical) | Basic (queue per consumer) |
| Persistence | Optional (durable queues) | Always (commit log) | Optional (JetStream) | Always (AWS-managed) |
| Throughput | ~50K msg/s per node | ~1M msg/s per cluster | ~10M msg/s per node | Limited by API rate |
| Latency | Low (ms) | Low (ms) | Very low (sub-ms) | Higher (polling delay) |
| Protocol | AMQP 0.9.1, MQTT, STOMP | Kafka protocol (binary) | NATS protocol (text-based) | HTTP/SQS API |
| Dead letter queue | Yes | No (manual implementation) | No (manual) | Yes (native) |
| Exactly-once | No (at-least-once) | Yes (with transactions) | No (at-least-once) | Yes (FIFO dedup) |
| K8s operator | RabbitMQ Cluster Operator | Strimzi, Confluent Operator | NATS Operator | N/A |
When to Pick Each¶
Pick RabbitMQ when:¶
- You need flexible message routing: topic exchanges, header-based routing, fan-out to multiple queues
- Work queue pattern: distribute tasks among workers with acknowledgment and retry
- You need multiple protocol support: AMQP, MQTT (IoT), STOMP (web sockets)
- Dead letter queues for failed message handling are built-in
- Message priority is needed (priority queues)
- Your team wants a traditional message broker that is conceptually simple
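The flexible routing above hinges on AMQP topic-exchange wildcard semantics: `*` matches exactly one dot-separated word, `#` matches zero or more. A toy matcher (pure Python, no broker — `topic_match` and the `bindings` dict are illustrative names, not part of any RabbitMQ client library) sketches how a topic exchange decides which queues receive a message:

```python
def topic_match(binding: str, routing_key: str) -> bool:
    """True if an AMQP-style topic binding matches a routing key.
    '*' matches exactly one dot-separated word; '#' matches zero or more."""
    bind = binding.split(".")
    key = routing_key.split(".")

    def match(b: int, k: int) -> bool:
        if b == len(bind):
            return k == len(key)
        if bind[b] == "#":
            # '#' may absorb zero or more remaining words
            return any(match(b + 1, k + i) for i in range(len(key) - k + 1))
        if k == len(key):
            return False
        if bind[b] in ("*", key[k]):
            return match(b + 1, k + 1)
        return False

    return match(0, 0)

# Fan a message out to every queue whose binding matches its routing key.
bindings = {"all-logs": "#", "eu-orders": "order.eu.*", "audit": "*.eu.#"}
routed = [q for q, b in bindings.items() if topic_match(b, "order.eu.created")]
```

Here all three bindings match `order.eu.created`, so all three queues get a copy — the fan-out falls out of the binding patterns, not out of producer code.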
Pick Kafka when:¶
- You need event streaming, not just messaging — event replay, reprocessing, and audit trails
- High throughput and durable ordered event logs are requirements
- Multiple consumers need to read the same events independently (consumer groups)
- You are building event-driven architectures, CQRS, or change data capture pipelines
- You need to retain events for days/weeks/indefinitely (not just until consumed)
- Stream processing (Kafka Streams, ksqlDB, Flink) is on your roadmap
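The log-vs-queue distinction driving these points can be shown in a few lines. This is a toy in-memory sketch of Kafka's model (a single-partition `Log` class invented for illustration, not a Kafka API): records survive being read, and each consumer group just tracks an offset it can rewind.

```python
from collections import defaultdict

class Log:
    """Toy append-only log: records stay after being read; each
    consumer group tracks its own offset and can rewind it (replay)."""
    def __init__(self):
        self.records = []
        self.offsets = defaultdict(int)  # group name -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group: str, max_records: int = 100):
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # "commit" after the read
        return batch

    def seek(self, group: str, offset: int):
        """Reading never deletes records, so rewinding replays them."""
        self.offsets[group] = offset

log = Log()
for event in ["created", "paid", "shipped"]:
    log.append(event)

analytics = log.poll("analytics")   # each group reads independently
billing = log.poll("billing")       # both groups see all three events
log.seek("billing", 0)              # reprocess from the beginning
replayed = log.poll("billing")
```

In a queue, `billing` reading a message would have hidden it from `analytics`; in a log, both groups — and any future group — get the full history.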
Pick NATS when:¶
- You need lightweight, high-performance pub/sub with minimal operational overhead
- Latency is critical — NATS has the lowest latency of any option here
- You want a single binary that runs anywhere (bare metal, K8s, edge, IoT)
- JetStream persistence is sufficient (you do not need Kafka-level stream processing)
- Your architecture is microservices-heavy and needs request-reply patterns
- Edge computing or IoT messaging where Kafka is too heavy
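NATS's hierarchical subjects work like dot-separated paths with two wildcards: `*` matches exactly one token, and `>` (only valid as the last token) matches one or more trailing tokens. A minimal matcher (`subject_match` is an illustrative function, not part of the NATS client) sketches the semantics:

```python
def subject_match(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' matches exactly one token;
    '>' must be the final token and matches one or more trailing tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            # '>' must trail and must cover at least one token
            return i == len(p) - 1 and len(s) > i
        if i >= len(s) or (tok != "*" and tok != s[i]):
            return False
    return len(p) == len(s)

# Services can subscribe at whatever granularity they need.
assert subject_match("orders.*.created", "orders.eu.created")
assert subject_match("orders.>", "orders.eu.created.v2")
assert not subject_match("orders.>", "orders")   # '>' needs >=1 token after
```

This is why subject design matters in NATS: the hierarchy *is* the routing layer, with no exchanges or bindings to configure.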
Pick SQS when:¶
- You are AWS-only and want zero operational overhead
- Simple work queue: produce messages, consume messages, delete messages
- You do not need complex routing, streaming, or replay
- Dead letter queues, long polling, and visibility timeouts fit your model
- FIFO ordering is needed for specific use cases (SQS FIFO queues)
- You need a queue that scales from zero to millions of messages without provisioning
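The whole SQS consumer contract fits in one loop: receive, process, delete — and delete only after success, so a crash redelivers the message. A hedged sketch (the client is injected so any object with boto3's `receive_message`/`delete_message` shape works; `drain_queue` and the queue URL are illustrative, not an AWS API):

```python
def drain_queue(sqs, queue_url: str, handler) -> int:
    """Receive, process, then delete. Deleting only after the handler
    succeeds gives at-least-once semantics: a crash mid-handler means
    the message reappears after its visibility timeout. Returns count."""
    handled = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: don't pay for empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            return handled
        for msg in messages:
            handler(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            handled += 1
```

In production you would pass `boto3.client("sqs")` and a real queue URL; the loop shape is the point — notice there is no ack concept, only delete-by-receipt-handle.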
Nobody Tells You¶
RabbitMQ¶
- RabbitMQ queues are stored in memory with optional disk persistence. Under load, if consumers are slow, queues grow in memory and trigger flow control (backpressure), which stops publishers. This can cascade to producer timeouts and application failures.
- The management UI is useful for debugging but also a security concern. Default credentials (guest/guest) only work from localhost, but many deployments expose the management plugin improperly.
- Mirrored (classic HA) queues were superseded by quorum queues in RabbitMQ 3.8 and removed entirely in RabbitMQ 4.0. Quorum queues use Raft consensus and are more reliable but have different performance characteristics. Migrate off mirrored queues before upgrading.
- RabbitMQ's Erlang runtime introduces operational complexity. Erlang upgrades, cookie management, and cluster formation errors are all Erlang-specific issues you would not encounter with a Go or Java application.
- Message acknowledgment is critical. Without proper ack/nack handling, messages are either lost (auto-ack) or redelivered infinitely (no ack). This is the #1 source of production issues.
- The exchange → binding → queue model is powerful but creates a configuration surface that must be managed. Exchange and queue declarations should be in code (not manual), and lifecycle management (deleting unused queues) requires attention.
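The ack/nack point above is easiest to see in a toy model (an `AckQueue` class invented for illustration, not the pika API): a delivered-but-unacked message is requeued when the consumer's channel dies, which is exactly why forgetting to ack means infinite redelivery and auto-ack means silent loss.

```python
class AckQueue:
    """Toy broker queue with manual acks: a delivered-but-unacked message
    is redelivered when the consumer channel closes, never dropped."""
    def __init__(self):
        self.ready = []      # messages waiting for delivery
        self.unacked = {}    # delivery_tag -> in-flight message
        self.tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        self.tag += 1
        msg = self.ready.pop(0)
        self.unacked[self.tag] = msg
        return self.tag, msg

    def ack(self, tag):
        del self.unacked[tag]   # acknowledged: the broker forgets it

    def close_channel(self):
        # Consumer died without acking: requeue everything it held.
        self.ready = list(self.unacked.values()) + self.ready
        self.unacked.clear()

q = AckQueue()
q.publish("charge-card")
tag, msg = q.deliver()
q.close_channel()          # crash before ack -> message survives
tag2, again = q.deliver()  # ...and is redelivered
q.ack(tag2)                # processed and acked -> gone for good
```

With auto-ack, the `close_channel()` step would have lost `charge-card`; with manual ack and no `ack()` call, it would loop forever. Both failure modes come from the same three-line contract.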
Kafka¶
- Kafka is operationally demanding. Partition rebalancing, log compaction, and retention management are ongoing operational concerns, and older clusters still carry a ZooKeeper dependency (removed entirely in Kafka 4.0 with KRaft).
- KRaft mode is production-ready and the only option from Kafka 4.0 onward, but migrating an existing ZooKeeper cluster to KRaft is a multi-step process. New clusters should use KRaft from the start.
- Partition count is a permanent-ish decision. Adding partitions to a topic does not redistribute existing data. Reducing partitions requires recreating the topic. Plan partition count based on expected throughput.
- Consumer group rebalancing causes processing pauses. During a rebalance (consumer joins/leaves/crashes), all consumers in the group stop processing. For latency-sensitive applications, this is a problem. Cooperative rebalancing mitigates but does not eliminate this.
- Kafka's "exactly-once" semantics require idempotent producers AND transactional consumers. Most applications settle for "at-least-once" with idempotent consumers because the exactly-once configuration is complex.
- Confluent Platform adds schema registry, KSQL, and management UI but the licensing model is commercial. Open-source Kafka requires assembling your own ecosystem.
- Disk throughput is Kafka's bottleneck, not CPU or memory. Use SSDs, ensure separate disks for logs and data, and monitor disk utilization religiously.
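The "at-least-once with idempotent consumers" compromise mentioned above is simple to sketch: dedupe on a stable message key before performing the side effect. In this toy version (an `IdempotentConsumer` class invented for illustration) a set stands in for the processed-keys store; a real implementation would persist the key in the same database transaction as the side effect.

```python
class IdempotentConsumer:
    """At-least-once delivery + dedupe by message key = effectively-once
    processing. The processed-key set is in-memory here; production code
    would persist it atomically with the side effect."""
    def __init__(self, handler):
        self.handler = handler
        self.processed = set()

    def handle(self, key: str, payload) -> bool:
        if key in self.processed:
            return False           # duplicate redelivery: skip the side effect
        self.handler(payload)
        self.processed.add(key)
        return True

ledger = []
consumer = IdempotentConsumer(ledger.append)
consumer.handle("order-42", 99.90)
consumer.handle("order-42", 99.90)  # redelivered after a rebalance: no double charge
```

This pattern works identically against Kafka, RabbitMQ, NATS, or SQS standard queues, which is why most teams reach for it instead of broker-level exactly-once configuration.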
NATS¶
- NATS core (without JetStream) is fire-and-forget. If no subscriber is listening when a message is published, the message is lost. This is by design for pub/sub but surprises teams expecting queue semantics.
- JetStream adds persistence, replay, and exactly-once delivery to NATS. It is essentially NATS's answer to Kafka but is newer and less battle-tested at Kafka-scale workloads.
- NATS's simplicity means fewer features. No built-in dead letter queues, no message priority, no complex routing (compared to RabbitMQ exchanges). You implement these patterns in your application.
- The NATS community is smaller than RabbitMQ or Kafka. When you hit an edge case, there are fewer resources available.
- NATS is written in Go and runs as a single binary with minimal dependencies. This makes it operationally simple but also means monitoring and observability require external tools.
- NATS supports request-reply patterns natively, making it suitable for synchronous RPC-style communication in addition to async messaging. This flexibility is underappreciated.
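The native request-reply mentioned above is just pub/sub plus a throwaway reply subject (an "inbox"). A toy synchronous sketch of the mechanism — the `Bus` class and method names are invented for illustration, not the nats-py API:

```python
import uuid

class Bus:
    """Toy pub/sub bus: request-reply is publish with a unique,
    single-use reply subject that the requester subscribes to first."""
    def __init__(self):
        self.subs = {}   # subject -> list of callbacks(data, reply_subject)

    def subscribe(self, subject, cb):
        self.subs.setdefault(subject, []).append(cb)

    def publish(self, subject, data, reply=None):
        for cb in self.subs.get(subject, []):
            cb(data, reply)

    def request(self, subject, data):
        inbox = f"_INBOX.{uuid.uuid4().hex}"   # unique reply subject
        box = []
        self.subscribe(inbox, lambda d, _reply: box.append(d))
        self.publish(subject, data, reply=inbox)
        return box[0] if box else None

bus = Bus()
# A responder service: replies to the requester's inbox subject.
bus.subscribe("greet", lambda d, reply: bus.publish(reply, d.upper()))
answer = bus.request("greet", "hello")
```

Real NATS does this asynchronously with timeouts, but the shape is the same: RPC emerges from pub/sub rather than being a separate protocol feature.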
SQS¶
- SQS long polling (WaitTimeSeconds > 0) is essential. Without it, you pay for empty receive requests and your consumer loops waste CPU. Always use long polling.
- SQS message visibility timeout determines how long a message is hidden after being received. If processing takes longer than the timeout, another consumer receives the same message. Set timeouts carefully or use heartbeats.
- SQS FIFO queues are limited to 300 API calls/second per queue, or 3,000 messages/second with batching (high-throughput mode raises this further). Standard queues are effectively unlimited but unordered.
- SQS does not push messages to consumers. Your application must poll. For near-real-time processing, Lambda triggers with SQS event source mappings eliminate the polling concern.
- Message size limit is 256KB. Larger payloads require the "claim check" pattern: store the payload in S3, send the S3 key in the SQS message.
- SQS deduplication (FIFO) uses a 5-minute deduplication window. Messages with the same deduplication ID within 5 minutes are dropped. This can be a feature (exactly-once) or a bug (legitimate duplicate messages) depending on your use case.
- There is no SQS equivalent outside AWS. If you build on SQS and later need to go multi-cloud, you are rewriting your messaging layer.
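The claim-check pattern from the 256KB point above is worth seeing concretely. In this hedged sketch a dict stands in for the S3 bucket and `send`/`receive` produce and consume the SQS message body; the function names and key scheme are illustrative, not an AWS API.

```python
import json
import uuid

BLOB_STORE = {}          # stand-in for an S3 bucket
LIMIT = 256 * 1024       # SQS hard cap on message body size (bytes)

def send(payload: bytes) -> str:
    """Build the SQS message body: inline if it fits, else store the
    payload out-of-band and send only a pointer (the 'claim check')."""
    if len(payload) <= LIMIT:
        return json.dumps({"inline": payload.decode()})
    key = f"payloads/{uuid.uuid4()}"      # would be an S3 object key
    BLOB_STORE[key] = payload
    return json.dumps({"s3_key": key})    # only the pointer travels via SQS

def receive(body: str) -> bytes:
    msg = json.loads(body)
    if "inline" in msg:
        return msg["inline"].encode()
    return BLOB_STORE[msg["s3_key"]]      # redeem the claim check

big = b"x" * (300 * 1024)                 # over the 256KB cap
assert receive(send(big)) == big          # round-trips via the blob store
```

One operational caveat the sketch glosses over: the out-of-band blobs need their own lifecycle policy, or the bucket accumulates orphaned payloads for messages that were long since consumed.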
Migration Pain Assessment¶
| From → To | Effort | Risk | Timeline |
|---|---|---|---|
| RabbitMQ → Kafka | High | High | 2-4 months |
| Kafka → RabbitMQ | High | High | 2-4 months |
| SQS → RabbitMQ | Medium | Medium | 1-2 months |
| SQS → Kafka | Medium-High | Medium | 2-3 months |
| RabbitMQ → NATS | Medium | Medium | 1-2 months |
| Any → SQS | Medium | Low | 1-2 months |
Messaging migration is high-risk because message loss during migration can cause data inconsistency. The safe pattern: run both systems in parallel, dual-publish to both, migrate consumers one by one, then decommission the old system. Never cut over atomically.
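The dual-publish step of that pattern is usually a thin wrapper at the producer side. A sketch (the `DualPublisher` class is invented for illustration): the old system stays the system of record until cutover, so a failure on the new broker must never block the old path.

```python
class DualPublisher:
    """Migration shim: publish to both the old and new brokers so
    consumers can move over one at a time. The old system remains the
    system of record; shadow writes to the new one are best-effort."""
    def __init__(self, old_publish, new_publish, on_error):
        self.old = old_publish
        self.new = new_publish
        self.on_error = on_error

    def publish(self, msg):
        self.old(msg)                  # must succeed: source of truth
        try:
            self.new(msg)              # best-effort during migration
        except Exception as exc:
            self.on_error(f"shadow publish failed: {exc}")

old_log, new_log, errors = [], [], []
def flaky_new(msg):
    if msg == "boom":
        raise RuntimeError("new broker down")
    new_log.append(msg)

pub = DualPublisher(old_log.append, flaky_new, errors.append)
pub.publish("order-1")
pub.publish("boom")    # new side fails; the old path still delivers
```

Monitor the error channel during the migration window: a sustained gap between the two logs tells you the new system is not ready for cutover.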
The Interview Answer¶
"The choice depends on the messaging pattern. SQS for simple work queues on AWS — zero ops, infinite scale. Kafka for event streaming where consumers need to replay events, build materialized views, or reprocess data. RabbitMQ for complex routing patterns where exchanges, bindings, and priority queues matter. NATS for lightweight, low-latency pub/sub. The key insight is that queues and event streams are fundamentally different patterns: queues are 'process this task,' streams are 'this thing happened.' Kafka is a stream; SQS and RabbitMQ are queues. Choosing the wrong pattern causes more pain than choosing the wrong tool."
Cross-References¶
- Topic Packs: Kafka, RabbitMQ, Message Queues
- Related Comparisons: Relational Databases, Caching