Comparison: Messaging¶
Category: Messaging. Last meaningful update: 2026-03. Verdict (opinionated): SQS for simple AWS queues — zero ops, just works. Kafka for event streaming and replay. RabbitMQ for complex routing patterns. NATS for lightweight, high-performance messaging.
Quick Decision Matrix¶
| Factor | RabbitMQ | Kafka | NATS | SQS |
|---|---|---|---|---|
| Learning curve | Medium | High | Low-Medium | Low |
| Operational overhead | Medium | High | Low | None (AWS-managed) |
| Cost at small scale | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) | Pay-per-request (~$0) |
| Cost at large scale | Medium | Medium-High | Low | Moderate ($0.40/M requests) |
| Community/ecosystem | Large | Massive | Growing | AWS-only |
| Hiring | Moderate | Growing (hot skill) | Growing | Easy (AWS) |
| Message model | Queue (push to consumers) | Log (pull from offset) | Pub/sub + queue (JetStream) | Queue (poll-based) |
| Ordering | Per-queue FIFO | Per-partition ordering | Per-stream (JetStream) | FIFO queues (optional) |
| Replay | No (consumed = gone) | Yes (offset-based) | Yes (JetStream) | No |
| Routing | Advanced (exchanges, bindings, headers) | Topic partitions | Subjects (hierarchical) | Basic (queue per consumer) |
| Persistence | Optional (durable queues) | Always (commit log) | Optional (JetStream) | Always (AWS-managed) |
| Throughput | ~50K msg/s per node | ~1M msg/s per cluster | ~10M msg/s per node | Limited by API rate |
| Latency | Low (ms) | Low (ms) | Very low (sub-ms) | Higher (polling delay) |
| Protocol | AMQP 0.9.1, MQTT, STOMP | Kafka protocol (binary) | NATS protocol (text-based) | HTTP/SQS API |
| Dead letter queue | Yes | No (manual implementation) | No (manual) | Yes (native) |
| Exactly-once | No (at-least-once) | Yes (with transactions) | No (at-least-once) | Yes (FIFO dedup) |
| K8s operator | RabbitMQ Cluster Operator | Strimzi, Confluent Operator | NATS Operator | N/A |
When to Pick Each¶
Pick RabbitMQ when:¶
- You need flexible message routing: topic exchanges, header-based routing, fan-out to multiple queues
- Work queue pattern: distribute tasks among workers with acknowledgment and retry
- You need multiple protocol support: AMQP, MQTT (IoT), STOMP (web sockets)
- Dead letter queues for failed message handling are built-in
- Message priority is needed (priority queues)
- Your team wants a traditional message broker that is conceptually simple
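The flexible routing above hinges on AMQP topic-exchange wildcard semantics: `*` matches exactly one dot-separated word, `#` matches zero or more. A toy matcher (pure Python, no broker — `topic_match` and the `bindings` dict are illustrative names, not part of any RabbitMQ client library) sketches how a topic exchange decides which queues receive a message:

```python
def topic_match(binding: str, routing_key: str) -> bool:
    """True if an AMQP-style topic binding matches a routing key.
    '*' matches exactly one dot-separated word; '#' matches zero or more."""
    bind = binding.split(".")
    key = routing_key.split(".")

    def match(b: int, k: int) -> bool:
        if b == len(bind):
            return k == len(key)
        if bind[b] == "#":
            # '#' may absorb zero or more remaining words
            return any(match(b + 1, k + i) for i in range(len(key) - k + 1))
        if k == len(key):
            return False
        if bind[b] in ("*", key[k]):
            return match(b + 1, k + 1)
        return False

    return match(0, 0)

# Fan a message out to every queue whose binding matches its routing key.
bindings = {"all-logs": "#", "eu-orders": "order.eu.*", "audit": "*.eu.#"}
routed = [q for q, b in bindings.items() if topic_match(b, "order.eu.created")]
```

Here all three bindings match `order.eu.created`, so all three queues get a copy — the fan-out falls out of the binding patterns, not out of producer code.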
Pick Kafka when:¶
- You need event streaming, not just messaging — event replay, reprocessing, and audit trails
- High throughput and durable ordered event logs are requirements
- Multiple consumers need to read the same events independently (consumer groups)
- You are building event-driven architectures, CQRS, or change data capture pipelines
- You need to retain events for days/weeks/indefinitely (not just until consumed)
- Stream processing (Kafka Streams, ksqlDB, Flink) is on your roadmap
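The log-vs-queue distinction driving these points can be shown in a few lines. This is a toy in-memory sketch of Kafka's model (a single-partition `Log` class invented for illustration, not a Kafka API): records survive being read, and each consumer group just tracks an offset it can rewind.

```python
from collections import defaultdict

class Log:
    """Toy append-only log: records stay after being read; each
    consumer group tracks its own offset and can rewind it (replay)."""
    def __init__(self):
        self.records = []
        self.offsets = defaultdict(int)  # group name -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group: str, max_records: int = 100):
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # "commit" after the read
        return batch

    def seek(self, group: str, offset: int):
        """Reading never deletes records, so rewinding replays them."""
        self.offsets[group] = offset

log = Log()
for event in ["created", "paid", "shipped"]:
    log.append(event)

analytics = log.poll("analytics")   # each group reads independently
billing = log.poll("billing")       # both groups see all three events
log.seek("billing", 0)              # reprocess from the beginning
replayed = log.poll("billing")
```

In a queue, `billing` reading a message would have hidden it from `analytics`; in a log, both groups — and any future group — get the full history.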
Pick NATS when:¶
- You need lightweight, high-performance pub/sub with minimal operational overhead
- Latency is critical — NATS has the lowest latency of any option here
- You want a single binary that runs anywhere (bare metal, K8s, edge, IoT)
- JetStream persistence is sufficient (you do not need Kafka-level stream processing)
- Your architecture is microservices-heavy and needs request-reply patterns
- Edge computing or IoT messaging where Kafka is too heavy
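NATS's hierarchical subjects work like dot-separated paths with two wildcards: `*` matches exactly one token, and `>` (only valid as the last token) matches one or more trailing tokens. A minimal matcher (`subject_match` is an illustrative function, not part of the NATS client) sketches the semantics:

```python
def subject_match(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' matches exactly one token;
    '>' must be the final token and matches one or more trailing tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            # '>' must trail and must cover at least one token
            return i == len(p) - 1 and len(s) > i
        if i >= len(s) or (tok != "*" and tok != s[i]):
            return False
    return len(p) == len(s)

# Services can subscribe at whatever granularity they need.
assert subject_match("orders.*.created", "orders.eu.created")
assert subject_match("orders.>", "orders.eu.created.v2")
assert not subject_match("orders.>", "orders")   # '>' needs >=1 token after
```

This is why subject design matters in NATS: the hierarchy *is* the routing layer, with no exchanges or bindings to configure.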
Pick SQS when:¶
- You are AWS-only and want zero operational overhead
- Simple work queue: produce messages, consume messages, delete messages
- You do not need complex routing, streaming, or replay
- Dead letter queues, long polling, and visibility timeouts fit your model
- FIFO ordering is needed for specific use cases (SQS FIFO queues)
- You need a queue that scales from zero to millions of messages without provisioning
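The whole SQS consumer contract fits in one loop: receive, process, delete — and delete only after success, so a crash redelivers the message. A hedged sketch (the client is injected so any object with boto3's `receive_message`/`delete_message` shape works; `drain_queue` and the queue URL are illustrative, not an AWS API):

```python
def drain_queue(sqs, queue_url: str, handler) -> int:
    """Receive, process, then delete. Deleting only after the handler
    succeeds gives at-least-once semantics: a crash mid-handler means
    the message reappears after its visibility timeout. Returns count."""
    handled = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: don't pay for empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            return handled
        for msg in messages:
            handler(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            handled += 1
```

In production you would pass `boto3.client("sqs")` and a real queue URL; the loop shape is the point — notice there is no ack concept, only delete-by-receipt-handle.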
Nobody Tells You¶
RabbitMQ¶
- RabbitMQ queues are stored in memory with optional disk persistence. Under load, if consumers are slow, queues grow in memory and trigger flow control (backpressure), which stops publishers. This can cascade to producer timeouts and application failures.
- The management UI is useful for debugging but also a security concern. Default credentials (guest/guest) only work from localhost, but many deployments expose the management plugin improperly.
- Mirrored (classic HA) queues were superseded by quorum queues in RabbitMQ 3.8 and removed entirely in RabbitMQ 4.0. Quorum queues use Raft consensus and are more reliable but have different performance characteristics. Migrate off mirrored queues before upgrading.
- RabbitMQ's Erlang runtime introduces operational complexity. Erlang upgrades, cookie management, and cluster formation errors are all Erlang-specific issues you would not encounter with a Go or Java application.
- Message acknowledgment is critical. Without proper ack/nack handling, messages are either lost (auto-ack) or redelivered infinitely (no ack). This is the #1 source of production issues.
- The exchange → binding → queue model is powerful but creates a configuration surface that must be managed. Exchange and queue declarations should be in code (not manual), and lifecycle management (deleting unused queues) requires attention.
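The ack/nack point above is easiest to see in a toy model (an `AckQueue` class invented for illustration, not the pika API): a delivered-but-unacked message is requeued when the consumer's channel dies, which is exactly why forgetting to ack means infinite redelivery and auto-ack means silent loss.

```python
class AckQueue:
    """Toy broker queue with manual acks: a delivered-but-unacked message
    is redelivered when the consumer channel closes, never dropped."""
    def __init__(self):
        self.ready = []      # messages waiting for delivery
        self.unacked = {}    # delivery_tag -> in-flight message
        self.tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        self.tag += 1
        msg = self.ready.pop(0)
        self.unacked[self.tag] = msg
        return self.tag, msg

    def ack(self, tag):
        del self.unacked[tag]   # acknowledged: the broker forgets it

    def close_channel(self):
        # Consumer died without acking: requeue everything it held.
        self.ready = list(self.unacked.values()) + self.ready
        self.unacked.clear()

q = AckQueue()
q.publish("charge-card")
tag, msg = q.deliver()
q.close_channel()          # crash before ack -> message survives
tag2, again = q.deliver()  # ...and is redelivered
q.ack(tag2)                # processed and acked -> gone for good
```

With auto-ack, the `close_channel()` step would have lost `charge-card`; with manual ack and no `ack()` call, it would loop forever. Both failure modes come from the same three-line contract.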
Kafka¶
- Kafka is operationally demanding. Partition rebalancing, log compaction, and retention management are ongoing operational concerns, and older clusters still carry a ZooKeeper dependency (removed entirely in Kafka 4.0 with KRaft).
- KRaft mode is production-ready and the only option from Kafka 4.0 onward, but migrating an existing ZooKeeper cluster to KRaft is a multi-step process. New clusters should use KRaft from the start.
- Partition count is a permanent-ish decision. Adding partitions to a topic does not redistribute existing data. Reducing partitions requires recreating the topic. Plan partition count based on expected throughput.
- Consumer group rebalancing causes processing pauses. During a rebalance (consumer joins/leaves/crashes), all consumers in the group stop processing. For latency-sensitive applications, this is a problem. Cooperative rebalancing mitigates but does not eliminate this.
- Kafka's "exactly-once" semantics require idempotent producers AND transactional consumers. Most applications settle for "at-least-once" with idempotent consumers because the exactly-once configuration is complex.
- Confluent Platform adds schema registry, KSQL, and management UI but the licensing model is commercial. Open-source Kafka requires assembling your own ecosystem.
- Disk throughput is Kafka's bottleneck, not CPU or memory. Use SSDs, ensure separate disks for logs and data, and monitor disk utilization religiously.
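The "at-least-once with idempotent consumers" compromise mentioned above is simple to sketch: dedupe on a stable message key before performing the side effect. In this toy version (an `IdempotentConsumer` class invented for illustration) a set stands in for the processed-keys store; a real implementation would persist the key in the same database transaction as the side effect.

```python
class IdempotentConsumer:
    """At-least-once delivery + dedupe by message key = effectively-once
    processing. The processed-key set is in-memory here; production code
    would persist it atomically with the side effect."""
    def __init__(self, handler):
        self.handler = handler
        self.processed = set()

    def handle(self, key: str, payload) -> bool:
        if key in self.processed:
            return False           # duplicate redelivery: skip the side effect
        self.handler(payload)
        self.processed.add(key)
        return True

ledger = []
consumer = IdempotentConsumer(ledger.append)
consumer.handle("order-42", 99.90)
consumer.handle("order-42", 99.90)  # redelivered after a rebalance: no double charge
```

This pattern works identically against Kafka, RabbitMQ, NATS, or SQS standard queues, which is why most teams reach for it instead of broker-level exactly-once configuration.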
NATS¶
- NATS core (without JetStream) is fire-and-forget. If no subscriber is listening when a message is published, the message is lost. This is by design for pub/sub but surprises teams expecting queue semantics.
- JetStream adds persistence, replay, and exactly-once delivery to NATS. It is essentially NATS's answer to Kafka but is newer and less battle-tested at Kafka-scale workloads.
- NATS's simplicity means fewer features. No built-in dead letter queues, no message priority, no complex routing (compared to RabbitMQ exchanges). You implement these patterns in your application.
- The NATS community is smaller than RabbitMQ or Kafka. When you hit an edge case, there are fewer resources available.
- NATS is written in Go and runs as a single binary with minimal dependencies. This makes it operationally simple but also means monitoring and observability require external tools.
- NATS supports request-reply patterns natively, making it suitable for synchronous RPC-style communication in addition to async messaging. This flexibility is underappreciated.
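The native request-reply mentioned above is just pub/sub plus a throwaway reply subject (an "inbox"). A toy synchronous sketch of the mechanism — the `Bus` class and method names are invented for illustration, not the nats-py API:

```python
import uuid

class Bus:
    """Toy pub/sub bus: request-reply is publish with a unique,
    single-use reply subject that the requester subscribes to first."""
    def __init__(self):
        self.subs = {}   # subject -> list of callbacks(data, reply_subject)

    def subscribe(self, subject, cb):
        self.subs.setdefault(subject, []).append(cb)

    def publish(self, subject, data, reply=None):
        for cb in self.subs.get(subject, []):
            cb(data, reply)

    def request(self, subject, data):
        inbox = f"_INBOX.{uuid.uuid4().hex}"   # unique reply subject
        box = []
        self.subscribe(inbox, lambda d, _reply: box.append(d))
        self.publish(subject, data, reply=inbox)
        return box[0] if box else None

bus = Bus()
# A responder service: replies to the requester's inbox subject.
bus.subscribe("greet", lambda d, reply: bus.publish(reply, d.upper()))
answer = bus.request("greet", "hello")
```

Real NATS does this asynchronously with timeouts, but the shape is the same: RPC emerges from pub/sub rather than being a separate protocol feature.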
SQS¶
- SQS long polling (WaitTimeSeconds > 0) is essential. Without it, you pay for empty receive requests and your consumer loops waste CPU. Always use long polling.
- SQS message visibility timeout determines how long a message is hidden after being received. If processing takes longer than the timeout, another consumer receives the same message. Set timeouts carefully or use heartbeats.
- SQS FIFO queues are limited to 300 API calls/second per queue, or 3,000 messages/second with batching (high-throughput mode raises this further). Standard queues are effectively unlimited but unordered.
- SQS does not push messages to consumers. Your application must poll. For near-real-time processing, Lambda triggers with SQS event source mappings eliminate the polling concern.
- Message size limit is 256KB. Larger payloads require the "claim check" pattern: store the payload in S3, send the S3 key in the SQS message.
- SQS deduplication (FIFO) uses a 5-minute deduplication window. Messages with the same deduplication ID within 5 minutes are dropped. This can be a feature (exactly-once) or a bug (legitimate duplicate messages) depending on your use case.
- There is no SQS equivalent outside AWS. If you build on SQS and later need to go multi-cloud, you are rewriting your messaging layer.
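The claim-check pattern from the 256KB point above is worth seeing concretely. In this hedged sketch a dict stands in for the S3 bucket and `send`/`receive` produce and consume the SQS message body; the function names and key scheme are illustrative, not an AWS API.

```python
import json
import uuid

BLOB_STORE = {}          # stand-in for an S3 bucket
LIMIT = 256 * 1024       # SQS hard cap on message body size (bytes)

def send(payload: bytes) -> str:
    """Build the SQS message body: inline if it fits, else store the
    payload out-of-band and send only a pointer (the 'claim check')."""
    if len(payload) <= LIMIT:
        return json.dumps({"inline": payload.decode()})
    key = f"payloads/{uuid.uuid4()}"      # would be an S3 object key
    BLOB_STORE[key] = payload
    return json.dumps({"s3_key": key})    # only the pointer travels via SQS

def receive(body: str) -> bytes:
    msg = json.loads(body)
    if "inline" in msg:
        return msg["inline"].encode()
    return BLOB_STORE[msg["s3_key"]]      # redeem the claim check

big = b"x" * (300 * 1024)                 # over the 256KB cap
assert receive(send(big)) == big          # round-trips via the blob store
```

One operational caveat the sketch glosses over: the out-of-band blobs need their own lifecycle policy, or the bucket accumulates orphaned payloads for messages that were long since consumed.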
Migration Pain Assessment¶
| From → To | Effort | Risk | Timeline |
|---|---|---|---|
| RabbitMQ → Kafka | High | High | 2-4 months |
| Kafka → RabbitMQ | High | High | 2-4 months |
| SQS → RabbitMQ | Medium | Medium | 1-2 months |
| SQS → Kafka | Medium-High | Medium | 2-3 months |
| RabbitMQ → NATS | Medium | Medium | 1-2 months |
| Any → SQS | Medium | Low | 1-2 months |
Messaging migration is high-risk because message loss during migration can cause data inconsistency. The safe pattern: run both systems in parallel, dual-publish to both, migrate consumers one by one, then decommission the old system. Never cut over atomically.
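The dual-publish step of that pattern is usually a thin wrapper at the producer side. A sketch (the `DualPublisher` class is invented for illustration): the old system stays the system of record until cutover, so a failure on the new broker must never block the old path.

```python
class DualPublisher:
    """Migration shim: publish to both the old and new brokers so
    consumers can move over one at a time. The old system remains the
    system of record; shadow writes to the new one are best-effort."""
    def __init__(self, old_publish, new_publish, on_error):
        self.old = old_publish
        self.new = new_publish
        self.on_error = on_error

    def publish(self, msg):
        self.old(msg)                  # must succeed: source of truth
        try:
            self.new(msg)              # best-effort during migration
        except Exception as exc:
            self.on_error(f"shadow publish failed: {exc}")

old_log, new_log, errors = [], [], []
def flaky_new(msg):
    if msg == "boom":
        raise RuntimeError("new broker down")
    new_log.append(msg)

pub = DualPublisher(old_log.append, flaky_new, errors.append)
pub.publish("order-1")
pub.publish("boom")    # new side fails; the old path still delivers
```

Monitor the error channel during the migration window: a sustained gap between the two logs tells you the new system is not ready for cutover.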
The Interview Answer¶
"The choice depends on the messaging pattern. SQS for simple work queues on AWS — zero ops, infinite scale. Kafka for event streaming where consumers need to replay events, build materialized views, or reprocess data. RabbitMQ for complex routing patterns where exchanges, bindings, and priority queues matter. NATS for lightweight, low-latency pub/sub. The key insight is that queues and event streams are fundamentally different patterns: queues are 'process this task,' streams are 'this thing happened.' Kafka is a stream; SQS and RabbitMQ are queues. Choosing the wrong pattern causes more pain than choosing the wrong tool."
Cross-References¶
- Topic Packs: Kafka, RabbitMQ, Message Queues
- Related Comparisons: Relational Databases, Caching