Kafka: The Log That Runs Everything
- lesson
- kafka-architecture
- commit-logs
- distributed-systems
- consumer-groups
- replication
- schema-registry
- kraft
- monitoring
- message-queues

# Kafka — The Log That Runs Everything

Topics: Kafka architecture, commit logs, distributed systems, consumer groups, replication, Schema Registry, KRaft, monitoring, message queues
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Strategy: Archaeological + incident-driven
Prerequisites: None (everything explained from scratch)
The Mission¶
It's 3:12 AM. PagerDuty wakes you. The alert reads:
CRITICAL: Consumer lag for group order-processor on topic order-events
exceeding 500,000 messages and growing. Duration: 23 minutes.
You pull up Grafana on your phone. The consumer lag graph is a cliff going up. Producers are publishing fine — 12,000 messages per second, steady. But the consumers? They're crawling. The gap between "messages written" and "messages read" is widening every second.
Downstream, the order confirmation emails stopped. The inventory service is stale. The analytics dashboard froze 20 minutes ago. Everything that depends on consuming from Kafka is silently falling behind.
You need to figure out what's happening, fix it, and understand enough about Kafka's architecture to prevent this from happening again. Let's start.
Part 1: What Kafka Actually Is (It's Not a Queue)¶
Before touching the terminal, you need one mental model. Get this right and everything else follows. Get it wrong and you'll fight Kafka for years.
Kafka is an append-only commit log.
Not a queue. Not a message bus. A log. Like the write-ahead log in PostgreSQL. Like a ledger where you can only add entries at the end, never edit or delete them.
Traditional queue (RabbitMQ, SQS): Kafka:
┌───────────────────────┐ ┌───────────────────────────────────┐
│ msg → [consumer] → ✓ │ │ [0] [1] [2] [3] [4] [5] [6] ... │
│ (message deleted)│ │ ↑ ↑ │
└───────────────────────┘ │ consumer A consumer B │
│ (each reads at their own pace)│
└───────────────────────────────────┘
In a traditional queue, a message is consumed and deleted. Gone. One consumer reads it, acknowledges it, and the broker throws it away.
In Kafka, messages stay. They sit in the log for hours, days, or years — however long retention is configured. Consumers don't "take" messages. They maintain a pointer (called an offset) into the log and read forward from there. Multiple consumer groups can each maintain their own pointer, replaying the same data independently.
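The pointer model can be sketched in a few lines. This is a toy in-memory illustration of the idea only — real Kafka persists the log to segment files and stores committed offsets in an internal topic:

```python
# Toy model: a partition is an append-only list; each consumer group keeps
# its own offset (pointer) into it. Messages are never removed by reading.
log = []  # the partition: offset i is simply index i

def produce(msg):
    log.append(msg)
    return len(log) - 1  # the offset assigned to this message

offsets = {"billing": 0, "analytics": 0}  # one independent pointer per group

def poll(group, max_records=100):
    start = offsets[group]
    records = log[start:start + max_records]
    offsets[group] += len(records)  # "commit" by advancing the pointer
    return records

for m in ["order-1", "order-2", "order-3"]:
    produce(m)

print(poll("billing", 2))    # ['order-1', 'order-2']
print(poll("analytics", 3))  # ['order-1', 'order-2', 'order-3'] — same data, own pace
print(log)                   # all three messages still in the log
```

Note that both groups read the same records independently, and nothing is deleted — exactly the newspaper-archive behavior described below.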
Mental Model: Think of Kafka as a newspaper archive, not a mailbox. A mailbox delivers once and the letter is gone. A newspaper archive lets anyone walk in, pick a date, and read from there forward. Different readers can be on different dates. The archive doesn't care — it just keeps the papers on the shelf until the retention policy says to shred them.
Name Origin: Jay Kreps, one of Kafka's creators at LinkedIn, named it after the Czech author Franz Kafka because "it is a system optimized for writing" and he liked Kafka's work. He later admitted the name has no deep technical significance — it just sounded good for a messaging system. The literary Kafka's work — bureaucratic labyrinths, inexplicable processes, overwhelming systems — is arguably a fitting metaphor for enterprise messaging infrastructure.
The Commit Log Mental Model¶
This "append-only, immutable, ordered" design is borrowed from database internals. PostgreSQL's WAL (write-ahead log), MySQL's binlog, and Kafka's topic partitions are all the same idea: an ordered sequence of immutable records.
Jay Kreps, Neha Narkhede, and Jun Rao designed Kafka at LinkedIn in 2010. Their insight: the database WAL is the most fundamental data structure in computing — it records "what happened, in order." They took that concept out of the database engine and made it a standalone distributed system. Kreps detailed this in his influential 2013 blog post "The Log: What every software engineer should know about real-time data's unifying abstraction."
This matters operationally because:
- Messages are never moved or modified — only appended and eventually expired
- Reads are sequential — consumers scan forward through ordered data on disk
- Disk is the storage tier — Kafka relies on the OS page cache and sequential I/O, not JVM heap
Under the Hood: Kafka achieves millions of messages per second because sequential disk I/O is fast — in some workloads faster even than random memory access. A single broker can push 2+ million messages per second (100-byte messages) by leveraging the OS page cache, zero-copy transfers via sendfile(), and batch compression. The JVM heap barely participates in data transfer.
Part 2: Architecture — Brokers, Topics, Partitions¶
Now that you have the mental model, let's map it to the real components.
The Cluster¶
A Kafka cluster is a set of brokers — JVM processes that store and serve data. Production clusters typically run 3–12 brokers, each on its own machine (or pod).
Kafka Cluster (3 brokers)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Broker 1 │ │ Broker 2 │ │ Broker 3 │
│ (ctrl) │ │ │ │ │
│ P0-leader │ │ P1-leader │ │ P2-leader │
│ P1-replica │ │ P2-replica │ │ P0-replica │
│ P2-replica │ │ P0-replica │ │ P1-replica │
└──────────────┘ └──────────────┘ └──────────────┘
Topics and Partitions¶
A topic is a named stream: order-events, user-signups, payment-results. Topics are split into partitions for parallelism. Each partition is an independent, ordered, append-only log.
Topic: order-events (3 partitions)
Partition 0: [0] [1] [2] [3] [4] [5] →
Partition 1: [0] [1] [2] [3] →
Partition 2: [0] [1] [2] [3] [4] [5] [6] [7] →
The numbers are offsets — sequential IDs within each partition. Offset 0 is the first message ever written to that partition. Offsets are never reset and never reused.
Critical rule: Messages are ordered within a partition, not across partitions. If you need all events for order-123 to arrive in order, use the order ID as the message key — Kafka hashes the key to pick a partition, so all messages with the same key land in the same partition.
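The routing rule is simple enough to sketch. Real Kafka clients hash the serialized key bytes with murmur2; this illustration substitutes crc32 as a stand-in stable hash (Python's built-in hash() is randomized per process, so it won't do):

```python
import zlib

# Sketch of key -> partition routing. crc32 stands in for the murmur2 hash
# that real Kafka producers use on the serialized key bytes.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

# Every message keyed by the same order ID lands in the same partition,
# so per-order ordering is preserved.
p1 = partition_for("order-123", 12)
p2 = partition_for("order-123", 12)
assert p1 == p2
```

The modulo over the partition count is also why changing the partition count later remaps keys — a point that comes back to bite in the war story in Part 11.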
# Describe a topic — see partitions, replicas, and ISR
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
--describe --topic order-events
Topic: order-events PartitionCount: 12 ReplicationFactor: 3
Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
...
Each line tells you: which broker is the leader (handles reads and writes), which brokers hold replicas, and which replicas are in-sync (ISR).
Segments: How Data Hits Disk¶
Each partition is stored as a sequence of segment files on the broker's filesystem:
-rw-r--r-- 1 kafka kafka 1.0G 00000000000000000000.log
-rw-r--r-- 1 kafka kafka 10M 00000000000000000000.index
-rw-r--r-- 1 kafka kafka 10M 00000000000000000000.timeindex
-rw-r--r-- 1 kafka kafka 536M 00000000000001048576.log ← active segment
-rw-r--r-- 1 kafka kafka 10M 00000000000001048576.index
The filename is the first offset in that segment. Older segments are immutable. Only the active segment receives new writes. Retention deletes entire segments, not individual messages.
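Because each filename is the base offset of its segment, finding the segment that holds a given offset is a sorted-list lookup. A minimal sketch of that mapping, using the two segments from the listing above:

```python
import bisect

# Locate which segment file holds a given offset: pick the last segment
# whose base offset is <= the target offset.
segment_bases = [0, 1048576]  # from the .log filenames in the listing

def segment_for(offset: int) -> str:
    i = bisect.bisect_right(segment_bases, offset) - 1  # last base <= offset
    return f"{segment_bases[i]:020d}.log"

print(segment_for(12345))    # 00000000000000000000.log
print(segment_for(2000000))  # 00000000000001048576.log
```

Brokers do essentially this (plus the .index files for the offset-within-segment lookup), which is why reads stay cheap even with terabytes on disk.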
Part 3: Back to the Incident — Diagnosing Consumer Lag¶
You're awake, your laptop is open, the lag is at 600,000 and climbing. Time to diagnose.
Step 1: How Bad Is It?¶
# Describe the consumer group — see lag per partition
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--describe --group order-processor
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
order-processor order-events 0 152340 152346 6 consumer-1-abc
order-processor order-events 1 121000 181000 60000 consumer-2-def
order-processor order-events 2 149990 150002 12 consumer-3-ghi
order-processor order-events 3 98500 180200 81700 -
order-processor order-events 4 165000 165050 50 consumer-4-jkl
order-processor order-events 5 110000 179800 69800 -
Two things jump out:
1. Partitions 1, 3, and 5 have massive lag — 60K, 82K, and 70K messages behind
2. Partitions 3 and 5 have no CONSUMER-ID — nobody is consuming them
That - in the consumer column is your problem. Those partitions are orphaned.
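At 3 AM it helps to script this triage. A minimal sketch that parses the describe output above (column layout as shown; the input string here is a trimmed sample, not live data):

```python
# Triage `kafka-consumer-groups.sh --describe` output: total lag plus any
# orphaned partitions (the ones showing "-" in the CONSUMER-ID column).
raw = """\
order-processor order-events 0 152340 152346 6 consumer-1-abc
order-processor order-events 1 121000 181000 60000 consumer-2-def
order-processor order-events 3 98500 180200 81700 -"""

total_lag, orphaned = 0, []
for line in raw.splitlines():
    group, topic, part, current, end, lag, consumer = line.split()
    total_lag += int(lag)
    if consumer == "-":
        orphaned.append(int(part))

print(total_lag)  # 141706
print(orphaned)   # [3]
```

Total lag trending up plus a non-empty orphan list is the signature of a group that is losing members faster than it can rebalance.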
Step 2: Check Group State¶
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--describe --group order-processor --state
The state comes back as CompletingRebalance. The consumer group is stuck in a rebalance loop. Consumers keep joining, getting partitions assigned, then something kicks them out, and the whole group starts over. During each rebalance cycle, all consumers stop processing.
Gotcha: Before KIP-429 (cooperative rebalancing, Kafka 2.4+), every rebalance was stop-the-world — ALL consumers in the group paused, even if only one consumer changed. If you see a group stuck in CompletingRebalance or PreparingRebalance, check your client's partition.assignment.strategy. Newer clients include CooperativeStickyAssignor in the default assignor list, but legacy configs may still pin RangeAssignor (an eager, stop-the-world strategy).
Step 3: Why Is It Rebalancing?¶
Common causes of rebalance loops:
1. A consumer takes too long to process a batch (exceeds max.poll.interval.ms, default 300 seconds) — the broker considers it dead
2. A consumer crashes, restarts, gets the same heavy partition, crashes again
3. Network flaps between consumers and the group coordinator broker
4. A deployment is rolling and consumers keep joining/leaving
Check the consumer application logs. In this case:
[2026-03-23 03:08:14] ERROR Consumer poll timeout expired. Leaving group.
[2026-03-23 03:08:14] WARN max.poll.interval.ms=300000 exceeded for partition order-events-1
There it is. One of the consumers is taking longer than 5 minutes to process a single batch. It gets kicked from the group, triggering a full rebalance. During the rebalance, all consumers pause. The slow consumer rejoins, gets a heavy partition again, exceeds the timeout again, and the cycle repeats.
Step 4: The Fix¶
Immediate fix — reduce the batch size so processing finishes within the poll interval:
# Consumer config — process fewer records per poll
max.poll.records=100 # was 500 (default)
max.poll.interval.ms=600000 # give more breathing room (10 min)
Restart the consumers. Watch the group state:
# Keep watching until state says "Stable"
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--describe --group order-processor --state
Stable. All 6 consumers are assigned partitions and draining the backlog. Lag starts dropping. Crisis averted.
Flashcard Check #1¶
| Question | Answer |
|---|---|
| What does "LAG" mean in kafka-consumer-groups.sh --describe? | The difference between the log-end-offset (latest message) and the consumer's current-offset (last committed). It's how many messages the consumer hasn't processed yet. |
| A consumer group state shows CompletingRebalance. What's happening? | Partitions are being redistributed among consumers. During this time, consumption is paused. If it's stuck in this state, a consumer is likely crash-looping or timing out. |
| You have a topic with 6 partitions and 8 consumers in a group. How many consumers are idle? | 2. Kafka assigns at most one consumer per partition per group. Extra consumers sit idle. |
| What's the default max.poll.interval.ms? | 300,000 ms (5 minutes). If a consumer doesn't call poll() within this window, the broker considers it dead and triggers a rebalance. |
Part 4: Producer Acknowledgments — The Durability Dial¶
Now that the incident is resolved, let's understand the system that made this architecture possible. Start with how messages get into Kafka in the first place.
When a producer sends a message, it waits for an acknowledgment. The acks setting controls how many brokers must confirm the write:
| acks | What happens | Durability | Latency |
|---|---|---|---|
| 0 | Producer doesn't wait. Fire and forget. | None. Messages can be lost silently. | Lowest |
| 1 | Leader broker writes to its log and acks. | Single broker. If the leader dies before replicating, the message is lost. | Medium |
| all | Leader waits for all in-sync replicas (ISR) to write. | Full. Message survives any single broker failure. | Highest |
Gotcha: Before Kafka 3.0, the producer default was acks=1. Since 3.0, it's acks=all. But if you're running an older client library or inherited a config template from 2019, you might still be on acks=1 without knowing it. Check explicitly: grep acks producer.properties.
The gold standard for production:
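A typical expression of that standard — note the keys live in different places (replication factor is fixed at topic creation, min.insync.replicas is a broker or topic config, acks is a producer config):

```properties
# Topic creation: --replication-factor 3
# Broker or topic-level config:
min.insync.replicas=2
# Producer config:
acks=all
```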
This means: 3 copies of every message, and the producer's write won't succeed unless at least 2 replicas confirm. You can lose one broker without losing a single message and without producers experiencing errors.
Part 5: Replication and ISR — Why Brokers Fall Out of Sync¶
Each partition has a leader and zero or more follower replicas. Followers continuously fetch data from the leader and append it to their own logs. The set of replicas that are caught up to the leader is the ISR (In-Sync Replicas).
# Check for under-replicated partitions — the most important health check
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
--describe --under-replicated-partitions
Healthy output: nothing. Any output means a replica has fallen behind.
Suppose the output shows that broker 2 dropped out of the ISR for partition 4. That means:
- If broker 1 dies right now, you lose data unless min.insync.replicas=2 prevents the write in the first place
- If unclean.leader.election.enable=true (don't do this), broker 2 could become leader and serve stale data
Why replicas fall behind:
- Broker disk is slow or full
- Network congestion between brokers
- JVM garbage collection pauses (GC "stop-the-world")
- Broker is overloaded with too many partition leaders
# Check disk on the lagging broker
ssh broker-2 'df -h /var/kafka-logs'
# Check GC pauses
ssh broker-2 'jstat -gcutil $(jps | grep Kafka | awk "{print \$1}") 1000 5'
| Config | Default | What it does |
|---|---|---|
| min.insync.replicas | 1 | Minimum ISR count for acks=all to succeed |
| unclean.leader.election.enable | false | If true, allows an out-of-sync replica to become leader (risks data loss) |
| replica.lag.time.max.ms | 30000 | Time before a slow replica is removed from ISR |
Remember: ISR shrinkage is a leading indicator. Alert on it immediately. The formula: if ISR count >= min.insync.replicas, writes succeed. If ISR count < min.insync.replicas, producers with acks=all get NotEnoughReplicasException. This is safe — it's telling you the cluster can't meet durability requirements. The dangerous scenario is when unclean.leader.election.enable is true and the ISR is empty.
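That rule is worth internalizing as a predicate — a trivial sketch of the write-availability check described above:

```python
# The ISR rule: an acks=all write succeeds only while the ISR size
# is at least min.insync.replicas.
def write_allowed(isr_count: int, min_insync_replicas: int = 2) -> bool:
    return isr_count >= min_insync_replicas

print(write_allowed(3))  # True  — full ISR of 3
print(write_allowed(2))  # True  — one replica behind, minimum still met
print(write_allowed(1))  # False — producers get NotEnoughReplicasException
```

With replication factor 3 and min.insync.replicas=2, you can lose exactly one replica and keep writing; lose two and Kafka correctly refuses writes rather than silently dropping durability.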
Part 6: Consumer Groups and Rebalancing — The Deeper Story¶
You fixed the incident by adjusting poll settings. But why does rebalancing exist at all, and why is it so disruptive?
How Consumer Groups Work¶
A consumer group is a set of consumers that cooperate to consume a topic. Kafka assigns each partition to exactly one consumer in the group:
Topic: order-events (6 partitions)
Consumer Group: order-processor (3 consumers)
consumer-1 → P0, P1
consumer-2 → P2, P3
consumer-3 → P4, P5
If consumer-2 dies, its partitions must go somewhere. That's a rebalance — the group coordinator (a designated broker) redistributes partitions among surviving consumers:
After consumer-2 dies:
consumer-1 → P0, P1, P2 ← picked up P2
consumer-3 → P3, P4, P5 ← picked up P3
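The redistribution step can be sketched as a simple assignment function. Real Kafka uses pluggable assignors (RangeAssignor, CooperativeStickyAssignor, etc.); this round-robin toy just shows the two invariants that matter — every partition gets exactly one consumer, and surplus consumers get nothing:

```python
# Toy round-robin partition assignment within a consumer group.
def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    out = {c: [] for c in consumers}
    for p in range(partitions):
        out[consumers[p % len(consumers)]].append(p)
    return out

print(assign(6, ["consumer-1", "consumer-2", "consumer-3"]))
# consumer-2 dies: the coordinator recomputes over the survivors
print(assign(6, ["consumer-1", "consumer-3"]))
# 8 consumers, 6 partitions: two consumers end up with nothing to do
print(assign(6, [f"c{i}" for i in range(8)]))
```

This is also the mechanical reason for the hard limit from Flashcard Check #1: at most one consumer per partition per group, so extra consumers idle.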
Why Rebalancing Causes Pauses¶
With the old eager rebalancing protocol (default before Kafka 2.4):
- A consumer joins or leaves the group
- All consumers stop processing and revoke all partition assignments
- The coordinator computes new assignments
- Consumers receive their new assignments and resume
That bold step is the problem. Even if only one partition needs to move, every consumer stops. In a group with 50 consumers, a single slow consumer can freeze the entire group.
Trivia: Consumer group rebalancing has been called "the worst part of Kafka" by the community. It motivated years of engineering work on KIP-429 (incremental cooperative rebalancing), which landed in Kafka 2.4 (2019). With cooperative rebalancing, only the consumers that need to give up or receive partitions are paused — the rest keep processing.
Mitigating Rebalance Pain¶
| Technique | How it helps |
|---|---|
| Cooperative rebalancing (CooperativeStickyAssignor) | Only affected consumers pause during rebalance |
| Static group membership (group.instance.id) | Consumer restarts don't trigger rebalance within session.timeout.ms |
| Increase session.timeout.ms | Tolerate brief consumer hiccups without triggering rebalance |
| Increase max.poll.interval.ms | Allow more processing time per batch before eviction |
| Decrease max.poll.records | Smaller batches finish faster, less likely to exceed poll interval |
# Production consumer config to minimize rebalance disruption
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
group.instance.id=order-processor-pod-3 # static membership
session.timeout.ms=45000
max.poll.interval.ms=600000
max.poll.records=200
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What's the difference between acks=1 and acks=all? | acks=1: leader confirms the write. acks=all: all in-sync replicas confirm. acks=all + min.insync.replicas=2 is the production standard. |
| A topic has Replicas: 1,2,3 and Isr: 1,3. What happened? | Broker 2 fell out of sync — it's not keeping up with the leader. Check disk I/O, network, and GC on broker 2. |
| What's the difference between eager and cooperative rebalancing? | Eager (legacy): all consumers stop, revoke all partitions, reassign everything. Cooperative (KIP-429, Kafka 2.4+): only affected partitions are revoked and reassigned. |
| What does group.instance.id do? | Enables static group membership. A consumer restart doesn't trigger a rebalance as long as it rejoins before session.timeout.ms expires. |
Part 7: KRaft — ZooKeeper Is Gone¶
If you learned Kafka before 2022, you learned about ZooKeeper — the separate distributed coordination service that Kafka depended on for metadata, controller election, and cluster membership. Every Kafka cluster needed a ZooKeeper ensemble alongside it.
That's over.
The ZooKeeper Problem¶
ZooKeeper worked, but it was a separate distributed system that had to be deployed, monitored, scaled, and debugged independently. It had its own failure modes, its own disk requirements, its own network partitioning behavior. Running Kafka meant running two distributed systems and hoping they agreed with each other.
KRaft: Kafka's Built-In Consensus¶
KIP-500 (proposed 2019) replaced ZooKeeper with KRaft (Kafka Raft) — a Raft-based
consensus protocol built directly into Kafka brokers. Instead of delegating metadata to
an external system, Kafka manages its own metadata in an internal topic
(__cluster_metadata).
Timeline:
- 2019: KIP-500 proposed
- 2022 (Kafka 3.3): KRaft declared production-ready
- 2023 (Kafka 3.5): ZooKeeper mode officially deprecated
- 2025 (Kafka 4.0): ZooKeeper support removed entirely
If you're setting up a new cluster today, use KRaft. There is no reason to use ZooKeeper.
# Check whether your cluster runs KRaft — a KRaft quorum answers this
kafka-metadata-quorum.sh --bootstrap-server kafka-1.prod:9092 \
  describe --status
# Inspect KRaft metadata from a snapshot on a controller's disk
kafka-metadata-shell.sh \
  --snapshot /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log
Under the Hood: In a KRaft cluster, some brokers are designated as controllers (typically 3, running the Raft quorum). Controllers manage metadata — topic creation, partition assignments, leader elections. Regular brokers only serve data. This is simpler than the ZooKeeper model, where the controller was a single elected broker that had to sync state between Kafka and ZooKeeper.
Part 8: Schema Registry and Data Contracts¶
Kafka stores bytes. It doesn't know or care whether those bytes are JSON, Avro, Protobuf, or a JPEG. This is a strength — Kafka is format-agnostic — and a trap.
Without schema enforcement, a producer can change its message format and break every consumer downstream. This is exactly what happens when someone adds a field, removes a field, or changes a type without telling anyone.
Schema Registry¶
Confluent Schema Registry (open-source, part of the Confluent Platform) sits alongside your Kafka cluster and manages schemas for topics:
Producer → Schema Registry (validate schema) → Kafka broker
↓
Consumer ← Schema Registry (fetch schema) ← Kafka broker
Supported formats:
- Avro — compact binary format with schema evolution. The original and most common.
- Protobuf — Google's format. Strong typing, code generation.
- JSON Schema — schema enforcement for JSON payloads. Familiar but verbose.
The Registry enforces compatibility rules — it rejects schema changes that would break consumers:
| Compatibility | What's allowed | What's blocked |
|---|---|---|
| BACKWARD | Remove fields, add optional fields | Add required fields |
| FORWARD | Add fields, remove optional fields | Remove required fields |
| FULL | Add/remove optional fields only | Any required field change |
| NONE | Anything goes | Nothing blocked (dangerous) |
Interview Bridge: "How do you handle schema evolution in Kafka?" is a common distributed systems interview question. The answer: Schema Registry with BACKWARD compatibility for consumer-driven evolution, or FORWARD for producer-driven. FULL is the safest but most restrictive.
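The BACKWARD rule from the table reduces to one check: a new schema may not introduce required fields that old data never carried. A toy sketch — the field model here (a name → required? map) is a deliberate simplification of real Avro/Protobuf schemas:

```python
# Toy BACKWARD compatibility check: new consumers must still be able to
# read old data, so every field the new schema *requires* must already
# have existed in the old schema. Optional adds and removals are fine.
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    return all(name in old_fields or not required
               for name, required in new_fields.items())

old = {"order_id": True, "amount": True}
print(backward_compatible(old, {**old, "coupon": False}))  # True  — optional add
print(backward_compatible(old, {**old, "tax": True}))      # False — required add
print(backward_compatible(old, {"order_id": True}))        # True  — field removed
```

Schema Registry performs the real version of this check on every schema registration and rejects the incompatible ones before a producer can publish with them.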
Part 9: Exactly-Once Semantics — The Myth and the Reality¶
The distributed systems community held for years that exactly-once message delivery was impossible — you could only get at-most-once or at-least-once. Kafka 0.11 (2017) introduced exactly-once semantics (EOS) and proved it was achievable, with caveats.
How It Works¶
Two components:
Idempotent producer (enable.idempotence=true): The producer assigns a sequence number
to each message. The broker deduplicates — if the same message arrives twice (due to a
retry), the broker discards the duplicate. This prevents duplicates from producer retries.
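The broker-side dedup is conceptually a per-producer sequence check. A minimal sketch of that idea (real brokers track sequences per producer-id *and* partition, with epochs for fencing — omitted here):

```python
# Sketch of idempotent-producer dedup: each (producer_id, sequence) is
# accepted once; a retransmitted duplicate is recognized and dropped.
log, last_seq = [], {}

def append(producer_id: int, seq: int, msg: str) -> bool:
    if last_seq.get(producer_id, -1) >= seq:
        return False  # already seen this sequence — duplicate from a retry
    last_seq[producer_id] = seq
    log.append(msg)
    return True

append(1, 0, "order-created")
append(1, 1, "order-paid")
append(1, 1, "order-paid")  # network retry of the same send
print(log)  # ['order-created', 'order-paid'] — no duplicate appended
```

This is why enabling idempotence makes producer retries safe: the retry carries the same sequence number, so the broker can tell "same message again" apart from "next message".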
Transactional writes (transactional.id): The producer wraps a batch of writes (to
multiple topics/partitions) in a transaction. Either all writes succeed or none do.
Consumers configured with isolation.level=read_committed only see committed messages.
# Producer — exactly-once config
enable.idempotence=true
transactional.id=order-processor-tx-1
acks=all
The Caveat¶
Gotcha: "Exactly-once" in Kafka is exactly-once within Kafka. The moment your consumer writes to a database, calls an API, or sends an email, you're back to at-least-once unless your side effect is idempotent. The Kafka docs themselves call this "effectively once." EOS guarantees that the read-process-write cycle within Kafka sees no duplicates and no losses. External side effects are your problem.
Cost: EOS adds ~20% throughput overhead due to transaction coordination. Use it for financial transactions, inventory adjustments, and billing. Don't use it for metrics or logs where occasional duplicates are harmless.
Part 10: Kafka Connect and the Integration Story¶
Kafka Connect is Kafka's integration framework — pre-built connectors that move data between Kafka and external systems without writing code.
Source connectors (data → Kafka):
PostgreSQL CDC → Kafka (Debezium connector)
S3 files → Kafka (S3 source connector)
MongoDB oplog → Kafka (MongoDB connector)
Sink connectors (Kafka → data):
Kafka → Elasticsearch (ES sink connector)
Kafka → S3 (S3 sink connector)
Kafka → PostgreSQL (JDBC sink connector)
The most important connector: Debezium — captures every row-level change from a database (inserts, updates, deletes) and streams them to Kafka topics. This is Change Data Capture (CDC) and it's how most organizations get data out of databases and into event streams without modifying application code.
Trivia: Debezium (pronounced "DEE-bee-zee-um") gets its name from "DBs" (databases) plus the "-ium" suffix of chemical elements. It was created at Red Hat and is the most popular CDC tool in the Kafka ecosystem.
Part 11: The Partition Count War Story¶
War Story: A team launched a new event-driven order pipeline with 3 partitions on their order-events topic. At launch, 3 consumers kept up fine — 500 messages/sec, no lag. Six months later, traffic had grown to 8,000 messages/sec. They scaled to 8 consumers, but only 3 were active — the hard limit of 3 partitions meant 5 consumers sat idle. Lag climbed to millions. They couldn't increase partitions because orders used the order ID as a message key: changing the partition count would break per-order ordering (the hash mapping changes). They had to create a new 24-partition topic, deploy a bridge consumer to re-key and forward messages from the old topic, migrate all downstream consumers to the new topic, and drain the old one. The migration took two weeks of engineering time. The lesson: partition count is a one-way door for keyed topics. Estimate your peak parallelism needs for the next 2 years and add 50%.
The sizing heuristic:
| Throughput | Partitions | Reasoning |
|---|---|---|
| < 1,000 msg/s | 6 | Room to grow, minimal overhead |
| 1,000–10,000 msg/s | 12–24 | Match expected consumer count |
| 10,000–100,000 msg/s | 24–64 | High parallelism, monitor broker memory |
| > 100,000 msg/s | 64+ | Consult Kafka docs; watch for metadata overhead |
Gotcha: You can increase partitions on a topic, but never decrease them. And increasing partitions on a keyed topic breaks key-based ordering — the hash function maps keys to different partitions after the change. Plan partition count at topic creation time.
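You can see the remapping directly: the same key hashed modulo 12 and modulo 24 usually lands in different partitions. As before, crc32 stands in for Kafka's murmur2 key hash in this illustration:

```python
import zlib

# Demonstrate the gotcha: increasing the partition count remaps keys,
# so messages for the same key start landing in a different partition.
def partition_for(key: str, n: int) -> int:
    return zlib.crc32(key.encode()) % n

keys = [f"order-{i}" for i in range(1000)]
moved = sum(partition_for(k, 12) != partition_for(k, 24) for k in keys)
print(f"{moved}/1000 keys map to a different partition after 12 -> 24")
```

Any key that moves breaks the "all events for this key are in one partition" guarantee for the data already written — which is exactly why the war-story team needed a full topic migration instead of a one-line partition increase.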
Part 12: Operational Monitoring — The Metrics That Matter¶
The Big Three¶
If you monitor nothing else, monitor these:
1. Consumer lag — messages produced but not yet consumed
# Per-partition lag
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--describe --group order-processor
# Total lag as a single number
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--describe --group order-processor | awk 'NR>1 {sum+=$6} END {print "Total lag:", sum}'
Prometheus query: kafka_consumergroup_lag{consumergroup="order-processor"}
2. Under-replicated partitions — replicas falling behind the leader
Healthy output is no output. Any line means a replica is behind.
3. Broker disk usage — Kafka crashes hard when disk fills
Alert Thresholds (Starting Point)¶
# Consumer lag
- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag > 10000
  for: 5m
  labels:
    severity: warning
- alert: KafkaConsumerLagCritical
  expr: kafka_consumergroup_lag > 100000
  for: 5m
  labels:
    severity: critical
# Under-replicated partitions
- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 5m
  labels:
    severity: critical
# Broker disk
- alert: KafkaBrokerDiskHigh
  expr: (1 - node_filesystem_avail_bytes{mountpoint="/var/kafka-logs"}
        / node_filesystem_size_bytes{mountpoint="/var/kafka-logs"}) > 0.7
  for: 10m
  labels:
    severity: warning
Disk Planning¶
Kafka brokers need disk — lots of it, and it needs to be fast.
Disk needed per broker =
(message_rate × avg_message_size × retention_hours × 3600)
× replication_factor / broker_count
Example:
10,000 msg/s × 1 KB × 168 hours × 3600
× 3 replicas / 3 brokers
= ~6 TB per broker for 7-day retention
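The formula above, evaluated for the example numbers (1 KB taken as 1024 bytes):

```python
# Disk per broker = rate * size * retention * replication / broker_count
msg_rate, msg_size = 10_000, 1024   # messages/sec, bytes per message
retention_s = 168 * 3600            # 7 days in seconds
replication, brokers = 3, 3

bytes_per_broker = msg_rate * msg_size * retention_s * replication / brokers
print(f"{bytes_per_broker / 1e12:.1f} TB per broker")  # 6.2 TB per broker
```

Worth noting: with replication factor equal to broker count, the replication and broker terms cancel — every broker holds one full copy of the topic.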
Use SSDs for the active segment (writes) and HDDs or tiered storage for older segments if available. Monitor disk I/O latency — slow disks cause ISR shrinkage, which causes under-replicated partitions, which eventually causes data unavailability.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What is KRaft and why does it matter? | KRaft (Kafka Raft) replaces ZooKeeper for metadata management. It's built into Kafka brokers, eliminating the need to run a separate ZooKeeper ensemble. Production-ready since Kafka 3.3 (2022). |
| What does "exactly-once semantics" actually guarantee in Kafka? | That the read-process-write cycle within Kafka produces no duplicates and no losses. External side effects (DB writes, API calls) are NOT covered — those must be idempotent. |
| Why can't you decrease partition count on a topic? | Existing data is spread across the current partitions. Removing partitions would orphan that data. You can increase (new messages use new hash mapping) but never decrease. |
| A topic has 3 partitions and you need 8x parallelism. What do you do? | Create a new topic with 24+ partitions. If the topic uses keyed messages, you need a migration: bridge consumer to re-key messages from old topic to new. You cannot safely increase partitions on keyed topics. |
Part 13: Kafka vs RabbitMQ vs Pulsar¶
You'll encounter all three. Here's when to reach for each.
| Dimension | Kafka | RabbitMQ | Pulsar |
|---|---|---|---|
| Model | Append-only log | Message queue with exchanges | Log + queue (unified) |
| Replay | Yes (consumers track offsets) | No (messages deleted after ack) | Yes (offsets like Kafka) |
| Ordering | Per partition | Per queue | Per partition |
| Throughput | 2M+ msg/s per broker | ~50K msg/s per node | 1M+ msg/s per broker |
| Routing | Partition key only | Exchanges, bindings, routing keys | Topic + subscription |
| Multi-tenancy | Weak (topic-level ACLs) | Moderate (vhosts) | Strong (tenants + namespaces) |
| Storage | Brokers are stateful | Brokers are stateful | Stateless brokers + BookKeeper |
| Protocol | Kafka protocol (custom binary, openly documented) | AMQP 0-9-1 | Pulsar protocol + AMQP/Kafka adapters |
| Best for | Event streaming, high throughput, fan-out | Task distribution, flexible routing | Multi-tenant, geo-replicated, unified model |
Choose Kafka when you need high-throughput event streaming, replay, and fan-out to multiple consumer groups.
Choose RabbitMQ when you need flexible message routing (topic/header-based), task distribution, and your throughput is under 50K msg/s.
Choose Pulsar when you need multi-tenancy, geographic replication, or independent scaling of compute (brokers) and storage (BookKeeper).
Trivia: LinkedIn processes over 7 trillion messages per day through Kafka. That's roughly 80 million messages per second sustained. This is why Kafka's design optimizes for throughput over routing flexibility — it was built for LinkedIn's activity stream firehose.
Debugging Cheat Sheet — The 3 AM Reference¶
Consumer lag growing¶
1. kafka-consumer-groups.sh --describe --group <name>
→ Which partitions have lag? Any unassigned (-)?
2. kafka-consumer-groups.sh --describe --group <name> --state
→ Is state Stable? Or stuck in rebalance?
3. Check consumer logs for:
- "max.poll.interval.ms exceeded" → reduce max.poll.records
- "CommitFailedException" → consumer was kicked during processing
- OOM / GC pauses → JVM tuning
4. If lag is uniform across partitions → consumer throughput issue
If lag is on specific partitions → check for hot keys or slow processing
Broker down¶
1. tail -100 /var/log/kafka/server.log | grep -i "error\|fatal"
2. df -h /var/kafka-logs → disk full?
3. jstat -gcutil <pid> 1000 5 → GC thrashing?
4. Check under-replicated partitions after broker returns
5. Watch ISR recovery: kafka-topics.sh --describe | grep -v "Isr:.*1,2,3"
Messages seem "lost"¶
1. Check producer acks setting (acks=all?)
2. Check min.insync.replicas (≥ 2?)
3. Check consumer auto.offset.reset (earliest vs latest?)
4. Check retention — did messages expire before consumption?
5. Check if consumer group offsets were reset
6. Check for rebalance that caused offset commit gap
Emergency offset reset¶
# ALWAYS dry-run first
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--group order-processor --topic order-events \
--reset-offsets --to-latest --dry-run
# Stop all consumers in the group FIRST, then execute
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
--group order-processor --topic order-events \
--reset-offsets --to-latest --execute
Gotcha: Never reset offsets on a running consumer group. Some consumers pick up new offsets, others retain old ones. You get both duplicates and gaps. Verify the group state is Empty before executing.
Cheat Sheet — Pin This to Your Wall¶

| Command | What it does |
|---|---|
| `kafka-topics.sh --describe --topic <t>` | Show partitions, leaders, replicas, ISR |
| `kafka-topics.sh --describe --under-replicated-partitions` | Find partitions with lagging replicas |
| `kafka-consumer-groups.sh --describe --group <g>` | Show per-partition lag and consumer assignments |
| `kafka-consumer-groups.sh --describe --group <g> --state` | Show group state (Stable, Rebalancing, Empty) |
| `kafka-console-consumer.sh --topic <t> --from-beginning --max-messages 5` | Peek at messages |
| `kafka-console-consumer.sh --topic <t> --partition 0 --offset 12345` | Read from a specific offset |
| `kafka-configs.sh --entity-type topics --entity-name <t> --describe` | Show topic-level config overrides |
| `du -sh /var/kafka-logs/<topic>-*` | Disk usage per partition |

| Concept | Key number |
|---|---|
| Max consumers per group | = partition count (hard limit) |
| Production replication factor | 3 (tolerates 1 broker failure) |
| Production `min.insync.replicas` | 2 (with `acks=all`) |
| Default `max.poll.interval.ms` | 300,000 ms (5 minutes) |
| Default `session.timeout.ms` | 45,000 ms (45 seconds, post-KIP-735) |
| Default `auto.offset.reset` | `latest` (skips existing messages) |
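These key numbers compose into a quick sanity check for a proposed topic design. A sketch (hypothetical helper, not a Kafka API) that flags violations of the table above:

```python
def check_topic_plan(partitions, replication_factor, min_insync_replicas,
                     consumer_count, acks="all"):
    """Return a list of problems with a proposed topic/consumer setup,
    based on the production key numbers in the cheat sheet."""
    problems = []
    if consumer_count > partitions:
        # Max consumers per group equals the partition count (hard limit).
        problems.append("consumers beyond partition count sit idle")
    if replication_factor < 3:
        problems.append("replication factor < 3 cannot tolerate a broker failure")
    if acks == "all" and min_insync_replicas < 2:
        problems.append("acks=all with min.insync.replicas < 2 still risks data loss")
    if min_insync_replicas >= replication_factor:
        # No slack: losing any single broker blocks all writes.
        problems.append("min.insync.replicas == replication factor blocks writes "
                        "when one broker is down")
    return problems

# Exercise 2's design (24 partitions, RF 3, min ISR 2, up to 15 consumers) passes:
ok = check_topic_plan(24, 3, 2, consumer_count=15)
```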
Exercises¶
Exercise 1: Read the Lag (2 minutes)¶
Given this output:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
order-proc order-events 0 50000 50010 10 c1
order-proc order-events 1 30000 90000 60000 c2
order-proc order-events 2 48000 48005 5 -
- Which partition has the most lag?
- Which partition has no consumer assigned?
- What's your first diagnostic step?
Answer
1. Partition 1 — 60,000 messages behind.
2. Partition 2 — the `-` in CONSUMER-ID means no consumer is assigned.
3. Check the group state (`--state`). An unassigned partition with a running group suggests a rebalance in progress or fewer consumers than partitions. Also check whether consumer c2 on partition 1 is struggling — 60K of lag suggests slow processing.

Exercise 2: Design a Topic (5 minutes)¶
You're building a payment processing system. Requirements: - 5,000 transactions/sec at peak - All transactions for the same account must be processed in order - Must survive any single broker failure with no data loss - Planning for 3x traffic growth over 2 years
What are your topic settings?
Answer
- **Partitions: 24** — 5K msg/s × 3x growth = 15K msg/s future. At ~1K msg/s per consumer, you need at most 15 consumers; 24 partitions gives headroom.
- **Replication factor: 3** — survives one broker failure.
- **min.insync.replicas: 2** — with `acks=all` on the producer, ensures at least 2 replicas confirm every write.
- **Partition key: account ID** — ensures per-account ordering.
- **Retention: 7 days** — adjust based on compliance requirements.

Exercise 3: Diagnose the Silent Failure (10 minutes)¶
A new consumer group analytics-pipeline starts consuming user-events. The team reports
"no data is coming through." The topic has 1 million existing messages and 500 msg/s of
new traffic. Consumers are running and show Stable state with zero lag.
What happened? How do you fix it?
Answer
The consumer has `auto.offset.reset=latest` (the default). Since this is a new group with no previously committed offsets, it started from the latest offset — skipping all 1 million existing messages. Zero lag is correct: it's consuming new messages as they arrive, but historical data was silently skipped.

Fix: stop all consumers in the group, reset its offsets with `--reset-offsets --to-earliest` (dry-run first, as in the Emergency Offset Reset section above), then restart the consumers. They'll reprocess from the beginning, so ensure the consumer application is idempotent (processing the same message twice produces the same result).

Prevention: set `auto.offset.reset=earliest` in the consumer config for any group that needs historical data. Document the choice.

Takeaways¶
- **Kafka is a commit log, not a queue.** Messages are append-only, immutable, and retained. Consumers maintain pointers (offsets) into the log. This single mental model explains partitions, replay, consumer groups, and retention.
- **Consumer lag is your primary health signal.** If lag is growing, something is wrong. If lag is zero but shouldn't be, check `auto.offset.reset`. Monitor lag per partition, not just in total.
- **Partition count is a one-way door.** You can increase it but never decrease it. For keyed topics, even increasing breaks ordering. Plan for 2 years of growth at topic creation.
- **Rebalances are the #1 source of consumer stalls.** Use cooperative rebalancing, static group membership, and generous poll intervals to minimize them.
- **`acks=all` + `min.insync.replicas=2` + replication factor 3 is the production durability standard.** Anything less risks data loss on broker failure.
- **KRaft replaced ZooKeeper.** New clusters should use KRaft exclusively. ZooKeeper support is removed as of Kafka 4.0.
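The "one-way door" takeaway is easiest to see in the partitioner's arithmetic: with a keyed producer, a record's partition is the hash of its key modulo the partition count, so changing the count can remap keys. A sketch using Python's `zlib.crc32` as a stand-in hash (the real Java client's default partitioner uses murmur2, but the remapping effect is the same):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for the default keyed partitioner: hash(key) mod partition count.
    # (crc32 is used here only to make the effect visible; Kafka's Java
    # client hashes keys with murmur2.)
    return zlib.crc32(key) % num_partitions

key = b"account-42"
before = partition_for(key, 12)   # partition while the topic has 12 partitions
after = partition_for(key, 24)    # partition after expanding to 24
# If before != after, old and new records for the same account now live on
# different partitions, so per-account ordering across the change is broken.
```

This is why keyed topics need their partition count chosen for future traffic up front: the modulus is baked into where every historical record already sits.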
Related Lessons¶
- When the Queue Backs Up — Deep dive into queue backpressure, dead-letter queues, and RabbitMQ-specific patterns
- Understanding Distributed Systems Without a PhD — Consensus, replication, and the CAP theorem
- The Split-Brain Nightmare — What happens when distributed nodes disagree
- etcd — The Database That Runs Kubernetes — Another Raft-based distributed system, same consensus principles
- Log Pipelines: From printf to Dashboard — Where Kafka often sits in observability pipelines