
Kafka: The Log That Runs Everything

  • lesson
  • kafka-architecture
  • commit-logs
  • distributed-systems
  • consumer-groups
  • replication
  • schema-registry
  • kraft
  • monitoring
  • message-queues

Topics: Kafka architecture, commit logs, distributed systems, consumer groups, replication, Schema Registry, KRaft, monitoring, message queues
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Strategy: Archaeological + incident-driven
Prerequisites: None (everything explained from scratch)


The Mission

It's 3:12 AM. PagerDuty wakes you. The alert reads:

CRITICAL: Consumer lag for group order-processor on topic order-events
exceeding 500,000 messages and growing. Duration: 23 minutes.

You pull up Grafana on your phone. The consumer lag graph is a near-vertical climb. Producers are publishing fine — 12,000 messages per second, steady. But the consumers? They're crawling. The gap between "messages written" and "messages read" is widening every second.

Downstream, the order confirmation emails stopped. The inventory service is stale. The analytics dashboard froze 20 minutes ago. Everything that depends on consuming from Kafka is silently falling behind.

You need to figure out what's happening, fix it, and understand enough about Kafka's architecture to prevent this from happening again. Let's start.


Part 1: What Kafka Actually Is (It's Not a Queue)

Before touching the terminal, you need one mental model. Get this right and everything else follows. Get it wrong and you'll fight Kafka for years.

Kafka is an append-only commit log.

Not a queue. Not a message bus. A log. Like the write-ahead log in PostgreSQL. Like a ledger where you can only add entries at the end, never edit or delete them.

Traditional queue (RabbitMQ, SQS):        Kafka:
┌───────────────────────┐                 ┌───────────────────────────────────┐
│ msg → [consumer] → ✓  │                 │ [0] [1] [2] [3] [4] [5] [6] ...   │
│   (message deleted)   │                 │              ↑           ↑        │
└───────────────────────┘                 │          consumer A   consumer B  │
                                          │   (each reads at their own pace)  │
                                          └───────────────────────────────────┘

In a traditional queue, a message is consumed and deleted. Gone. One consumer reads it, acknowledges it, and the broker throws it away.

In Kafka, messages stay. They sit in the log for hours, days, or years — however long retention is configured. Consumers don't "take" messages. They maintain a pointer (called an offset) into the log and read forward from there. Multiple consumer groups can each maintain their own pointer, replaying the same data independently.

Mental Model: Think of Kafka as a newspaper archive, not a mailbox. A mailbox delivers once and the letter is gone. A newspaper archive lets anyone walk in, pick a date, and read from there forward. Different readers can be on different dates. The archive doesn't care — it just keeps the papers on the shelf until the retention policy says to shred them.
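The pointer-into-a-log model can be sketched in a few lines of Python. This is a toy, not Kafka's implementation — the class name `CommitLog` and the service names are invented for illustration:

```python
class CommitLog:
    """Toy append-only log: records are only ever appended, never removed."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read_from(self, offset, max_records=10):
        """Read forward from an offset; the log itself is unchanged."""
        return self.records[offset:offset + max_records]


log = CommitLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

# Two independent readers, each holding its own offset pointer.
# Reading does not delete anything -- both see whatever they haven't read yet.
offsets = {"email-service": 0, "analytics": 2}
assert log.read_from(offsets["email-service"]) == ["order-1", "order-2", "order-3"]
assert log.read_from(offsets["analytics"]) == ["order-3"]
```

Note that the only mutable state per reader is a single integer offset — this is exactly what Kafka stores per consumer group.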

Name Origin: Jay Kreps, one of Kafka's creators at LinkedIn, named it after the Czech author Franz Kafka because "it is a system optimized for writing" and he liked Kafka's work. He later admitted the name has no deep technical significance — it just sounded good for a messaging system. The literary Kafka's work — bureaucratic labyrinths, inexplicable processes, overwhelming systems — is arguably a fitting metaphor for enterprise messaging infrastructure.

The Commit Log Mental Model

This "append-only, immutable, ordered" design is borrowed from database internals. PostgreSQL's WAL (write-ahead log), MySQL's binlog, and Kafka's topic partitions are all the same idea: an ordered sequence of immutable records.

Jay Kreps, Neha Narkhede, and Jun Rao designed Kafka at LinkedIn in 2010. Their insight: the database WAL is the most fundamental data structure in computing — it records "what happened, in order." They took that concept out of the database engine and made it a standalone distributed system. Kreps detailed this in his influential 2013 blog post "The Log: What every software engineer should know about real-time data's unifying abstraction."

This matters operationally because:

  • Messages are never moved or modified — only appended and eventually expired
  • Reads are sequential — consumers scan forward through ordered data on disk
  • Disk is the storage tier — Kafka relies on the OS page cache and sequential I/O, not JVM heap

Under the Hood: Kafka achieves millions of messages per second because sequential disk I/O is fast — often faster than random memory access. A single broker can push 2+ million messages per second (100-byte messages) by leveraging the OS page cache, zero-copy transfers via sendfile(), and batch compression. The JVM heap barely participates in data transfer.


Part 2: Architecture — Brokers, Topics, Partitions

Now that you have the mental model, let's map it to the real components.

The Cluster

A Kafka cluster is a set of brokers — JVM processes that store and serve data. Production clusters typically run 3–12 brokers, each on its own machine (or pod).

Kafka Cluster (3 brokers)

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Broker 1   │   │   Broker 2   │   │   Broker 3   │
│   (ctrl)     │   │              │   │              │
│  P0-leader   │   │  P1-leader   │   │  P2-leader   │
│  P1-replica  │   │  P2-replica  │   │  P0-replica  │
│  P2-replica  │   │  P0-replica  │   │  P1-replica  │
└──────────────┘   └──────────────┘   └──────────────┘

Topics and Partitions

A topic is a named stream. order-events, user-signups, payment-results. Topics are split into partitions for parallelism. Each partition is an independent, ordered, append-only log.

Topic: order-events (3 partitions)

Partition 0:  [0] [1] [2] [3] [4] [5] →
Partition 1:  [0] [1] [2] [3] →
Partition 2:  [0] [1] [2] [3] [4] [5] [6] [7] →

The numbers are offsets — sequential IDs within each partition. Offset 0 is the first message ever written to that partition. Offsets are never reset and never reused.

Critical rule: Messages are ordered within a partition, not across partitions. If you need all events for order-123 to arrive in order, use the order ID as the message key — Kafka hashes the key to pick a partition, so all messages with the same key land in the same partition.
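The same-key-same-partition property can be demonstrated with a small Python sketch. Kafka's default partitioner hashes the key bytes with murmur2; `crc32` stands in here purely to show the property, not Kafka's exact mapping:

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative key-based partitioner. Kafka's default uses murmur2
    on the key bytes; crc32 is a stand-in with the same shape:
    deterministic hash modulo partition count."""
    return zlib.crc32(key) % num_partitions


# Every message keyed by the same order ID lands in the same partition...
p = partition_for(b"order-123", 12)
assert all(partition_for(b"order-123", 12) == p for _ in range(100))

# ...but changing the partition count remaps most keys, which is why
# growing a keyed topic breaks per-key ordering (see Part 11).
moved = sum(partition_for(key, 12) != partition_for(key, 24)
            for key in (f"order-{i}".encode() for i in range(1000)))
assert moved > 0
```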

# Describe a topic — see partitions, replicas, and ISR
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --topic order-events
Topic: order-events  PartitionCount: 12  ReplicationFactor: 3
  Partition: 0   Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Partition: 1   Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Partition: 2   Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  ...

Each line tells you: which broker is the leader (handles reads and writes), which brokers hold replicas, and which replicas are in-sync (ISR).

Segments: How Data Hits Disk

Each partition is stored as a sequence of segment files on the broker's filesystem:

ls -lh /var/kafka-logs/order-events-0/
-rw-r--r-- 1 kafka kafka 1.0G 00000000000000000000.log
-rw-r--r-- 1 kafka kafka  10M 00000000000000000000.index
-rw-r--r-- 1 kafka kafka  10M 00000000000000000000.timeindex
-rw-r--r-- 1 kafka kafka 536M 00000000000001048576.log   ← active segment
-rw-r--r-- 1 kafka kafka  10M 00000000000001048576.index

The filename is the first offset in that segment. Older segments are immutable. Only the active segment receives new writes. Retention deletes entire segments, not individual messages.
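Because filenames are base offsets, finding the segment that holds a given offset is a binary search over sorted integers. A minimal sketch of that lookup (illustrative, not broker code):

```python
import bisect


def segment_for_offset(segment_base_offsets, offset):
    """Locate which segment file holds a given offset. Each segment's
    filename is the offset of its first record, so bisect over the
    sorted base offsets finds the rightmost segment starting at or
    before the requested offset."""
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    return segment_base_offsets[i]


# Matches the directory listing above: two segments, bases 0 and 1048576
bases = [0, 1048576]
assert segment_for_offset(bases, 12345) == 0            # old segment
assert segment_for_offset(bases, 1048576) == 1048576    # first offset of active segment
assert segment_for_offset(bases, 2_000_000) == 1048576  # active segment
```

This is also why retention is cheap: expiring old data is just unlinking whole immutable files, never rewriting a log.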


Part 3: Back to the Incident — Diagnosing Consumer Lag

You're awake, your laptop is open, the lag is at 600,000 and climbing. Time to diagnose.

Step 1: How Bad Is It?

# Describe the consumer group — see lag per partition
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --group order-processor
GROUP            TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
order-processor  order-events   0          152340          152346          6      consumer-1-abc
order-processor  order-events   1          121000          181000          60000  consumer-2-def
order-processor  order-events   2          149990          150002          12     consumer-3-ghi
order-processor  order-events   3          98500           180200          81700  -
order-processor  order-events   4          165000          165050          50     consumer-4-jkl
order-processor  order-events   5          110000          179800          69800  -

Two things jump out:

  1. Partitions 1, 3, and 5 have massive lag — 60K, 82K, and 70K messages behind
  2. Partitions 3 and 5 have no CONSUMER-ID — nobody is consuming them

That - in the consumer column is your problem. Those partitions are orphaned.
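This check is easy to script. A small Python sketch (the function name and threshold are invented for illustration) that parses the `--describe` output and flags both problems at once:

```python
def find_problems(describe_output, lag_threshold=10_000):
    """Parse `kafka-consumer-groups.sh --describe` text and flag
    orphaned partitions (CONSUMER-ID is '-') and partitions whose lag
    exceeds a threshold. Column positions match the output shown above."""
    orphaned, lagging = [], []
    for line in describe_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        partition, lag, consumer = int(cols[2]), int(cols[5]), cols[6]
        if consumer == "-":
            orphaned.append(partition)
        if lag > lag_threshold:
            lagging.append(partition)
    return orphaned, lagging


sample = """\
GROUP            TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
order-processor  order-events   1          121000          181000          60000  consumer-2-def
order-processor  order-events   3          98500           180200          81700  -
"""
orphaned, lagging = find_problems(sample)
assert orphaned == [3]
assert lagging == [1, 3]
```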

Step 2: Check Group State

kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --group order-processor --state
GROUP            COORDINATOR(ID)  STATE               #MEMBERS
order-processor  broker-2(2)      CompletingRebalance  4

CompletingRebalance. The consumer group is stuck in a rebalance loop. Consumers keep joining, getting partitions assigned, then something kicks them out, and the whole group starts over. During each rebalance cycle, all consumers stop processing.

Gotcha: Before KIP-429 (cooperative rebalancing, Kafka 2.4+), every rebalance was stop-the-world — ALL consumers in the group paused, even if only one consumer changed. If you see a group stuck in CompletingRebalance or PreparingRebalance, check your client's partition.assignment.strategy. The default changed to cooperative in newer clients, but legacy configs may still use RangeAssignor (the eager, stop-the-world strategy).

Step 3: Why Is It Rebalancing?

Common causes of rebalance loops:

  1. A consumer takes too long to process a batch (exceeds max.poll.interval.ms, default 300 seconds) — the broker considers it dead
  2. A consumer crashes, restarts, gets the same heavy partition, crashes again
  3. Network flaps between consumers and the group coordinator broker
  4. A deployment is rolling and consumers keep joining/leaving

Check the consumer application logs. In this case:

[2026-03-23 03:08:14] ERROR Consumer poll timeout expired. Leaving group.
[2026-03-23 03:08:14] WARN  max.poll.interval.ms=300000 exceeded for partition order-events-1

There it is. One of the consumers is taking longer than 5 minutes to process a single batch. It gets kicked from the group, triggering a full rebalance. During the rebalance, all consumers pause. The slow consumer rejoins, gets a heavy partition again, exceeds the timeout again, and the cycle repeats.

Step 4: The Fix

Immediate fix — reduce the batch size so processing finishes within the poll interval:

# Consumer config — process fewer records per poll
max.poll.records=100          # was 500 (default)
max.poll.interval.ms=600000   # give more breathing room (10 min)

Restart the consumers. Watch the group state:

# Keep watching until state says "Stable"
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --group order-processor --state
GROUP            COORDINATOR(ID)  STATE   #MEMBERS
order-processor  broker-2(2)      Stable  6

Stable. All 6 consumers are assigned partitions and draining the backlog. Lag starts dropping. Crisis averted.


Flashcard Check #1

| Question | Answer |
| --- | --- |
| What does "LAG" mean in kafka-consumer-groups.sh --describe? | The difference between the log-end-offset (latest message) and the consumer's current-offset (last committed). It's how many messages the consumer hasn't processed yet. |
| A consumer group state shows CompletingRebalance. What's happening? | Partitions are being redistributed among consumers. During this time, consumption is paused. If it's stuck in this state, a consumer is likely crash-looping or timing out. |
| You have a topic with 6 partitions and 8 consumers in a group. How many consumers are idle? | 2. Kafka assigns at most one consumer per partition per group. Extra consumers sit idle. |
| What's the default max.poll.interval.ms? | 300,000 ms (5 minutes). If a consumer doesn't call poll() within this window, the broker considers it dead and triggers a rebalance. |

Part 4: Producer Acknowledgments — The Durability Dial

Now that the incident is resolved, let's understand the system that made this architecture possible. Start with how messages get into Kafka in the first place.

When a producer sends a message, it waits for an acknowledgment. The acks setting controls how many brokers must confirm the write:

| acks | What happens | Durability | Latency |
| --- | --- | --- | --- |
| 0 | Producer doesn't wait. Fire and forget. | None. Messages can be lost silently. | Lowest |
| 1 | Leader broker writes to its log and acks. | Single broker. If the leader dies before replicating, the message is lost. | Medium |
| all | Leader waits for all in-sync replicas (ISR) to write. | Full. Message survives any single broker failure. | Highest |

Gotcha: Before Kafka 3.0, the producer default was acks=1. Since 3.0, it's acks=all. But if you're running an older client library or inherited a config template from 2019, you might still be on acks=1 without knowing it. Check explicitly: grep acks producer.properties.

The gold standard for production:

acks=all
min.insync.replicas=2     # set on the topic or broker
replication.factor=3

This means: 3 copies of every message, and the producer's write won't succeed unless at least 2 replicas confirm. You can lose one broker without losing a single message and without producers experiencing errors.


Part 5: Replication and ISR — Why Brokers Fall Out of Sync

Each partition has a leader and zero or more follower replicas. Followers continuously fetch data from the leader and append it to their own logs. The set of replicas that are caught up to the leader is the ISR (In-Sync Replicas).

# Check for under-replicated partitions — the most important health check
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --under-replicated-partitions

Healthy output: nothing. Any output means a replica has fallen behind.

Topic: order-events  Partition: 4  Leader: 1  Replicas: 1,2,3  Isr: 1,3

Broker 2 dropped out of the ISR for partition 4. That means:

  • If broker 1 dies right now, you lose data unless min.insync.replicas=2 prevents the write in the first place
  • If unclean.leader.election.enable=true (don't do this), broker 2 could become leader and serve stale data

Why replicas fall behind:

  • Broker disk is slow or full
  • Network congestion between brokers
  • JVM garbage collection pauses (GC "stop-the-world")
  • Broker is overloaded with too many partition leaders

# Check disk on the lagging broker
ssh broker-2 'df -h /var/kafka-logs'

# Check GC pauses
ssh broker-2 'jstat -gcutil $(jps | grep Kafka | awk "{print \$1}") 1000 5'

| Config | Default | What it does |
| --- | --- | --- |
| min.insync.replicas | 1 | Minimum ISR count for acks=all to succeed |
| unclean.leader.election.enable | false | If true, allows an out-of-sync replica to become leader (risks data loss) |
| replica.lag.time.max.ms | 30000 | Time before a slow replica is removed from the ISR |

Remember: ISR shrinkage is a leading indicator. Alert on it immediately. The formula: if ISR count >= min.insync.replicas, writes succeed. If ISR count < min.insync.replicas, producers with acks=all get NotEnoughReplicasException. This is safe — it's telling you the cluster can't meet durability requirements. The dangerous scenario is when unclean.leader.election is enabled and ISR is empty.
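That admission rule is simple enough to state as code. A sketch (function name invented) of the acks=all write decision:

```python
def write_accepted(isr_count: int, min_insync_replicas: int) -> bool:
    """acks=all admission rule: a produce request succeeds only while
    enough replicas are in sync. Below the threshold the broker rejects
    the write (producers see NotEnoughReplicasException) rather than
    accept data it cannot make durable."""
    return isr_count >= min_insync_replicas


# replication.factor=3, min.insync.replicas=2:
assert write_accepted(3, 2)      # all replicas healthy
assert write_accepted(2, 2)      # one broker down -- still writable
assert not write_accepted(1, 2)  # two brokers down -- producers get errors
```

The third case is the safety property: Kafka would rather refuse writes than silently hold a single unreplicated copy.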


Part 6: Consumer Groups and Rebalancing — The Deeper Story

You fixed the incident by adjusting poll settings. But why does rebalancing exist at all, and why is it so disruptive?

How Consumer Groups Work

A consumer group is a set of consumers that cooperate to consume a topic. Kafka assigns each partition to exactly one consumer in the group:

Topic: order-events (6 partitions)

Consumer Group: order-processor (3 consumers)
  consumer-1  →  P0, P1
  consumer-2  →  P2, P3
  consumer-3  →  P4, P5

If consumer-2 dies, its partitions must go somewhere. That's a rebalance — the group coordinator (a designated broker) redistributes partitions among surviving consumers:

After consumer-2 dies:
  consumer-1  →  P0, P1, P2     ← picked up P2
  consumer-3  →  P3, P4, P5     ← picked up P3

Why Rebalancing Causes Pauses

With the old eager rebalancing protocol (default before Kafka 2.4):

  1. A consumer joins or leaves the group
  2. All consumers stop processing and revoke all partition assignments
  3. The coordinator computes new assignments
  4. Consumers receive their new assignments and resume

That bold step is the problem. Even if only one partition needs to move, every consumer stops. In a group with 50 consumers, a single slow consumer can freeze the entire group.

Trivia: Consumer group rebalancing has been called "the worst part of Kafka" by the community. It motivated years of engineering work on KIP-429 (incremental cooperative rebalancing), which landed in Kafka 2.4 (2019). With cooperative rebalancing, only the consumers that need to give up or receive partitions are paused — the rest keep processing.

Mitigating Rebalance Pain

| Technique | How it helps |
| --- | --- |
| Cooperative rebalancing (CooperativeStickyAssignor) | Only affected consumers pause during rebalance |
| Static group membership (group.instance.id) | Consumer restarts don't trigger rebalance within session.timeout.ms |
| Increase session.timeout.ms | Tolerate brief consumer hiccups without triggering rebalance |
| Increase max.poll.interval.ms | Allow more processing time per batch before eviction |
| Decrease max.poll.records | Smaller batches finish faster, less likely to exceed poll interval |

# Production consumer config to minimize rebalance disruption
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
group.instance.id=order-processor-pod-3    # static membership
session.timeout.ms=45000
max.poll.interval.ms=600000
max.poll.records=200

Flashcard Check #2

| Question | Answer |
| --- | --- |
| What's the difference between acks=1 and acks=all? | acks=1: leader confirms the write. acks=all: all in-sync replicas confirm. acks=all + min.insync.replicas=2 is the production standard. |
| A topic has Replicas: 1,2,3 and Isr: 1,3. What happened? | Broker 2 fell out of sync — it's not keeping up with the leader. Check disk I/O, network, and GC on broker 2. |
| What's the difference between eager and cooperative rebalancing? | Eager (legacy): all consumers stop, revoke all partitions, reassign everything. Cooperative (KIP-429, Kafka 2.4+): only affected partitions are revoked and reassigned. |
| What does group.instance.id do? | Enables static group membership. A consumer restart doesn't trigger a rebalance as long as it rejoins before session.timeout.ms expires. |

Part 7: KRaft — ZooKeeper Is Gone

If you learned Kafka before 2022, you learned about ZooKeeper — the separate distributed coordination service that Kafka depended on for metadata, controller election, and cluster membership. Every Kafka cluster needed a ZooKeeper ensemble alongside it.

That's over.

The ZooKeeper Problem

ZooKeeper worked, but it was a separate distributed system that had to be deployed, monitored, scaled, and debugged independently. It had its own failure modes, its own disk requirements, its own network partitioning behavior. Running Kafka meant running two distributed systems and hoping they agreed with each other.

KRaft: Kafka's Built-In Consensus

KIP-500 (proposed 2019) replaced ZooKeeper with KRaft (Kafka Raft) — a Raft-based consensus protocol built directly into Kafka brokers. Instead of delegating metadata to an external system, Kafka manages its own metadata in an internal topic (__cluster_metadata).

Timeline:

  • 2019: KIP-500 proposed
  • 2022 (Kafka 3.3): KRaft declared production-ready
  • 2023 (Kafka 3.5): ZooKeeper mode officially deprecated
  • 2025 (Kafka 4.0): ZooKeeper support removed entirely

If you're setting up a new cluster today, use KRaft. There is no reason to use ZooKeeper.

# Check if your cluster uses KRaft or ZooKeeper
# KRaft clusters have a __cluster_metadata topic
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --topic __cluster_metadata

# KRaft metadata inspection
kafka-metadata.sh --snapshot /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
  --broker-info

Under the Hood: In a KRaft cluster, some brokers are designated as controllers (typically 3, running the Raft quorum). Controllers manage metadata — topic creation, partition assignments, leader elections. Regular brokers only serve data. This is simpler than the ZooKeeper model, where the controller was a single elected broker that had to sync state between Kafka and ZooKeeper.


Part 8: Schema Registry and Data Contracts

Kafka stores bytes. It doesn't know or care whether those bytes are JSON, Avro, Protobuf, or a JPEG. This is a strength — Kafka is format-agnostic — and a trap.

Without schema enforcement, a producer can change its message format and break every consumer downstream. This is exactly what happens when someone adds a field, removes a field, or changes a type without telling anyone.

Schema Registry

Confluent Schema Registry (open-source, part of the Confluent Platform) sits alongside your Kafka cluster and manages schemas for topics:

Producer → Schema Registry (validate schema) → Kafka broker
Consumer ← Schema Registry (fetch schema)    ← Kafka broker

Supported formats:

  • Avro — compact binary format with schema evolution. The original and most common.
  • Protobuf — Google's format. Strong typing, code generation.
  • JSON Schema — schema enforcement for JSON payloads. Familiar but verbose.

The Registry enforces compatibility rules — it rejects schema changes that would break consumers:

| Compatibility | What's allowed | What's blocked |
| --- | --- | --- |
| BACKWARD | Remove fields, add optional fields | Add required fields |
| FORWARD | Add fields, remove optional fields | Remove required fields |
| FULL | Add/remove optional fields only | Any required field change |
| NONE | Anything goes | Nothing blocked (dangerous) |

Interview Bridge: "How do you handle schema evolution in Kafka?" is a common distributed systems interview question. The answer: Schema Registry with BACKWARD compatibility for consumer-driven evolution, or FORWARD for producer-driven. FULL is the safest but most restrictive.
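The BACKWARD rule from the table can be captured in a toy checker. Schemas here are plain dicts mapping field name to a required flag, not real Avro, and the function name is invented:

```python
def backward_compatible(old_schema, new_schema):
    """BACKWARD compatibility sketch: consumers on the new schema must
    still be able to read data written with the old one. Any field the
    new schema *adds* must therefore be optional, since old records
    won't carry it. Removing fields is fine (the reader just ignores
    the extra data in old records)."""
    added = set(new_schema) - set(old_schema)
    return all(not new_schema[field] for field in added)  # no required additions


old = {"order_id": True, "amount": True}  # True = required

assert backward_compatible(old, {"order_id": True, "amount": True,
                                 "coupon": False})       # optional add: allowed
assert not backward_compatible(old, {"order_id": True, "amount": True,
                                     "currency": True})  # required add: blocked
assert backward_compatible(old, {"order_id": True})      # field removal: allowed
```

Real registries also check type changes, defaults, and (for transitive modes) the whole version history, but the required-field rule above is the core of it.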


Part 9: Exactly-Once Semantics — The Myth and the Reality

The distributed systems community held for years that exactly-once message delivery was impossible — you could only get at-most-once or at-least-once. Kafka 0.11 (2017) introduced exactly-once semantics (EOS) and proved it was achievable, with caveats.

How It Works

Two components:

Idempotent producer (enable.idempotence=true): The producer assigns a sequence number to each message. The broker deduplicates — if the same message arrives twice (due to a retry), the broker discards the duplicate. This prevents duplicates from producer retries.
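The broker-side dedup can be sketched as sequence-number bookkeeping per producer. This is illustrative (Kafka tracks sequences per producer per partition with epochs; the class name here is invented):

```python
class IdempotentPartitionLog:
    """Broker-side dedup sketch: the producer stamps each record with a
    (producer_id, sequence) pair, and the broker drops anything whose
    sequence it has already appended for that producer."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, record):
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate from a retry -- silently discarded
        self.log.append(record)
        self.last_seq[producer_id] = seq
        return True


log = IdempotentPartitionLog()
assert log.append("p1", 0, "order-created")
assert not log.append("p1", 0, "order-created")  # network retry, deduped
assert log.append("p1", 1, "order-paid")
assert log.log == ["order-created", "order-paid"]
```

The point: the producer can retry aggressively on timeouts without ever creating duplicates in the partition, which is the precondition for the transactional layer below.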

Transactional writes (transactional.id): The producer wraps a batch of writes (to multiple topics/partitions) in a transaction. Either all writes succeed or none do. Consumers configured with isolation.level=read_committed only see committed messages.

# Producer — exactly-once config
enable.idempotence=true
transactional.id=order-processor-tx-1
acks=all

The Caveat

Gotcha: "Exactly-once" in Kafka is exactly-once within Kafka. The moment your consumer writes to a database, calls an API, or sends an email, you're back to at-least-once unless your side effect is idempotent. The Kafka docs themselves call this "effectively once." EOS guarantees that the read-process-write cycle within Kafka sees no duplicates and no losses. External side effects are your problem.

Cost: EOS adds ~20% throughput overhead due to transaction coordination. Use it for financial transactions, inventory adjustments, and billing. Don't use it for metrics or logs where occasional duplicates are harmless.


Part 10: Kafka Connect and the Integration Story

Kafka Connect is Kafka's integration framework — pre-built connectors that move data between Kafka and external systems without writing code.

Source connectors (data → Kafka):
  PostgreSQL CDC → Kafka  (Debezium connector)
  S3 files      → Kafka  (S3 source connector)
  MongoDB oplog → Kafka  (MongoDB connector)

Sink connectors (Kafka → data):
  Kafka → Elasticsearch  (ES sink connector)
  Kafka → S3             (S3 sink connector)
  Kafka → PostgreSQL     (JDBC sink connector)

The most important connector: Debezium — captures every row-level change from a database (inserts, updates, deletes) and streams them to Kafka topics. This is Change Data Capture (CDC) and it's how most organizations get data out of databases and into event streams without modifying application code.

Trivia: Debezium (pronounced "dee-BEE-zee-uhm") gets its name from "DB" plus "-ium", the suffix of many chemical elements. It was created at Red Hat and is the most popular CDC tool in the Kafka ecosystem.


Part 11: The Partition Count War Story

War Story: A team launched a new event-driven order pipeline with 3 partitions on their order-events topic. At launch, 3 consumers kept up fine — 500 messages/sec, no lag. Six months later, traffic had grown to 8,000 messages/sec. They scaled to 8 consumers, but only 3 were active — the hard limit of 3 partitions meant 5 consumers sat idle. Lag climbed to millions. They couldn't increase partitions because orders used the order ID as a message key: changing the partition count would break per-order ordering (the hash mapping changes). They had to create a new 24-partition topic, deploy a bridge consumer to re-key and forward messages from the old topic, migrate all downstream consumers to the new topic, and drain the old one. The migration took two weeks of engineering time. The lesson: partition count is a one-way door for keyed topics. Estimate your peak parallelism needs for the next 2 years and add 50%.

The sizing heuristic:

| Throughput | Partitions | Reasoning |
| --- | --- | --- |
| < 1,000 msg/s | 6 | Room to grow, minimal overhead |
| 1,000–10,000 msg/s | 12–24 | Match expected consumer count |
| 10,000–100,000 msg/s | 24–64 | High parallelism, monitor broker memory |
| > 100,000 msg/s | 64+ | Consult Kafka docs; watch for metadata overhead |

Gotcha: You can increase partitions on a topic, but never decrease them. And increasing partitions on a keyed topic breaks key-based ordering — the hash function maps keys to different partitions after the change. Plan partition count at topic creation time.


Part 12: Operational Monitoring — The Metrics That Matter

The Big Three

If you monitor nothing else, monitor these:

1. Consumer lag — messages produced but not yet consumed

# Per-partition lag
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --group order-processor

# Total lag as a single number
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --group order-processor | awk 'NR>1 {sum+=$6} END {print "Total lag:", sum}'

Prometheus query: kafka_consumergroup_lag{consumergroup="order-processor"}

2. Under-replicated partitions — replicas falling behind the leader

kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
  --describe --under-replicated-partitions

Healthy output is no output. Any line means a replica is behind.

3. Broker disk usage — Kafka crashes hard when disk fills

df -h /var/kafka-logs
du -sh /var/kafka-logs/order-events-* | sort -rh | head -5

Alert Thresholds (Starting Point)

# Consumer lag
- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag > 10000
  for: 5m
  severity: warning

- alert: KafkaConsumerLagCritical
  expr: kafka_consumergroup_lag > 100000
  for: 5m
  severity: critical

# Under-replicated partitions
- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 5m
  severity: critical

# Broker disk
- alert: KafkaBrokerDiskHigh
  expr: (1 - node_filesystem_avail_bytes{mountpoint="/var/kafka-logs"}
         / node_filesystem_size_bytes{mountpoint="/var/kafka-logs"}) > 0.7
  for: 10m
  severity: warning

Disk Planning

Kafka brokers need disk — lots of it, and it needs to be fast.

Disk needed per broker =
  (message_rate × avg_message_size × retention_hours × 3600)
  × replication_factor / broker_count

Example:
  10,000 msg/s × 1 KB × 168 hours × 3600
  × 3 replicas / 3 brokers
  = ~6 TB per broker for 7-day retention

Use SSDs for the active segment (writes) and HDDs or tiered storage for older segments if available. Monitor disk I/O latency — slow disks cause ISR shrinkage, which causes under-replicated partitions, which eventually causes data unavailability.
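The retention formula above is easy to turn into a helper. A sketch (function name invented, using 1 KB = 1000 bytes so the result lands on round decimal terabytes):

```python
def disk_per_broker_tb(msg_rate, avg_msg_kb, retention_hours,
                       replication_factor, broker_count):
    """Disk needed per broker, in decimal TB, per the sizing formula:
    (rate x size x retention seconds) x replication / brokers."""
    total_bytes = (msg_rate * avg_msg_kb * 1000
                   * retention_hours * 3600 * replication_factor)
    return total_bytes / broker_count / 1e12


# The worked example: 10k msg/s, 1 KB messages, 7-day retention, RF 3, 3 brokers
assert round(disk_per_broker_tb(10_000, 1, 168, 3, 3), 1) == 6.0  # ~6 TB per broker
```

In practice, add headroom (30 to 50 percent) on top of this number: retention deletes whole segments, compaction needs scratch space, and a full disk takes the broker down.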


Flashcard Check #3

| Question | Answer |
| --- | --- |
| What is KRaft and why does it matter? | KRaft (Kafka Raft) replaces ZooKeeper for metadata management. It's built into Kafka brokers, eliminating the need to run a separate ZooKeeper ensemble. Production-ready since Kafka 3.3 (2022). |
| What does "exactly-once semantics" actually guarantee in Kafka? | That the read-process-write cycle within Kafka produces no duplicates and no losses. External side effects (DB writes, API calls) are NOT covered — those must be idempotent. |
| Why can't you decrease partition count on a topic? | Existing data is spread across the current partitions. Removing partitions would orphan that data. You can increase (new messages use new hash mapping) but never decrease. |
| A topic has 3 partitions and you need 8x parallelism. What do you do? | Create a new topic with 24+ partitions. If the topic uses keyed messages, you need a migration: bridge consumer to re-key messages from old topic to new. You cannot safely increase partitions on keyed topics. |

Part 13: Kafka vs RabbitMQ vs Pulsar

You'll encounter all three. Here's when to reach for each.

| Dimension | Kafka | RabbitMQ | Pulsar |
| --- | --- | --- | --- |
| Model | Append-only log | Message queue with exchanges | Log + queue (unified) |
| Replay | Yes (consumers track offsets) | No (messages deleted after ack) | Yes (offsets like Kafka) |
| Ordering | Per partition | Per queue | Per partition |
| Throughput | 2M+ msg/s per broker | ~50K msg/s per node | 1M+ msg/s per broker |
| Routing | Partition key only | Exchanges, bindings, routing keys | Topic + subscription |
| Multi-tenancy | Weak (topic-level ACLs) | Moderate (vhosts) | Strong (tenants + namespaces) |
| Storage | Brokers are stateful | Brokers are stateful | Stateless brokers + BookKeeper |
| Protocol | Kafka protocol (custom binary, openly documented) | AMQP 0-9-1 | Pulsar protocol + AMQP/Kafka adapters |
| Best for | Event streaming, high throughput, fan-out | Task distribution, flexible routing | Multi-tenant, geo-replicated, unified model |

Choose Kafka when you need high-throughput event streaming, replay, and fan-out to multiple consumer groups.

Choose RabbitMQ when you need flexible message routing (topic/header-based), task distribution, and your throughput is under 50K msg/s.

Choose Pulsar when you need multi-tenancy, geographic replication, or independent scaling of compute (brokers) and storage (BookKeeper).

Trivia: LinkedIn processes over 7 trillion messages per day through Kafka. That's roughly 80 million messages per second sustained. This is why Kafka's design optimizes for throughput over routing flexibility — it was built for LinkedIn's activity stream firehose.


Debugging Cheat Sheet — The 3 AM Reference

Consumer lag growing

1. kafka-consumer-groups.sh --describe --group <name>
   → Which partitions have lag? Any unassigned (-)?

2. kafka-consumer-groups.sh --describe --group <name> --state
   → Is state Stable? Or stuck in rebalance?

3. Check consumer logs for:
   - "max.poll.interval.ms exceeded" → reduce max.poll.records
   - "CommitFailedException" → consumer was kicked during processing
   - OOM / GC pauses → JVM tuning

4. If lag is uniform across partitions → consumer throughput issue
   If lag is on specific partitions → check for hot keys or slow processing

Broker down

1. tail -100 /var/log/kafka/server.log | grep -i "error\|fatal"
2. df -h /var/kafka-logs               → disk full?
3. jstat -gcutil <pid> 1000 5          → GC thrashing?
4. Check under-replicated partitions after broker returns
5. Watch ISR recovery: kafka-topics.sh --describe --under-replicated-partitions
   → empty output means all replicas have rejoined the ISR

Messages seem "lost"

1. Check producer acks setting (acks=all?)
2. Check min.insync.replicas (≥ 2?)
3. Check consumer auto.offset.reset (earliest vs latest?)
4. Check retention — did messages expire before consumption?
5. Check if consumer group offsets were reset
6. Check for rebalance that caused offset commit gap
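A durability-first configuration that rules out causes 1 and 2 looks like the following. These are standard Kafka config keys; the values shown are the usual production baseline, not settings from this incident.

```properties
# producer.properties — make the producer wait for all in-sync replicas
acks=all
enable.idempotence=true

# topic-level override — a write needs 2 in-sync replicas to succeed,
# so acks=all cannot silently degrade to "leader only"
min.insync.replicas=2
```

With `acks=1` or `min.insync.replicas=1`, a write acknowledged by the leader alone can vanish if that broker dies before replication — which reads, from the outside, exactly like "lost" messages.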

Emergency offset reset

# ALWAYS dry-run first
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --dry-run

# Stop all consumers in the group FIRST, then execute
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --execute

Gotcha: Never reset offsets on a running consumer group. Some consumers pick up new offsets, others retain old ones. You get both duplicates and gaps. Verify the group state is Empty before executing.


Cheat Sheet — Pin This to Your Wall

| Command | What it does |
|---|---|
| `kafka-topics.sh --describe --topic <t>` | Show partitions, leaders, replicas, ISR |
| `kafka-topics.sh --describe --under-replicated-partitions` | Find partitions with lagging replicas |
| `kafka-consumer-groups.sh --describe --group <g>` | Show per-partition lag and consumer assignments |
| `kafka-consumer-groups.sh --describe --group <g> --state` | Show group state (Stable, Rebalancing, Empty) |
| `kafka-console-consumer.sh --topic <t> --from-beginning --max-messages 5` | Peek at messages |
| `kafka-console-consumer.sh --topic <t> --partition 0 --offset 12345` | Read from specific offset |
| `kafka-configs.sh --entity-type topics --entity-name <t> --describe` | Show topic-level config overrides |
| `du -sh /var/kafka-logs/<topic>-*` | Disk usage per partition |

| Concept | Key number |
|---|---|
| Max consumers per group | = partition count (hard limit) |
| Production replication factor | 3 (tolerates 1 broker failure) |
| Production `min.insync.replicas` | 2 (with `acks=all`) |
| Default `max.poll.interval.ms` | 300,000 ms (5 minutes) |
| Default `session.timeout.ms` | 45,000 ms (45 seconds, post-KIP-735) |
| Default `auto.offset.reset` | `latest` (skips existing messages) |
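The first rule in the table (max consumers per group equals the partition count) falls out of how partitions are assigned: each partition goes to exactly one consumer in the group. A toy assignment makes it concrete; this round-robin is illustrative only, not Kafka's actual assignor (range, sticky, cooperative-sticky).

```python
# Sketch of why consumers beyond the partition count sit idle.
# Round-robin assignment here is illustrative; Kafka's real assignors
# differ in detail but share the same hard limit.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: the 4th consumer gets nothing.
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# → {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

Adding a 4th consumer to a 3-partition topic buys you a hot spare, not more throughput.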

Exercises

Exercise 1: Read the Lag (2 minutes)

Given this output:

GROUP          TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
order-proc     order-events  0          50000           50010           10     c1
order-proc     order-events  1          30000           90000           60000  c2
order-proc     order-events  2          48000           48005           5      -
  1. Which partition has the most lag?
  2. Which partition has no consumer assigned?
  3. What's your first diagnostic step?
Answer:

1. Partition 1 — 60,000 messages behind.
2. Partition 2 — the `-` in CONSUMER-ID means no consumer is assigned.
3. Check the group state (`--state`). An unassigned partition with a running group suggests a rebalance is in progress or the group has fewer consumers than partitions. Also check if consumer c2 on partition 1 is struggling — 60K lag suggests it's processing slowly.

Exercise 2: Design a Topic (5 minutes)

You're building a payment processing system. Requirements:

- 5,000 transactions/sec at peak
- All transactions for the same account must be processed in order
- Must survive any single broker failure with no data loss
- Planning for 3x traffic growth over 2 years

What are your topic settings?

Answer
kafka-topics.sh --bootstrap-server kafka-1.prod:9092 \
  --create --topic payment-transactions \
  --partitions 24 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000
- **Partitions: 24** — 5K msg/s × 3x growth = 15K msg/s future. At ~1K msg/s per consumer, you need 15 consumers max. 24 partitions gives headroom.
- **Replication factor: 3** — survives one broker failure.
- **min.insync.replicas: 2** — with `acks=all` on the producer, ensures at least 2 replicas confirm every write.
- **Partition key: account ID** — ensures per-account ordering.
- **Retention: 7 days** — adjust based on compliance requirements.
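The account-ID partition key gives per-account ordering because the producer hashes the key to pick a partition deterministically. The sketch below shows that property; Kafka's default partitioner actually uses murmur2 on the key bytes, and `partition_for` with md5 is only a stand-in.

```python
import hashlib

# Illustrative keyed partitioner. Kafka's default uses murmur2 on the
# key bytes; md5 is a stand-in to demonstrate deterministic mapping.
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("account-42", 24)
p2 = partition_for("account-42", 24)
assert p1 == p2   # same account always lands on the same partition

# Changing the partition count remaps keys, which is why growing a
# keyed topic breaks ordering (the "one-way door"):
p3 = partition_for("account-42", 48)   # may differ from p1
```

This is also why you size partitions for future traffic up front: repartitioning later moves keys to different partitions mid-stream.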

Exercise 3: Diagnose the Silent Failure (10 minutes)

A new consumer group analytics-pipeline starts consuming user-events. The team reports "no data is coming through." The topic has 1 million existing messages and 500 msg/s of new traffic. Consumers are running and show Stable state with zero lag.

What happened? How do you fix it?

Answer: The consumer has `auto.offset.reset=latest` (the default). Since this is a new group with no previously committed offsets, it started from the latest offset — skipping all 1 million existing messages. Zero lag is correct: it's consuming new messages as they arrive, but historical data was silently skipped.

Fix: Stop consumers, reset offsets to earliest, restart:
kafka-consumer-groups.sh --bootstrap-server kafka-1.prod:9092 \
  --group analytics-pipeline --topic user-events \
  --reset-offsets --to-earliest --execute
Then restart the consumers. They'll reprocess from the beginning. Ensure the consumer application is idempotent (processing the same message twice produces the same result).

Prevention: Set `auto.offset.reset=earliest` in the consumer config for any group that needs historical data. Document the choice.
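The idempotency requirement can be as simple as recording which (partition, offset) pairs have already been applied. A minimal in-memory sketch follows; `handle`, `processed`, and `balances` are hypothetical names, and in production the dedup state must live in the same transactional store as the side effects, not in memory.

```python
# Idempotent handler sketch: replayed messages (same partition + offset)
# are skipped, so resetting offsets to earliest cannot double-apply effects.
processed = set()
balances = {}

def handle(partition, offset, account, amount):
    key = (partition, offset)
    if key in processed:          # already applied: this is a replay, skip
        return
    balances[account] = balances.get(account, 0) + amount
    processed.add(key)

handle(0, 7, "acct-1", 100)
handle(0, 7, "acct-1", 100)       # redelivered after the reset: no double count
print(balances)                   # → {'acct-1': 100}
```

With this in place, the offset reset above is safe: historical messages are reprocessed, duplicates are no-ops.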

Takeaways

  1. Kafka is a commit log, not a queue. Messages are append-only, immutable, and retained. Consumers maintain pointers (offsets) into the log. This single mental model explains partitions, replay, consumer groups, and retention.

  2. Consumer lag is your primary health signal. If lag is growing, something is wrong. If lag is zero but shouldn't be, check auto.offset.reset. Monitor lag per partition, not just total.

  3. Partition count is a one-way door. You can increase but never decrease. For keyed topics, even increasing breaks ordering. Plan for 2 years of growth at topic creation.

  4. Rebalances are the #1 source of consumer stalls. Use cooperative rebalancing, static group membership, and generous poll intervals to minimize them.

  5. acks=all + min.insync.replicas=2 + replication factor 3 is the production durability standard. Anything less risks data loss on broker failure.

  6. KRaft replaced ZooKeeper. New clusters should use KRaft exclusively. ZooKeeper support is removed as of Kafka 4.0.