
Kafka - Primer

Why This Matters

Kafka is the backbone of event-driven architectures and real-time data pipelines at scale. When Kafka is healthy, data flows invisibly. When it is not — consumer lag spikes, partitions go offline, or a broker runs out of disk — every downstream system feels it. Ops engineers need to understand Kafka's architecture to monitor it, debug it, and keep it running at 2 AM when the on-call page fires.

Core Concepts

1. Topics, Partitions, and Offsets

A topic is a named stream of records. Topics are split into partitions for parallelism. Each record in a partition has a sequential offset.

Topic: order-events (3 partitions)

Partition 0: [0] [1] [2] [3] [4] [5] ...
Partition 1: [0] [1] [2] [3] ...
Partition 2: [0] [1] [2] [3] [4] [5] [6] [7] ...
# List topics
kafka-topics.sh --bootstrap-server localhost:9092 --list

# Create a topic
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic order-events \
  --partitions 6 --replication-factor 3

# Describe a topic (partitions, replicas, ISR)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic order-events

# Output:
# Topic: order-events  PartitionCount: 6  ReplicationFactor: 3
# Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1

Key sizing rules:

  • More partitions = more parallelism (one consumer per partition max)
  • Fewer partitions = simpler operations, less overhead
  • Replication factor of 3 is standard for production (tolerates 1 broker failure)

Fun fact: Kafka was originally developed at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao in 2010. Jay Kreps named it after the author Franz Kafka because "Kafka is a system optimized for writing" and he was a fan of Kafka's work. It was open-sourced in 2011 and became a top-level Apache project in 2012.

Remember Kafka's ordering guarantee: messages are ordered within a partition, not across partitions. If you need global ordering for a topic, use a single partition (sacrificing parallelism). If you need ordering per entity (e.g., per customer), use the entity ID as the message key — Kafka hashes the key to determine the partition, so all messages for the same key go to the same partition.
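The key-to-partition mapping can be sketched in a few lines of Python. Kafka's default partitioner actually uses murmur2; md5 here is a stand-in to show the idea that equal keys always map to the same partition:

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index. Kafka's default
    partitioner uses murmur2; md5 is a stand-in to show the idea."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands on the same partition, so per-entity
# ordering holds even though the topic has several partitions.
assert partition_for("customer-42") == partition_for("customer-42")
```

One consequence worth knowing: the mapping depends on the partition count, so adding partitions to a topic changes where new messages for a given key land.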

2. Producers and Consumers

# Produce messages (testing/debugging)
kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic order-events \
  --property "key.separator=:" \
  --property "parse.key=true"
# Type: order-123:{"item":"widget","qty":5}

# Consume from beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --from-beginning

# Consume with key, value, timestamp, and partition info
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.partition=true \
  --from-beginning

3. Consumer Groups

Consumer groups enable parallel processing. Each partition is consumed by exactly one consumer in the group:

# List consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe a group (see lag per partition)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

# Output:
# GROUP            TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor  order-events   0          15234           15240           6
# order-processor  order-events   1          12100           15100           3000
# order-processor  order-events   2          14999           15000           1

LAG is the critical metric — it tells you how far behind a consumer is. High lag means consumers cannot keep up with producers.
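Lag is simply log-end-offset minus current-offset, summed across partitions. A quick sketch using the offsets from the example output above:

```python
# Per-partition offsets as reported by kafka-consumer-groups.sh above
partitions = {
    0: {"current": 15234, "log_end": 15240},
    1: {"current": 12100, "log_end": 15100},
    2: {"current": 14999, "log_end": 15000},
}

def lag(p: dict) -> int:
    # Lag = messages written to the partition but not yet consumed
    return p["log_end"] - p["current"]

total_lag = sum(lag(p) for p in partitions.values())
print(total_lag)  # 6 + 3000 + 1 = 3007
```

Note how partition 1 dominates the total: per-partition lag matters, because a single stuck or slow partition can hide behind a healthy-looking average.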

Gotcha: You can never have more active consumers in a group than partitions in the topic. If a topic has 6 partitions and you scale to 8 consumers, 2 consumers will sit idle. This is a hard limit — plan your partition count around your expected maximum parallelism. You can add partitions later (never remove them without recreating the topic), but adding partitions changes which partition new messages for a given key land on, which can break per-key ordering during the transition.
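A minimal sketch of why extra consumers idle, assuming a simplified round-robin assignor (real assignors such as range, round-robin, and cooperative-sticky differ in detail, but all share the one-consumer-per-partition constraint):

```python
def assign(partitions: list, consumers: list) -> dict:
    """Simplified round-robin partition assignment: every partition
    gets exactly one consumer; surplus consumers get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 8 consumers in the group
result = assign(list(range(6)), [f"c{i}" for i in range(8)])
idle = [c for c, parts in result.items() if not parts]
print(idle)  # ['c6', 'c7'] -- two consumers sit idle
```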

Under the hood: When Kafka assigns partitions to consumers in a group, it uses a "group coordinator" broker and a rebalancing protocol. Before KIP-429 (cooperative rebalancing, Kafka 2.4+), every rebalance caused a "stop-the-world" pause where ALL consumers in the group stopped processing. With cooperative rebalancing, only the affected consumers pause. If you see sudden lag spikes correlating with consumer joins/leaves, check if you are using the older eager rebalancing strategy.

4. Brokers and Cluster Architecture

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Broker 1 │  │ Broker 2 │  │ Broker 3 │
│ (ctrl)   │  │          │  │          │
│ P0-lead  │  │ P1-lead  │  │ P2-lead  │
│ P1-repl  │  │ P2-repl  │  │ P0-repl  │
│ P2-repl  │  │ P0-repl  │  │ P1-repl  │
└──────────┘  └──────────┘  └──────────┘
# Check controller and quorum health (KRaft clusters)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 \
  describe --status

# Check under-replicated partitions (critical alert)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Check offline partitions (data unavailable)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions

5. Replication and ISR

ISR (In-Sync Replicas) is the set of replicas that are caught up to the leader. If ISR shrinks, you are losing redundancy:

# Flag partitions whose ISR has fewer than 3 members (assumes RF=3)
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  | awk -F'Isr: ' '/Partition:/ && split($2, isr, ",") < 3'

# A healthy partition: Replicas: 1,2,3  Isr: 1,2,3
# An unhealthy partition: Replicas: 1,2,3  Isr: 1,3  (broker 2 fell out of sync)

Key configs:

Config                          Default  Meaning
min.insync.replicas             1        Minimum ISR count for acks=all writes to succeed
unclean.leader.election.enable  false    Allow an out-of-sync replica to become leader (risks data loss)
replica.lag.time.max.ms         30000    Time before a slow replica is dropped from the ISR
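The interaction between acks=all and min.insync.replicas reduces to one rule, sketched here (a simplification, not broker code):

```python
def write_accepted(isr_size: int, min_insync_replicas: int) -> bool:
    """With acks=all, the leader rejects produce requests
    (NOT_ENOUGH_REPLICAS) whenever the current ISR is smaller
    than min.insync.replicas."""
    return isr_size >= min_insync_replicas

# replication.factor=3, min.insync.replicas=2: one broker can drop
# out of the ISR without blocking writes, but not two.
assert write_accepted(3, 2)
assert write_accepted(2, 2)
assert not write_accepted(1, 2)
```

This is why the common production combination is replication factor 3 with min.insync.replicas=2: it guarantees every acknowledged write is on at least two brokers while still tolerating one broker outage.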

Default trap: The auto.offset.reset consumer config controls what happens when a consumer group has no committed offset (first time consuming, or offsets expired). The default is latest — meaning it skips all existing messages and only reads new ones. If you deploy a new consumer expecting to process historical data and forget to set auto.offset.reset=earliest, it will silently skip everything. This causes "messages are lost" incidents that are actually a configuration error.
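If a new consumer is expected to process historical data, set the value explicitly rather than relying on the default; a minimal consumer config fragment:

```properties
# consumer.properties -- be explicit; the default (latest) silently
# skips all pre-existing messages for a brand-new group
auto.offset.reset=earliest
```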

6. Retention and Storage

# Check topic retention settings
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events --describe

# Set retention to 7 days
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.ms=604800000

# Set retention by size (100GB per partition)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.bytes=107374182400

# Check disk usage per topic
du -sh /var/kafka-logs/order-events-*

Monitor disk usage — Kafka brokers that run out of disk will crash, taking partitions offline.
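The retention values used in the commands above are plain arithmetic; a quick check (remember retention.bytes applies per partition, and the "100GB" figure is GiB):

```python
# retention.ms and retention.bytes values from the commands above
MS_PER_DAY = 24 * 60 * 60 * 1000

assert 7 * MS_PER_DAY == 604_800_000       # retention.ms for 7 days
assert 100 * 1024**3 == 107_374_182_400    # retention.bytes, 100 GiB
```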

7. Lag Monitoring and Alerting

Consumer lag is the single most important Kafka operational metric:

# Quick lag check
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor | awk '{sum += $6} END {print "Total lag: " sum}'

# Burrow (dedicated lag monitoring tool) or Prometheus + kafka_exporter
# Prometheus query for consumer lag:
# kafka_consumergroup_lag{consumergroup="order-processor",topic="order-events"}

Alert thresholds (example):

WARNING:  lag > 10000 for > 5 minutes
CRITICAL: lag > 100000 for > 5 minutes
CRITICAL: lag increasing continuously for > 15 minutes
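The first two thresholds can be sketched as a small evaluation function, assuming one lag sample per minute over a 5-minute window (names and thresholds are illustrative, not from any particular monitoring tool; the "continuously increasing" rule needs a trend check and is not modeled here):

```python
def lag_alert(samples, warn=10_000, crit=100_000):
    """Evaluate the example thresholds against lag samples collected
    once per minute over a 5-minute window."""
    if all(s > crit for s in samples):
        return "CRITICAL"
    if all(s > warn for s in samples):
        return "WARNING"
    return "OK"

print(lag_alert([12_000, 15_000, 18_000, 20_000, 25_000]))  # WARNING
print(lag_alert([200_000] * 5))                             # CRITICAL
```

Requiring every sample in the window to exceed the threshold (rather than a single spike) avoids paging on momentary lag bursts that a healthy consumer absorbs on its own.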

8. Common Ops Tasks

# Preview an offset reset (dangerous: --to-earliest reprocesses old
# messages, --to-latest skips everything unread; always dry-run first)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --dry-run

# Execute the reset (remove --dry-run, group must be stopped)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --execute

# Reassign partitions (rebalance after adding brokers)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Delete a topic
kafka-topics.sh --bootstrap-server localhost:9092 \
  --delete --topic old-events

# Increase partitions (cannot decrease)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic order-events --partitions 12

9. Debugging Checklist

When lag is increasing:
  1. Check consumer logs for errors
  2. Check consumer group membership (rebalancing?)
  3. Check broker CPU/disk/network (producer side ok?)
  4. Check GC pauses on consumer JVMs
  5. Check max.poll.records and processing time per batch
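Step 5 deserves a back-of-the-envelope check: if processing one poll batch takes longer than max.poll.interval.ms (default 300000 ms, i.e. 5 minutes), the consumer is evicted from the group, triggering a rebalance and more lag. A sketch with illustrative numbers:

```python
# Illustrative numbers only -- measure your own per-record time
max_poll_records = 500          # records fetched per poll
ms_per_record = 4               # measured processing time per record
batch_ms = max_poll_records * ms_per_record
max_poll_interval_ms = 300_000  # Kafka default (5 minutes)

# If batch_ms exceeded max.poll.interval.ms, the consumer would be
# kicked out of the group mid-batch, making lag worse, not better.
assert batch_ms < max_poll_interval_ms
print(batch_ms)  # 2000 ms per batch -- comfortably inside the limit
```

If the math comes out the other way, either lower max.poll.records or raise max.poll.interval.ms, then fix the slow processing.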

When a broker is down:
  1. Check broker logs: /var/log/kafka/server.log
  2. Check disk space: df -h /var/kafka-logs
  3. Check JVM: jstat -gcutil <pid>
  4. Check ZooKeeper/KRaft connectivity
  5. Check under-replicated partitions

When messages are "lost":
  1. Verify producer acks setting (acks=all for durability)
  2. Check min.insync.replicas
  3. Check consumer auto.offset.reset (earliest vs latest)
  4. Check if consumer group was reset
  5. Check retention: did messages expire?

Timeline: Kafka's evolution: 2010 — developed at LinkedIn. 2011 — open-sourced. 2012 — top-level Apache project. 2014 — Confluent founded by the original Kafka creators. 2019 — KIP-500 proposed replacing ZooKeeper with KRaft (Kafka Raft). 2022 — KRaft production-ready in Kafka 3.3. 2024 — ZooKeeper mode officially deprecated. If you are setting up a new cluster today, use KRaft — ZooKeeper is legacy.

Key Takeaway

Kafka ops comes down to three things: partitions (right count, balanced across brokers, fully replicated), consumer lag (the primary health signal — if lag is growing, something is wrong), and disk (Kafka is a distributed log — it needs disk space and will crash without it). Master the CLI tools, monitor lag religiously, and always check ISR status when things look wrong.

