
Kafka - Primer

Why This Matters

Kafka is the backbone of event-driven architectures and real-time data pipelines at scale. When Kafka is healthy, data flows invisibly. When it is not — consumer lag spikes, partitions go offline, or a broker runs out of disk — every downstream system feels it. Ops engineers need to understand Kafka's architecture to monitor it, debug it, and keep it running at 2 AM when the on-call page fires.

Core Concepts

1. Topics, Partitions, and Offsets

A topic is a named stream of records. Topics are split into partitions for parallelism. Each record in a partition has a sequential offset.

Topic: order-events (3 partitions)

Partition 0: [0] [1] [2] [3] [4] [5] ...
Partition 1: [0] [1] [2] [3] ...
Partition 2: [0] [1] [2] [3] [4] [5] [6] [7] ...
# List topics
kafka-topics.sh --bootstrap-server localhost:9092 --list

# Create a topic
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic order-events \
  --partitions 6 --replication-factor 3

# Describe a topic (partitions, replicas, ISR)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic order-events

# Output:
# Topic: order-events  PartitionCount: 6  ReplicationFactor: 3
# Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1

Key sizing rules:

  • More partitions = more parallelism (one consumer per partition max)
  • Fewer partitions = simpler operations, less overhead
  • Replication factor of 3 is standard for production (tolerates 1 broker failure)

Fun fact: Kafka was originally developed at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao in 2010. Jay Kreps named it after the author Franz Kafka because "Kafka is a system optimized for writing" and he was a fan of Kafka's work. It was open-sourced in 2011 and became a top-level Apache project in 2012.

Remember Kafka's ordering guarantee: messages are ordered within a partition, not across partitions. If you need global ordering for a topic, use a single partition (sacrificing parallelism). If you need ordering per entity (e.g., per customer), use the entity ID as the message key — Kafka hashes the key to determine the partition, so all messages for the same key go to the same partition.
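The key-to-partition mapping can be sketched in a few lines of Python. Kafka's default partitioner actually uses murmur2; md5 here is a stand-in to show the idea that equal keys always map to the same partition:

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index. Kafka's default
    partitioner uses murmur2; md5 is a stand-in to show the idea."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands on the same partition, so per-entity
# ordering holds even though the topic has several partitions.
assert partition_for("customer-42") == partition_for("customer-42")
```

One consequence worth knowing: the mapping depends on the partition count, so adding partitions to a topic changes where new messages for a given key land.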

2. Producers and Consumers

# Produce messages (testing/debugging)
kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic order-events \
  --property "key.separator=:" \
  --property "parse.key=true"
# Type: order-123:{"item":"widget","qty":5}

# Consume from beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --from-beginning

# Consume with key, value, timestamp, and partition info
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events \
  --property print.key=true \
  --property print.timestamp=true \
  --property print.partition=true \
  --from-beginning

3. Consumer Groups

Consumer groups enable parallel processing. Each partition is consumed by exactly one consumer in the group:

# List consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe a group (see lag per partition)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

# Output:
# GROUP            TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor  order-events   0          15234           15240           6
# order-processor  order-events   1          12100           15100           3000
# order-processor  order-events   2          14999           15000           1

LAG is the critical metric — it tells you how far behind a consumer is. High lag means consumers cannot keep up with producers.
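Lag is simply log-end-offset minus current-offset, summed across partitions. A quick sketch using the offsets from the example output above:

```python
# Per-partition offsets as reported by kafka-consumer-groups.sh above
partitions = {
    0: {"current": 15234, "log_end": 15240},
    1: {"current": 12100, "log_end": 15100},
    2: {"current": 14999, "log_end": 15000},
}

def lag(p: dict) -> int:
    # Lag = messages written to the partition but not yet consumed
    return p["log_end"] - p["current"]

total_lag = sum(lag(p) for p in partitions.values())
print(total_lag)  # 6 + 3000 + 1 = 3007
```

Note how partition 1 dominates the total: per-partition lag matters, because a single stuck or slow partition can hide behind a healthy-looking average.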

Gotcha: You can never have more active consumers in a group than partitions in the topic. If a topic has 6 partitions and you scale to 8 consumers, 2 consumers will sit idle. This is a hard limit — plan your partition count around your expected maximum parallelism. You can add partitions later (never remove them without recreating the topic), but adding partitions changes which partition new messages for a given key land on, which can break per-key ordering during the transition.
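A minimal sketch of why extra consumers idle, assuming a simplified round-robin assignor (real assignors such as range, round-robin, and cooperative-sticky differ in detail, but all share the one-consumer-per-partition constraint):

```python
def assign(partitions: list, consumers: list) -> dict:
    """Simplified round-robin partition assignment: every partition
    gets exactly one consumer; surplus consumers get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 8 consumers in the group
result = assign(list(range(6)), [f"c{i}" for i in range(8)])
idle = [c for c, parts in result.items() if not parts]
print(idle)  # ['c6', 'c7'] -- two consumers sit idle
```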

Under the hood: When Kafka assigns partitions to consumers in a group, it uses a "group coordinator" broker and a rebalancing protocol. Before KIP-429 (cooperative rebalancing, Kafka 2.4+), every rebalance caused a "stop-the-world" pause where ALL consumers in the group stopped processing. With cooperative rebalancing, only the affected consumers pause. If you see sudden lag spikes correlating with consumer joins/leaves, check if you are using the older eager rebalancing strategy.

4. Brokers and Cluster Architecture

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Broker 1 │  │ Broker 2 │  │ Broker 3 │
│ (ctrl)   │  │          │  │          │
│ P0-lead  │  │ P1-lead  │  │ P2-lead  │
│ P1-repl  │  │ P2-repl  │  │ P0-repl  │
│ P2-repl  │  │ P0-repl  │  │ P1-repl  │
└──────────┘  └──────────┘  └──────────┘
# Check controller and quorum health (KRaft clusters)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 \
  describe --status

# Check under-replicated partitions (critical alert)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Check offline partitions (data unavailable)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions

5. Replication and ISR

ISR (In-Sync Replicas) is the set of replicas that are caught up to the leader. If ISR shrinks, you are losing redundancy:

# Flag partitions whose ISR has fewer than 3 members (assumes RF=3)
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  | awk -F'Isr: ' '/Partition:/ && split($2, isr, ",") < 3'

# A healthy partition: Replicas: 1,2,3  Isr: 1,2,3
# An unhealthy partition: Replicas: 1,2,3  Isr: 1,3  (broker 2 fell out of sync)

Key configs:

Config                          Default  Meaning
min.insync.replicas             1        Minimum ISR count for acks=all writes to succeed
unclean.leader.election.enable  false    Allow an out-of-sync replica to become leader (risks data loss)
replica.lag.time.max.ms         30000    Time before a slow replica is dropped from the ISR
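The interaction between acks=all and min.insync.replicas reduces to one rule, sketched here (a simplification, not broker code):

```python
def write_accepted(isr_size: int, min_insync_replicas: int) -> bool:
    """With acks=all, the leader rejects produce requests
    (NOT_ENOUGH_REPLICAS) whenever the current ISR is smaller
    than min.insync.replicas."""
    return isr_size >= min_insync_replicas

# replication.factor=3, min.insync.replicas=2: one broker can drop
# out of the ISR without blocking writes, but not two.
assert write_accepted(3, 2)
assert write_accepted(2, 2)
assert not write_accepted(1, 2)
```

This is why the common production combination is replication factor 3 with min.insync.replicas=2: it guarantees every acknowledged write is on at least two brokers while still tolerating one broker outage.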

Default trap: The auto.offset.reset consumer config controls what happens when a consumer group has no committed offset (first time consuming, or offsets expired). The default is latest — meaning it skips all existing messages and only reads new ones. If you deploy a new consumer expecting to process historical data and forget to set auto.offset.reset=earliest, it will silently skip everything. This causes "messages are lost" incidents that are actually a configuration error.
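If a new consumer is expected to process historical data, set the value explicitly rather than relying on the default; a minimal consumer config fragment:

```properties
# consumer.properties -- be explicit; the default (latest) silently
# skips all pre-existing messages for a brand-new group
auto.offset.reset=earliest
```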

6. Retention and Storage

# Check topic retention settings
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events --describe

# Set retention to 7 days
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.ms=604800000

# Set retention by size (100GB per partition)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.bytes=107374182400

# Check disk usage per topic
du -sh /var/kafka-logs/order-events-*

Monitor disk usage — Kafka brokers that run out of disk will crash, taking partitions offline.
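The retention values used in the commands above are plain arithmetic; a quick check (remember retention.bytes applies per partition, and the "100GB" figure is GiB):

```python
# retention.ms and retention.bytes values from the commands above
MS_PER_DAY = 24 * 60 * 60 * 1000

assert 7 * MS_PER_DAY == 604_800_000       # retention.ms for 7 days
assert 100 * 1024**3 == 107_374_182_400    # retention.bytes, 100 GiB
```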

7. Lag Monitoring and Alerting

Consumer lag is the single most important Kafka operational metric:

# Quick lag check
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor | awk '{sum += $6} END {print "Total lag: " sum}'

# Burrow (dedicated lag monitoring tool) or Prometheus + kafka_exporter
# Prometheus query for consumer lag:
# kafka_consumergroup_lag{consumergroup="order-processor",topic="order-events"}

Alert thresholds (example):

WARNING:  lag > 10000 for > 5 minutes
CRITICAL: lag > 100000 for > 5 minutes
CRITICAL: lag increasing continuously for > 15 minutes
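The first two thresholds can be sketched as a small evaluation function, assuming one lag sample per minute over a 5-minute window (names and thresholds are illustrative, not from any particular monitoring tool; the "continuously increasing" rule needs a trend check and is not modeled here):

```python
def lag_alert(samples, warn=10_000, crit=100_000):
    """Evaluate the example thresholds against lag samples collected
    once per minute over a 5-minute window."""
    if all(s > crit for s in samples):
        return "CRITICAL"
    if all(s > warn for s in samples):
        return "WARNING"
    return "OK"

print(lag_alert([12_000, 15_000, 18_000, 20_000, 25_000]))  # WARNING
print(lag_alert([200_000] * 5))                             # CRITICAL
```

Requiring every sample in the window to exceed the threshold (rather than a single spike) avoids paging on momentary lag bursts that a healthy consumer absorbs on its own.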

8. Common Ops Tasks

# Preview an offset reset (dangerous: --to-earliest reprocesses old
# messages, --to-latest skips everything unread; always dry-run first)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --dry-run

# Execute the reset (remove --dry-run, group must be stopped)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --execute

# Reassign partitions (rebalance after adding brokers)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Delete a topic
kafka-topics.sh --bootstrap-server localhost:9092 \
  --delete --topic old-events

# Increase partitions (cannot decrease)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic order-events --partitions 12

9. Debugging Checklist

When lag is increasing:
  1. Check consumer logs for errors
  2. Check consumer group membership (rebalancing?)
  3. Check broker CPU/disk/network (producer side ok?)
  4. Check GC pauses on consumer JVMs
  5. Check max.poll.records and processing time per batch
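Step 5 deserves a back-of-the-envelope check: if processing one poll batch takes longer than max.poll.interval.ms (default 300000 ms, i.e. 5 minutes), the consumer is evicted from the group, triggering a rebalance and more lag. A sketch with illustrative numbers:

```python
# Illustrative numbers only -- measure your own per-record time
max_poll_records = 500          # records fetched per poll
ms_per_record = 4               # measured processing time per record
batch_ms = max_poll_records * ms_per_record
max_poll_interval_ms = 300_000  # Kafka default (5 minutes)

# If batch_ms exceeded max.poll.interval.ms, the consumer would be
# kicked out of the group mid-batch, making lag worse, not better.
assert batch_ms < max_poll_interval_ms
print(batch_ms)  # 2000 ms per batch -- comfortably inside the limit
```

If the math comes out the other way, either lower max.poll.records or raise max.poll.interval.ms, then fix the slow processing.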

When a broker is down:
  1. Check broker logs: /var/log/kafka/server.log
  2. Check disk space: df -h /var/kafka-logs
  3. Check JVM: jstat -gcutil <pid>
  4. Check ZooKeeper/KRaft connectivity
  5. Check under-replicated partitions

When messages are "lost":
  1. Verify producer acks setting (acks=all for durability)
  2. Check min.insync.replicas
  3. Check consumer auto.offset.reset (earliest vs latest)
  4. Check if consumer group was reset
  5. Check retention: did messages expire?

Timeline: Kafka's evolution: 2010 — developed at LinkedIn. 2011 — open-sourced. 2012 — top-level Apache project. 2014 — Confluent founded by the original Kafka creators. 2019 — KIP-500 proposed replacing ZooKeeper with KRaft (Kafka Raft). 2022 — KRaft production-ready in Kafka 3.3. 2024 — ZooKeeper mode officially deprecated. If you are setting up a new cluster today, use KRaft — ZooKeeper is legacy.

Key Takeaway

Kafka ops comes down to three things: partitions (right count, balanced across brokers, fully replicated), consumer lag (the primary health signal — if lag is growing, something is wrong), and disk (Kafka is a distributed log — it needs disk space and will crash without it). Master the CLI tools, monitor lag religiously, and always check ISR status when things look wrong.

