Kafka - Street-Level Ops

Real-world workflows for operating Kafka clusters and debugging consumer issues.

Quick Cluster Health Check

# Check controller quorum status (KRaft mode)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Check for under-replicated partitions (most important check)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Output when a partition is under-replicated (a healthy cluster prints nothing):
# Topic: order-events  Partition: 3  Leader: 2  Replicas: 2,1,3  Isr: 2,1

# Check for offline partitions (data unavailable)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions

Consumer Lag Investigation

# Describe consumer group — the first thing you check when lag alerts fire
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

# Output:
# GROUP            TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor  order-events   0          15234           15240           6
# order-processor  order-events   1          12100           18100           6000
# order-processor  order-events   2          14999           15000           1

# Total lag across all partitions
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor | awk '$6 ~ /^[0-9]+$/ {sum+=$6} END {print "Total lag:", sum}'

# Check if consumer group is actively consuming (STABLE = good)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor --state

# Output:
# GROUP            COORDINATOR(ID)  STATE     #MEMBERS
# order-processor  broker-1(1)      Stable    3

Default trap: session.timeout.ms defaults to 45s (raised from 10s by KIP-735) and governs heartbeat liveness only. Processing time is policed separately by max.poll.interval.ms (default 300s): if a consumer goes longer than that between poll() calls, it is kicked out of the group and a rebalance is triggered. During an eager rebalance (the default unless the cooperative-sticky assignor is configured), ALL consumers in the group pause, so one slow consumer stalls the entire group.
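One way to sanity-check that interaction, a back-of-envelope sketch with made-up values for max.poll.records and per-record latency:

```shell
# Hypothetical sizing check: worst-case batch time must stay under
# max.poll.interval.ms, or the consumer is evicted mid-batch.
max_poll_records=500   # consumer's max.poll.records (example value)
per_record_ms=400      # measured worst-case processing time per record (example value)
needed_ms=$(( max_poll_records * per_record_ms ))
echo "worst-case batch time: ${needed_ms} ms (must be < max.poll.interval.ms, default 300000)"
```

If the product exceeds 300000 ms, lower max.poll.records or raise max.poll.interval.ms on the consumer.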

Debug clue: If consumer lag keeps growing despite consumers being "running," check kafka-consumer-groups.sh --describe --group <name> --state. If STATE shows PreparingRebalance or CompletingRebalance repeatedly, consumers are stuck in a rebalance loop. Common cause: a consumer crashes during processing, rejoins, gets the same partition, crashes again.
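A sketch of scripting that check for alerting. The sample line below is hardcoded to stand in for real output; in practice you would pipe in kafka-consumer-groups.sh --describe --group order-processor --state:

```shell
# Parse the STATE column (field 3) from a --state output line and alert on
# anything other than Stable. Sample line hardcoded for illustration.
state_line='order-processor  broker-1(1)      PreparingRebalance    3'
state=$(echo "$state_line" | awk '{print $3}')
if [ "$state" != "Stable" ]; then
  echo "ALERT: group state is $state"
fi
```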

Topic Operations

# Create a topic with production settings
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic payment-events \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

# Increase partitions (cannot decrease — one-way operation)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic order-events --partitions 24

# Check topic config
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events --describe

# Change retention on a topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.ms=259200000

Gotcha: Increasing partitions on a topic with keyed messages breaks key-based ordering. If your producer sends order ID 12345 to partition 3 (via murmur2(key) % partitions, Kafka's default partitioner), adding partitions changes the hash mapping: future messages for order 12345 may land on partition 7 while historical messages remain on partition 3. Plan partition count at topic creation time.
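A toy illustration of the remapping. The hash value here is made up; Kafka actually computes murmur2 over the key bytes:

```shell
# Made-up hash value standing in for murmur2(key) on a keyed message
key_hash=1000003
echo "at 12 partitions -> partition $(( key_hash % 12 ))"
echo "at 24 partitions -> partition $(( key_hash % 24 ))"
```

Same key, different partition after the increase, so consumers can see newer messages for a key before older ones.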

Consuming Messages for Debugging

# Sample the first 5 messages on the topic (--from-beginning reads from the
# oldest offset; omit it to tail only new messages)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --max-messages 5 \
  --property print.key=true \
  --property print.timestamp=true \
  --from-beginning

# Consume from a specific partition and offset (bounded with --max-messages;
# without it the consumer runs until Ctrl-C)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --partition 1 --offset 12100 --max-messages 10

# Consume with headers (useful for tracing)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --max-messages 3 \
  --property print.headers=true \
  --property print.key=true

Consumer Group Offset Management

# Dry-run: see what resetting offsets would do
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --dry-run

# Reset to latest (skip backlog) — group must be STOPPED
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --execute
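A sketch for confirming the group really is inactive first. The sample line stands in for the --state output shown earlier; #MEMBERS is the last column:

```shell
# Parse #MEMBERS (field 4) from a --state output line; reset-offsets --execute
# fails unless the group has no active members. Sample line hardcoded; in
# practice pipe in the real --state output.
state_line='order-processor  broker-1(1)      Empty    0'
members=$(echo "$state_line" | awk '{print $4}')
if [ "$members" -eq 0 ]; then
  echo "no active members: safe to reset offsets"
else
  echo "group still has $members member(s): stop consumers first"
fi
```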

# Reset to a specific timestamp (reprocess from a point in time)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-datetime 2024-03-14T10:00:00.000 --execute
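Building the --to-datetime string from a relative time, assuming GNU date (Linux):

```shell
# GNU date assumed. Produces the YYYY-MM-DDTHH:MM:SS.sss shape the tool
# expects; here, a reset point two hours in the past, in UTC.
reset_point=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S.000)
echo "$reset_point"
```

The string carries no explicit timezone, so always confirm the resulting offsets with --dry-run before --execute.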

Broker Disk Management

# Check disk usage per topic partition
du -sh /var/kafka-logs/order-events-* | sort -rh

# Output:
# 45G   /var/kafka-logs/order-events-0
# 42G   /var/kafka-logs/order-events-1
# 38G   /var/kafka-logs/order-events-2

# Check log segment files for a partition
ls -lh /var/kafka-logs/order-events-0/*.log

# Check broker-wide disk
df -h /var/kafka-logs

Partition Reassignment

# Generate reassignment plan (after adding new brokers)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4" --generate

# Execute reassignment
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Monitor reassignment progress
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify

# Throttle reassignment to avoid saturating the network (set both the leader
# and follower rates; 50000000 bytes/s = 50MB/s)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default \
  --alter --add-config leader.replication.throttled.rate=50000000,follower.replication.throttled.rate=50000000

Scale note: Partition reassignment moves data between brokers by replicating the full partition log. A 100GB partition on a 1Gbps link takes roughly 15 minutes at full speed. Use follower.replication.throttled.rate together with leader.replication.throttled.rate to cap bandwidth (e.g., 50MB/s) so reassignment does not starve production traffic; running --verify after completion also clears the throttle. Monitor kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica during reassignment.
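Back-of-envelope math for how long a throttled move takes, using the numbers from the note above:

```shell
partition_gb=100   # partition size from the example above
rate_mb=50         # throttle rate: 50MB/s
seconds=$(( partition_gb * 1024 / rate_mb ))
echo "~$(( seconds / 60 )) minutes to move ${partition_gb}GB at ${rate_mb}MB/s"
```

About 34 minutes per 100GB partition at 50MB/s, so budget reassignment windows accordingly.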

Emergency: Broker Down

# 1. Check broker logs
tail -100 /var/log/kafka/server.log | grep -iE "error|fatal|exception"

# 2. Check JVM state
jstat -gcutil $(jps | grep Kafka | awk '{print $1}') 1000

# 3. Check disk space (most common cause)
df -h /var/kafka-logs

# 4. Check file descriptor usage
ls /proc/$(jps | grep Kafka | awk '{print $1}')/fd | wc -l
cat /proc/$(jps | grep Kafka | awk '{print $1}')/limits | grep "Max open files"

# 5. After recovery, verify all ISRs are caught up (empty output = healthy)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
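The post-recovery check can be wrapped in a wait loop. A sketch: the describe command is passed in as arguments (assumes localhost:9092 in the usage shown):

```shell
# Block until the passed-in command reports no under-replicated partitions.
# Usage:
#   wait_for_isr kafka-topics.sh --bootstrap-server localhost:9092 \
#     --describe --under-replicated-partitions
wait_for_isr() {
  while "$@" | grep -q 'Partition:'; do
    echo "ISR still catching up..."
    sleep 10
  done
  echo "all partitions fully replicated"
}
```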