# Kafka - Street-Level Ops
Real-world workflows for operating Kafka clusters and debugging consumer issues.
## Quick Cluster Health Check

```shell
# Check controller and quorum status (KRaft mode)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Check for under-replicated partitions (most important check)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Output (healthy = no output):
# Topic: order-events  Partition: 3  Leader: 2  Replicas: 2,1,3  Isr: 2,1

# Check for offline partitions (data unavailable)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions
```
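The under-replicated check is easy to script for monitoring. A minimal sketch, assuming the tab-separated `kafka-topics.sh --describe` output format shown above:

```python
def under_replicated(describe_line: str) -> bool:
    """True if the partition's in-sync replica set is smaller than its replica set."""
    fields = dict(
        part.split(": ", 1)
        for part in describe_line.strip().split("\t")
        if ": " in part
    )
    replicas = fields["Replicas"].split(",")
    isr = fields["Isr"].split(",")
    return len(isr) < len(replicas)

line = "Topic: order-events\tPartition: 3\tLeader: 2\tReplicas: 2,1,3\tIsr: 2,1"
print(under_replicated(line))  # True: broker 3 has fallen out of the ISR
```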
## Consumer Lag Investigation

```shell
# Describe consumer group — the first thing you check when lag alerts fire
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

# Output:
# GROUP            TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor  order-events  0          15234           15240           6
# order-processor  order-events  1          12100           18100           6000
# order-processor  order-events  2          14999           15000           1

# Total lag across all partitions
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor | awk 'NR>1 {sum+=$6} END {print "Total lag:", sum}'

# Check if consumer group is actively consuming (STABLE = good)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor --state

# Output:
# GROUP            COORDINATOR(ID)  STATE   #MEMBERS
# order-processor  broker-1(1)      Stable  3
```
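The same lag sum is handy inside a monitoring script. A sketch that parses the `--describe` output above (LAG is the 6th column; extra columns like CONSUMER-ID are ignored):

```python
def total_lag(describe_output: str) -> int:
    """Sum the LAG column of kafka-consumer-groups.sh --describe output."""
    total = 0
    for line in describe_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) >= 6 and fields[5].isdigit():
            total += int(fields[5])
    return total

output = """\
GROUP            TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor  order-events  0          15234           15240           6
order-processor  order-events  1          12100           18100           6000
order-processor  order-events  2          14999           15000           1"""
print(total_lag(output))  # 6007
```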
**Default trap:** `session.timeout.ms` defaults to 45s (changed from 10s by KIP-735). If a consumer takes longer than `max.poll.interval.ms` (default 300s) between polls while processing a batch, it is evicted from the group and a rebalance is triggered. During a rebalance, ALL consumers in the group pause — so one slow consumer stalls the entire group.

**Debug clue:** If consumer lag keeps growing despite consumers being "running," check `kafka-consumer-groups.sh --describe --group <name> --state`. If STATE shows `PreparingRebalance` or `CompletingRebalance` repeatedly, consumers are stuck in a rebalance loop. Common cause: a consumer crashes during processing, rejoins, gets the same partition back, and crashes again.
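The timeouts above form a hierarchy that your consumer config must respect. A sketch of illustrative settings (starting points, not universal defaults; key names follow the standard Kafka consumer configs — `max.poll.records` in particular exists in the Java and kafka-python clients but not in librdkafka-based ones):

```python
# Illustrative consumer settings for avoiding the rebalance trap above.
consumer_config = {
    "session.timeout.ms": 45_000,     # heartbeat-based liveness (KIP-735 default)
    "heartbeat.interval.ms": 15_000,  # convention: ~1/3 of session.timeout.ms
    "max.poll.interval.ms": 300_000,  # max gap between poll() calls before eviction
    "max.poll.records": 100,          # smaller batches = less work per poll cycle
}

# Invariants the client would otherwise reject or violate at runtime:
assert consumer_config["heartbeat.interval.ms"] < consumer_config["session.timeout.ms"]
assert consumer_config["session.timeout.ms"] < consumer_config["max.poll.interval.ms"]
```

If processing genuinely takes minutes per batch, shrink the batch or move the heavy work off the poll thread rather than raising `max.poll.interval.ms` indefinitely.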
## Topic Operations

```shell
# Create a topic with production settings
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic payment-events \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000   # 7 days

# Increase partitions (cannot decrease — one-way operation)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic order-events --partitions 24

# Check topic config
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events --describe

# Change retention on a topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.ms=259200000   # 3 days
```
**Gotcha:** Increasing partitions on a topic with keyed messages breaks key-based ordering. If your producer sends order ID 12345 to partition 3 (via `hash(key) % partitions`), adding partitions changes the hash mapping — future messages for order 12345 may land on partition 7 while historical messages remain on partition 3. Plan partition count at topic creation time.
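The remapping effect can be demonstrated without a cluster. A sketch using CRC32 as a stand-in hash (Kafka's default partitioner actually uses murmur2, but any hash-mod scheme remaps the same way):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # CRC32 as a stand-in for Kafka's murmur2 default partitioner.
    return zlib.crc32(key) % num_partitions

keys = [f"order-{i}".encode() for i in range(100)]
before = {k: partition_for(k, 12) for k in keys}  # original topic
after = {k: partition_for(k, 24) for k in keys}   # after --alter --partitions 24
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/100 keys changed partitions after 12 -> 24")
```

Any key that moves now has its history split across two partitions, which is exactly how key-based ordering breaks.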
## Consuming Messages for Debugging

```shell
# Peek at the first 5 messages (--from-beginning reads from the start of
# retention, not the most recent messages)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --max-messages 5 \
  --property print.key=true \
  --property print.timestamp=true \
  --from-beginning

# Consume from a specific partition and offset
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --partition 1 --offset 12100

# Consume with headers (useful for tracing)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic order-events --max-messages 3 \
  --property print.headers=true \
  --property print.key=true
```
## Consumer Group Offset Management

```shell
# Dry-run: see what resetting offsets would do
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --dry-run

# Reset to latest (skip backlog) — all consumers in the group must be STOPPED
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-latest --execute

# Reset to a specific timestamp (reprocess from a point in time)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic order-events \
  --reset-offsets --to-datetime 2024-03-14T10:00:00.000 --execute
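`--to-datetime` expects a `YYYY-MM-DDTHH:MM:SS.sss` timestamp, which is fiddly to type under pressure. A small sketch that builds the argument for "reprocess the last 2 hours" (the 2-hour window is illustrative):

```python
from datetime import datetime, timedelta

# Build the --to-datetime argument for "reprocess everything from 2 hours ago".
point = datetime.now() - timedelta(hours=2)
arg = point.strftime("%Y-%m-%dT%H:%M:%S.") + f"{point.microsecond // 1000:03d}"
print(f"--reset-offsets --to-datetime {arg} --execute")
```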
## Broker Disk Management

```shell
# Check disk usage per topic partition
du -sh /var/kafka-logs/order-events-* | sort -rh

# Output:
# 45G  /var/kafka-logs/order-events-0
# 42G  /var/kafka-logs/order-events-1
# 38G  /var/kafka-logs/order-events-2

# Check log segment files for a partition
ls -lh /var/kafka-logs/order-events-0/*.log

# Check broker-wide disk
df -h /var/kafka-logs
```
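Partition sizes like those above are driven by ingest rate and `retention.ms`. A rough steady-state estimate (ignores compression, compaction, and replication — the throughput figure is an assumption you plug in):

```python
def disk_per_partition_gb(ingest_mb_per_s: float, retention_ms: int,
                          num_partitions: int) -> float:
    """Rough steady-state size of one partition's log on one broker."""
    retention_s = retention_ms / 1000
    return ingest_mb_per_s * retention_s / 1024 / num_partitions

# 1 MB/s into a 12-partition topic with 7-day retention (604800000 ms):
print(round(disk_per_partition_gb(1, 604_800_000, 12), 1))  # 49.2 GB per partition
```

Multiply by the replication factor to get the cluster-wide footprint.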
## Partition Reassignment

```shell
# Generate reassignment plan (after adding new brokers)
kafka-reassign-partitions.ssh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4" --generate

# Execute reassignment
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Monitor reassignment progress
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify

# Throttle reassignment to avoid saturating the network (cap both the
# sending and receiving sides)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default \
  --alter --add-config leader.replication.throttled.rate=50000000,follower.replication.throttled.rate=50000000
```
**Scale note:** Partition reassignment moves data between brokers by replicating the full partition log. A 100GB partition on a 1Gbps link takes ~15 minutes at full speed. Use `follower.replication.throttled.rate` to cap bandwidth (e.g., 50MB/s) so reassignment does not starve production traffic. Monitor `kafka.server:type=ReplicaFetcherManager,name=MaxLag` during reassignment.
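When planning a maintenance window, the arithmetic above generalizes to any throttle. A sketch (partition size and throttle are the inputs you plug in):

```python
def reassignment_eta_minutes(partition_gb: float, throttle_bytes_per_s: int) -> float:
    """Time to move one partition's full log at the throttled replication rate."""
    return partition_gb * 1024**3 / throttle_bytes_per_s / 60

# 100 GB partition at the 50 MB/s throttle from the command above:
print(round(reassignment_eta_minutes(100, 50_000_000), 1))  # ~35.8 minutes
```

Partitions landing on the same target broker share that broker's throttle, so moving many partitions at once stretches the estimate accordingly.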
## Emergency: Broker Down

```shell
# 1. Check broker logs
tail -100 /var/log/kafka/server.log | grep -i "error\|fatal\|exception"

# 2. Check JVM state (GC utilization, sampled every second)
jstat -gcutil $(jps | grep Kafka | awk '{print $1}') 1000

# 3. Check disk space (most common cause)
df -h /var/kafka-logs

# 4. Check file descriptor usage
ls /proc/$(jps | grep Kafka | awk '{print $1}')/fd | wc -l
cat /proc/$(jps | grep Kafka | awk '{print $1}')/limits | grep "Max open files"

# 5. After recovery, verify replicas have caught up (no output = healthy)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```