Kafka Footguns
Mistakes that cause data loss, consumer outages, or cascading pipeline failures.
1. Resetting consumer offsets on a running group
You run --reset-offsets --execute while consumers are still active. The group rebalances, some consumers pick up the new offsets, others do not. Messages get processed twice or skipped entirely. State is inconsistent.
Fix: Stop all consumers in the group before resetting offsets. Verify with --describe --group that the group state is Empty before executing, and always run with --dry-run first.
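The safe sequence can be sketched with the stock CLI. The broker address, group name, and topic name below are placeholders; substitute your own:

```shell
# 1. Stop every consumer in the group, then confirm the group state is Empty.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group orders-processor --state

# 2. Preview what the reset would do. No offsets are changed yet.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-processor --topic order-events \
  --reset-offsets --to-earliest --dry-run

# 3. Only after reviewing the dry-run output, apply it.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-processor --topic order-events \
  --reset-offsets --to-earliest --execute
```

The tool refuses to reset offsets for a non-empty group, but checking the state yourself first avoids a confusing error mid-change.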
War story: A documented incident at Trendyol involved losing 10,000 orders during a deploy when auto-commit was enabled and the consumer crashed between commit and processing. Offset resets on running groups compound this — some consumers pick up new offsets while others retain old ones, causing both duplicates and gaps.
2. Setting acks=1 for critical data
Producer sends a message, leader acknowledges, then the leader dies before replicating. The message is lost permanently. With acks=1 you have no durability guarantee beyond a single broker.
Fix: Use acks=all for any data you cannot afford to lose. Combine with min.insync.replicas=2 to ensure at least two brokers confirm the write.
Default trap:
acks=1 is the Kafka producer default before 3.0. Since Kafka 3.0, the default is acks=all, but if you're on an older client library or copied config from a legacy template, you're still running acks=1. Check explicitly.
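A durability-first producer setup can be sketched as follows. The property names are standard Kafka producer settings; min.insync.replicas is a topic/broker-side setting, shown here for context:

```properties
# Producer side (e.g. producer.properties)
acks=all                  # wait for every in-sync replica to acknowledge
enable.idempotence=true   # broker dedupes retried sends (requires acks=all)

# Topic / broker side
min.insync.replicas=2     # a write needs at least 2 in-sync replicas to succeed
```

With this combination, a produce request fails loudly (NotEnoughReplicas) rather than silently losing data when replication is degraded.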
3. Increasing partitions on a keyed topic
You increase partitions from 6 to 12 on a topic that uses key-based routing. The hash function now maps keys to different partitions. Order-123 used to go to partition 2, now it goes to partition 8. Consumers that rely on per-key ordering see interleaved events.
Fix: Never increase partitions on a keyed topic without a migration plan. Create a new topic with the desired partition count and re-produce with updated routing.
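The remapping is easy to demonstrate. The sketch below substitutes CRC32 for Kafka's real murmur2 partitioner, so the specific partition numbers are illustrative, not what Kafka would compute, but the failure mode is identical: partition = hash(key) % num_partitions, so changing the partition count changes the mapping for a large fraction of keys.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy key partitioner: hash(key) mod partition count.
    Kafka's default partitioner uses murmur2, not CRC32,
    but the modulo step is the same."""
    return zlib.crc32(key) % num_partitions

keys = [f"order-{i}".encode() for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 12))
print(f"{moved} of {len(keys)} keys change partition going 6 -> 12")
```

Roughly half the keys move even though 12 is an exact multiple of 6: a key stays put only when hash % 12 happens to land below 6.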
4. Running out of disk on a broker
Kafka broker fills its disk. It cannot write new log segments. The broker crashes. All partitions with leaders on that broker go offline. Producers start failing. If other brokers are also near capacity, cascade failure begins.
Fix: Monitor disk usage and alert at 70%. Set log.retention.bytes per topic. Use log.dirs with multiple disks so a full volume takes only that log directory offline rather than the whole broker. With a single log directory there is no graceful disk-full handling; the broker crashes.
Under the hood: When a broker's disk fills, it throws java.io.IOException: No space left on device and the log segment cannot be rolled. The broker shuts down hard. Unlike databases that can go read-only, Kafka has no "degraded" disk-full mode; it's crash or nothing.
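The per-topic retention cap from the fix above can be set with kafka-configs.sh. The broker address, topic name, and sizes are example values:

```shell
# Cap the topic at ~50 GB per partition and 7 days, whichever hits first.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name order-events \
  --alter --add-config retention.bytes=53687091200,retention.ms=604800000
```

Note that retention.bytes applies per partition, so the topic-wide footprint is roughly that value times the partition count.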
5. Enabling unclean leader election
You set unclean.leader.election.enable=true to improve availability. A broker falls behind on replication, then the leader dies. The out-of-sync broker becomes leader. All messages written to the old leader since the last sync are permanently lost.
Fix: Keep unclean.leader.election.enable=false (the default). Accept a brief unavailability window over permanent data loss. Fix the underlying replication lag instead.
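These guardrails are typically pinned per topic at creation time. Broker address, topic name, and counts below are example values:

```shell
# Create a topic with durability-first settings baked in.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic order-events --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false
```

Setting unclean.leader.election.enable explicitly at the topic level guards against someone later flipping the broker-wide default.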
6. Not monitoring consumer lag
Consumers fall behind by millions of messages. Nobody notices for hours because there is no lag monitoring. By the time the alert fires (on downstream symptoms), the backlog takes hours to clear and downstream systems are stale.
Fix: Monitor consumer lag per partition via kafka-consumer-groups.sh --describe or a dedicated tool (Burrow, kafka_exporter + Prometheus). Alert when lag exceeds thresholds or is continuously increasing.
Debug clue: Lag that grows steadily means consumers are slower than producers; scale up. Lag that spikes and recovers means GC pauses or rebalances. Lag that jumps to millions overnight usually means the consumer group died and auto.offset.reset=latest skipped everything.
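Lag is just log-end offset minus committed offset, per partition. A minimal sketch of the alerting math, using made-up sample offsets:

```python
def lag_report(end_offsets, committed, threshold=10_000):
    """Per-partition lag plus the partitions breaching the alert threshold."""
    lag = {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}
    breaches = [p for p, l in lag.items() if l > threshold]
    return lag, breaches

end = {0: 1_500_000, 1: 1_502_300, 2: 1_499_800}   # broker log-end offsets
done = {0: 1_499_990, 1: 1_460_000, 2: 1_499_750}  # last committed offsets
lag, breaches = lag_report(end, done)
print(lag)       # {0: 10, 1: 42300, 2: 50}
print(breaches)  # [1]
```

This is exactly the arithmetic Burrow and kafka_exporter do for you; the point of a dedicated tool is the second signal from the debug clue above, a lag *trend*, which a single snapshot cannot give.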
7. Setting auto.offset.reset=latest without understanding it
A new consumer group starts consuming a topic. With auto.offset.reset=latest, it skips all existing messages and only processes new ones. Historical data is silently missed. With earliest, it reprocesses everything from the beginning — potentially millions of messages.
Fix: Choose deliberately. Use latest only when you genuinely do not care about existing messages. Use earliest with idempotent processing. Document the choice.
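A toy model of what the two policies do for a partition with no committed offset. This simulates the semantics; it is not client code:

```python
def starting_offset(policy, log_start, log_end, committed=None):
    """Where a consumer begins reading a partition.
    Mirrors auto.offset.reset semantics: the policy only applies
    when there is no committed offset for the partition."""
    if committed is not None:
        return committed        # a committed offset always wins
    if policy == "earliest":
        return log_start        # replay everything still retained
    if policy == "latest":
        return log_end          # skip all existing messages
    raise ValueError(policy)    # 'none' makes the real client raise instead

# Partition holds offsets 0..99; log_end is the *next* offset, 100.
print(starting_offset("latest", 0, 100))                # 100: only new messages
print(starting_offset("earliest", 0, 100))              # 0: replays all 100
print(starting_offset("latest", 0, 100, committed=42))  # 42: policy is ignored
```

The last line is the subtlety that bites people: the setting does nothing while commits exist, then silently kicks in after offsets expire or a group is recreated.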
8. Using one giant partition
You create a topic with 1 partition for simplicity. Maximum consumer parallelism is 1. When throughput increases, you cannot scale consumers. The single consumer cannot keep up, lag grows, and you cannot fix it without creating a new topic.
Fix: Start with at least 6-12 partitions for topics expected to grow. You can always have fewer consumers than partitions, but never more active ones; extra consumers in the group sit idle.
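The parallelism ceiling follows directly from partition assignment. A round-robin sketch, simplified from Kafka's actual assignors:

```python
def assign(partitions, consumers):
    """Spread partitions across consumers round-robin.
    With fewer partitions than consumers, the extras get nothing."""
    out = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        out[consumers[i % len(consumers)]].append(p)
    return out

print(assign([0], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [], 'c3': []}  -- two consumers sit idle
print(assign([0, 1, 2, 3, 4, 5], ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

With one partition, adding consumers changes nothing; with six, you can scale from one consumer up to six before hitting the same wall.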
9. Deleting a topic that others depend on
You delete order-events thinking it is unused. Three downstream services lose their input. They either crash or silently stop processing. There is no undo; the topic and all its data are gone.
Fix: Check consumer groups before deleting: kafka-consumer-groups.sh --list then --describe each. Verify no active consumers. Keep a topic registry documenting ownership and dependencies.
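A quick pre-delete audit with the stock tooling. There is no single "who consumes topic X" command, so you grep each group's describe output; the broker address and topic name are examples:

```shell
# List every group, then check each one's assignments for the topic.
for g in $(kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list); do
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group "$g" | grep -q "order-events" \
    && echo "group $g consumes order-events"
done
```

This only catches groups that commit offsets; consumers that assign partitions manually without a group won't appear, which is another reason to keep a topic registry.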
10. Ignoring ISR shrinkage
ISR drops from 1,2,3 to 1,3 on several partitions. You ignore it because data is still being served. Broker 1 dies. Now only broker 3 has the data. One more failure and the data is gone. You had zero redundancy and did not know it.
Fix: Alert on any ISR shrinkage. Investigate immediately: check the lagging broker's disk I/O, network, and GC pauses. ISR shrinkage is a leading indicator of data loss risk.
Remember: ISR = In-Sync Replicas. The formula for data safety: if ISR count >= min.insync.replicas, writes succeed. If ISR count < min.insync.replicas, producers with acks=all get errors (safe). If unclean.leader.election.enable=true and the ISR is empty, an out-of-sync replica becomes leader and you lose data (unsafe).
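The safety formula above can be written down directly. This is a sketch of the decision, not the broker's actual code:

```python
def write_outcome(isr_count, min_insync, acks="all"):
    """Outcome of a produce request given the current ISR size."""
    if acks != "all":
        return "accepted (no durability guarantee beyond the leader)"
    if isr_count >= min_insync:
        return "accepted"
    return "rejected: NotEnoughReplicas (safe: nothing is silently lost)"

print(write_outcome(3, 2))            # accepted
print(write_outcome(1, 2))            # rejected: NotEnoughReplicas...
print(write_outcome(1, 2, acks="1"))  # accepted, but one failure from loss
```

The middle case is why ISR shrinkage matters even when "everything still works": you are one broker failure away from either rejected writes (acks=all) or silent loss (acks=1).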