
Postmortem: Unbounded Kafka Topic Exhausts Broker Disk

Field Value
ID PM-014
Date 2025-09-09
Severity SEV-2
Duration 41m (detection to resolution)
Time to Detect 9m
Time to Mitigate 41m (broker disk full at 08:41 to backlog fully flushed at 09:22)
Customer Impact Order confirmation emails and push notifications delayed by up to 41 minutes for ~28,000 customers; audit log writes stalled for the full duration; no orders were lost
Revenue Impact None directly; risk of customer churn from silent notification delay estimated at <$5,000
Teams Involved Data Engineering, Platform Engineering, SRE, Compliance
Postmortem Author Okonkwo Eze
Postmortem Date 2025-09-12

Executive Summary

On September 9, 2025, a Kafka consumer group for the user-activity-raw topic fell roughly 6 hours behind after a deployment introduced a deserialization bug that caused the consumer to crash-loop on an unrecognized Avro schema field. The user-activity-raw topic had no effective retention policy (retention.ms=-1 and retention.bytes=-1, i.e. unlimited time and size retention), so all unconsumed messages accumulated on disk. Over roughly 5.5 hours of unchecked lag growth, the Kafka broker's data partition grew from 380 GB to 100% of its 500 GB disk allocation. When the disk reached capacity, the broker entered a read-only failure mode and rejected all new produce requests across every topic it hosted, not just the problematic user-activity-raw topic. Order events, notification events, and audit log writes all stalled. The consumer lag alert (threshold: 1 hour) should have fired hours earlier, but it never did: the metric pipeline that evaluates consumer lag also writes to Kafka and failed along with the broker. Full recovery required freeing disk space by force-deleting old segments and redeploying the consumer with a fixed schema.

Timeline (All times UTC)

Time Event
03:12:00 user-activity-consumer v2.4.1 deployed to production via automated rollout; release notes indicate "schema evolution: add new optional fields"
03:12:45 Consumer begins crashing on deserialization: UnknownFieldException: field 'session_metadata' not present in registered schema v3; consumer enters CrashLoopBackOff
03:13:00 Consumer group user-activity-cg stops committing offsets; user-activity-raw topic lag begins growing at ~800k messages/hour (normal production rate)
08:20:00 Consumer lag reaches approximately 4.1 million messages (about 5 hours of production traffic); broker disk at 91%
08:31:00 Broker disk at 95%; Kafka begins throttling replication to conserve IOPS
08:41:00 Broker disk at 100%; Kafka broker kafka-broker-03 enters read-only mode; all produce requests to this broker return NotLeaderOrFollower or KafkaStorageException
08:41:05 order-events producer begins logging TimeoutException on produce calls; order confirmation pipeline stalls
08:41:10 notification-dispatcher producer logs KafkaStorageException; push notification pipeline stalls
08:41:15 audit-log-writer producer logs repeated produce failures; audit events begin buffering in application memory
08:50:00 SRE on-call Beatriz Santos receives PagerDuty alert: "order-events producer error rate > 5% for 60 seconds"
08:50:45 Beatriz acknowledges; observes kafka-broker-03 errors across multiple producers
08:51:30 Beatriz checks broker metrics; observes disk at 100%, identifies user-activity-raw as source of growth
08:53:00 Data Engineering on-call Chidi Obi joins; confirms consumer deployment at 03:12 and deserialization crash
08:55:00 War room opened; Platform Engineering joins
08:57:00 Decision made: delete oldest log segments from user-activity-raw to free disk, then redeploy fixed consumer
09:00:00 Chidi prepares kafka-delete-records command targeting all offsets older than T-1h (keeping only the last 1 hour of data)
09:03:00 kafka-delete-records executed; 310 GB of old segments scheduled for deletion
09:05:00 Segment deletion completes; broker disk drops to 38%; Kafka exits read-only mode
09:05:30 All producers resume successfully; order-events, notification-dispatcher, audit-log-writer begin writing
09:08:00 Fixed user-activity-consumer v2.4.2 deployed (schema field made nullable); consumer begins processing backlog
09:22:00 Consumer lag back to zero; all notification and order confirmation backlog flushed and delivered
09:31:00 All systems healthy; incident declared resolved
10:00:00 Compliance team notified of audit log gap; gap duration: 41 minutes of delayed writes (08:41 – 09:05 outage plus buffer flush time); confirmed no regulatory threshold exceeded
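The 09:00 recovery step relies on kafka-delete-records, which reads a JSON file mapping each partition to the first offset to retain; all records below that offset are scheduled for deletion. A hedged sketch of what that file could have looked like; the partition count and offset values here are hypothetical, not the values actually used:

```json
{
  "partitions": [
    {"topic": "user-activity-raw", "partition": 0, "offset": 981000000},
    {"topic": "user-activity-raw", "partition": 1, "offset": 979500000},
    {"topic": "user-activity-raw", "partition": 2, "offset": 980250000}
  ],
  "version": 1
}
```

The file is passed via kafka-delete-records.sh --bootstrap-server <broker> --offset-json-file <file>. Deletion advances each partition's log start offset and lets the broker drop whole segments below it, which is why 310 GB was reclaimed within minutes rather than requiring a segment rewrite.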

Impact

Customer Impact

Approximately 28,000 customers who placed orders or received account alerts between 08:41 and 09:22 experienced delayed confirmation emails and push notifications, with delays ranging from 5 to 41 minutes depending on when their event was queued. No orders were lost — the order processing pipeline writes synchronously to the database before publishing to Kafka, so the Kafka failure affected notifications only, not order persistence. Customers who attempted to check order status via the app received accurate status (the status API reads from the database, not Kafka).

Internal Impact

  • SRE: 2 engineers × 2 hours = 4 engineering-hours
  • Data Engineering: 2 engineers × 3 hours = 6 engineering-hours (deployment, fix, schema investigation)
  • Platform Engineering: 1 engineer × 1.5 hours = 1.5 engineering-hours (broker capacity assessment)
  • Compliance: 1 engineer × 2 hours = 2 engineering-hours (audit log gap analysis and documentation for regulatory review)
  • Sprint impact: the Data Engineering team's planned sprint work was blocked for the morning; one story missed its release window

Data Impact

The segment deletion permanently removed roughly 310 GB of user-activity-raw data, including the unconsumed backlog produced between approximately 03:13 and 08:00 (about 5 hours of messages that were never processed downstream). This data represents user activity events used for analytics and recommendation model training. It does not include any order, payment, or PII-sensitive data. The loss affects analytics fidelity for the September 9 cohort but does not trigger any regulatory notification requirement (confirmed with Compliance). Audit log events were buffered in application memory during the outage and successfully flushed once Kafka recovered; no audit events were lost.

Root Cause

What Happened (Technical)

The user-activity-consumer v2.4.1 deployment introduced a schema change in which a new field, session_metadata (type: map), was added to the Avro schema registered as version v4. The consumer's deserialization code used a strict reader schema (v3) that does not include session_metadata and does not tolerate unknown writer fields. When the consumer attempted to deserialize a message produced under schema v4, the Avro library threw UnknownFieldException and the consumer process exited. Kubernetes restarted the consumer immediately; it attempted to process the same message, threw the same exception, and exited again, creating a crash loop that consumed zero messages while the topic continued to receive production traffic at approximately 800,000 messages per hour.
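The v2.4.2 fix described in the timeline makes session_metadata nullable in the reader schema. A sketch of what the evolved field plausibly looks like, assuming a map of strings; the surrounding record name and other fields are illustrative, not the actual user-activity schema:

```json
{
  "type": "record",
  "name": "UserActivity",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "session_metadata",
     "type": ["null", {"type": "map", "values": "string"}],
     "default": null}
  ]
}
```

The union with null plus "default": null is what makes the change safe in both directions under Avro schema resolution: a reader using this schema resolves old v3 records (the field is absent, so the default applies) as well as new v4 records.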

The user-activity-raw topic was created 14 months earlier by an engineer who is no longer at the company. It was created directly via the Kafka CLI (kafka-topics.sh --create), bypassing the internal topic governance process that enforces retention policies. The topic had retention.ms=-1 and retention.bytes=-1 (unlimited retention; -1 is the Kafka default for retention.bytes, while the unlimited time retention overrode the broker's usual 7-day time-based default) and segment.bytes=536870912 (512 MB segments). With no retention policy, the broker's disk was the only limit, and no alert existed on Kafka broker disk utilization.
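The governance baseline described above (and the audit later filed as PM-014-02) can be sketched as a pass over topic configs that flags anything unbounded or looser than the cap. This is an illustrative sketch, not the data-platform Makefile's actual implementation; the config dicts are assumed to come from a prior kafka-configs.sh describe step:

```python
# Illustrative sketch of the PM-014-02 retention audit; names and the
# input shape are assumptions, not the real tooling.

GOVERNANCE_CAP = {
    "retention.ms": 604_800_000,         # 7 days
    "retention.bytes": 107_374_182_400,  # 100 GB
}

def audit_topic(config):
    """Return the overrides needed to bring a topic within the cap.

    An empty dict means the topic already complies; -1 means unbounded.
    """
    overrides = {}
    for key, cap in GOVERNANCE_CAP.items():
        value = int(config.get(key, "-1"))
        if value == -1 or value > cap:
            overrides[key] = cap
    return overrides

topics = {
    "user-activity-raw": {"retention.ms": "-1", "retention.bytes": "-1"},
    "order-events": {"retention.ms": "604800000",
                     "retention.bytes": "107374182400"},
}

for name, config in topics.items():
    needed = audit_topic(config)
    if needed:
        print(f"{name}: apply {needed}")
```

Run against the configs above, only user-activity-raw is flagged; applying the returned overrides would have bounded the topic at 7 days / 100 GB and prevented the unbounded growth.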

The monitoring pipeline that evaluates consumer lag (offset-based lag computed from the group's committed offsets, exported to Prometheus by a Kafka Exporter sidecar) writes its own internal state to Kafka. When kafka-broker-03 entered read-only mode, those internal writes failed, and the exporter entered a degraded state in which it stopped scraping and exporting consumer lag metrics. The consumer lag alert for user-activity-cg therefore never fired despite the group being more than 4 hours behind: the alerting system's data pipeline was broken by the same condition it was supposed to detect.

Contributing Factors

  1. Topic created outside governance process with no retention policy: The internal topic governance process (make kafka-topic-create in the data-platform Makefile) enforces minimum configuration: retention.ms=604800000 (7 days), retention.bytes=107374182400 (100 GB), and requires a topic owner to be registered. The user-activity-raw topic was created directly via CLI, bypassing all of these guardrails. No audit of existing topics for compliance with the governance baseline had been performed.

  2. Disk utilization not monitored on Kafka brokers: The monitoring stack included CPU, network throughput, and JVM heap utilization for Kafka brokers, but not disk utilization. This is a common omission because Kafka operators often assume retention policies bound disk growth, an assumption that fails when topics have no retention policy. Disk utilization is a straightforward metric available via node_exporter; an alert at >80% simply did not exist.

  3. Consumer lag alert relied on the same Kafka infrastructure it was monitoring: The Kafka Exporter that produces consumer lag metrics writes its own operational state to Kafka. When the broker became read-only, the Kafka Exporter failed silently, stopping metric export. The consumer lag alert threshold of 1 hour would have fired — but the underlying metric was absent rather than high. A "metrics absent" alert (Prometheus absent() function) would have caught the Kafka Exporter failure independently.
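The missing alerts in factors 2 and 3 correspond to the rules later filed as PM-014-01 and PM-014-03. A hedged sketch in Prometheus rule syntax; the metric names are the common node_exporter and kafka_exporter defaults, and the mountpoint label and consumer group selector are assumptions about this environment:

```yaml
groups:
  - name: kafka-broker-disk
    rules:
      - alert: KafkaBrokerDiskWarning
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}
             / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) > 0.80
        for: 5m
        labels: {severity: warning}
      - alert: KafkaBrokerDiskCritical
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}
             / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) > 0.90
        for: 1m
        labels: {severity: page}
  - name: kafka-lag-metrics-present
    rules:
      - alert: ConsumerLagMetricsAbsent
        # fail open: page when the exporter stops producing the metric at all
        expr: absent(kafka_consumergroup_lag{consumergroup="user-activity-cg"})
        for: 5m
        labels: {severity: page}
```

The absent() rule is the fail-open counterpart to the lag threshold alert: it fires when the metric disappears entirely, which is exactly the failure mode seen in this incident.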

What We Got Lucky About

  1. Kafka entered read-only mode rather than corrupting existing message data. The broker's failure mode on a full disk is to reject new writes while preserving the integrity of existing log segments. All previously written messages across all topics on the broker were fully intact and readable once disk was freed. A storage system that began corrupting data on full-disk would have been significantly more severe.
  2. The audit-log-writer service buffered failed Kafka writes in memory (a feature added in a previous reliability sprint) rather than dropping them. When Kafka recovered, the buffer flushed successfully, meaning no audit events were lost. If the service had dropped events on failure, this would have triggered a compliance incident requiring regulatory notification.
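The buffering behavior in point 2 can be sketched as a bounded, in-order queue in front of the Kafka producer; the interface and buffer size here are assumptions, not the audit-log-writer's actual code:

```python
# Hedged sketch of buffer-and-flush on produce failure: failed writes land
# in a bounded in-memory queue and are flushed in order once Kafka recovers.
from collections import deque

class BufferedAuditWriter:
    def __init__(self, produce, max_buffered=100_000):
        self._produce = produce  # callable that raises while the broker is down
        self._buffer = deque()
        self._max = max_buffered

    def write(self, event):
        self.flush()  # drain any backlog first so events stay in order
        if self._buffer:
            self._append(event)  # still backlogged: queue behind older events
            return
        try:
            self._produce(event)
        except Exception:
            self._append(event)

    def flush(self):
        while self._buffer:
            try:
                self._produce(self._buffer[0])
            except Exception:
                return  # broker still down; keep buffering
            self._buffer.popleft()

    def _append(self, event):
        if len(self._buffer) >= self._max:
            # buffer full: surface the failure instead of silently dropping
            raise RuntimeError("audit buffer full")
        self._buffer.append(event)
```

The bound matters: an unbounded buffer would trade the Kafka disk-full failure for an application OOM during a long outage, so a full buffer fails loudly rather than dropping events.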

Detection

How We Detected

Detection occurred via a PagerDuty alert on the order-events producer error rate exceeding 5% for 60 seconds. This alert fired 9 minutes after the broker entered read-only mode. The order-events producer error rate is monitored independently of Kafka's internal health because the alert is computed from application-side error counters, not from Kafka metrics, making it resilient to Kafka infrastructure failures. This was the effective detection path; the dedicated consumer lag alert never fired.
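The application-side detection path can be sketched as a sliding-window error-rate check maintained by the producer itself; only the 5% threshold and 60-second window mirror the alert described above, while the class shape and names are purely illustrative:

```python
# Illustrative sketch: the producer tracks its own produce outcomes and
# pages when the error rate over a 60-second window exceeds 5%.
from collections import deque

class ErrorRateMonitor:
    def __init__(self, threshold=0.05, window_seconds=60):
        self.threshold = threshold
        self.window = window_seconds
        self.samples = deque()  # (timestamp, is_error) pairs

    def record(self, timestamp, is_error):
        self.samples.append((timestamp, is_error))
        cutoff = timestamp - self.window
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()  # evict samples outside the window

    def should_page(self):
        if not self.samples:
            return False
        errors = sum(1 for _, is_error in self.samples if is_error)
        return errors / len(self.samples) > self.threshold
```

Because the counters live in the producing application, this check keeps working even when Kafka itself (and anything that depends on it, such as the lag exporter) is down.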

Why We Didn't Detect Sooner

The consumer crash loop began at 03:12 and the broker reached full disk at 08:41 — a 5.5-hour window where the problem was growing undetected. Three detection mechanisms should have caught this earlier and all failed: (1) the consumer lag alert threshold was 1 hour, which would have fired at approximately 04:12, but the Kafka Exporter stopped exporting metrics; (2) no "metrics absent" alert existed for Kafka Exporter outputs; (3) no disk utilization alert existed for Kafka broker nodes. Any one of these would have caught the issue hours before it became an outage.

Response

What Went Well

  1. Beatriz identified the root cause (disk full on kafka-broker-03 from user-activity-raw) within 90 seconds of seeing the broker errors — the Kafka broker metrics dashboard was well-organized and disk utilization was visible even without an alert.
  2. The kafka-delete-records approach was the correct recovery action and Chidi executed it quickly and correctly, freeing 310 GB without affecting any other topic's data.
  3. The v2.4.2 fix (making session_metadata nullable in the reader schema) was prepared and tested in a staging environment before deployment to production, adding only 6 minutes to recovery time but ensuring the fix was correct.

What Went Poorly

  1. The consumer crash loop ran undetected for 5.5 hours. The cascade from a single misbehaving consumer to a full broker disk outage took more than 5 hours to develop, meaning there were multiple points where earlier detection would have prevented the outage entirely.
  2. Correlating the outage with the 03:12 user-activity-consumer deployment was slowed by the fact that the deployment, made more than 5 hours earlier, did not appear on the "recent changes" dashboard, which only shows changes from the past 2 hours by default.
  3. Compliance was not notified until after the incident was resolved, 79 minutes after the audit log gap began. The incident response checklist does not include "notify Compliance if audit logging is affected," which it should; Compliance needs to assess the gap in real time, not retroactively.

Action Items

ID Action Priority Owner Status Due Date
PM-014-01 Add node_exporter disk utilization alert for all Kafka broker nodes; threshold: >80% for 5 minutes (warn), >90% for 1 minute (page) P0 SRE (Beatriz Santos) Done 2025-09-10
PM-014-02 Audit all existing Kafka topics for missing retention configuration; apply 7-day / 100GB retention policy to any topic with retention.ms=-1 P0 Data Engineering (Chidi Obi) In Progress 2025-09-19
PM-014-03 Add Prometheus absent() alert for Kafka Exporter consumer lag metrics; fire if metric absent for >5 minutes P1 SRE (Beatriz Santos) In Progress 2025-09-16
PM-014-04 Enforce topic governance: block kafka-topics.sh --create calls in production via broker-side ACL requiring topic creation to go through the governance API P1 Platform Engineering (Okonkwo Eze) Planned 2025-09-30
PM-014-05 Extend "recent changes" dashboard default time window from 2 hours to 24 hours; add filter for consumer group deployment events P2 Platform Engineering Planned 2025-09-23
PM-014-06 Add "audit logging affected" check to incident response checklist; if true, page Compliance lead within 15 minutes of detection P1 SRE (Beatriz Santos) Done 2025-09-11

Lessons Learned

  1. A single topic without a retention policy can take down all topics on a shared broker: Kafka's disk-full failure mode is broker-wide, not topic-scoped. One unbounded topic starves every other topic on the same broker. This makes topic-level governance (enforced retention policies) a cluster-wide reliability concern, not just a per-topic housekeeping matter.

  2. Monitoring infrastructure that uses the system it monitors creates a blind spot at the worst moment: The consumer lag alerting pipeline used Kafka to write its own state. When Kafka failed, the pipeline failed too — silently. Critical monitoring pipelines should be designed to fail open (raise alerts) rather than fail silent (stop exporting metrics). The absent() pattern in Prometheus is the correct tool for detecting silent monitoring failures.

  3. Deployment-to-outage gaps longer than 1 hour make root cause correlation harder: The consumer crash loop began 5.5 hours before the outage. By the time the war room convened, the deployment that caused the crash was beyond the "recent changes" dashboard's default window. On-call engineers should have instant access to a full 24-hour change log, not just a 2-hour window, because slow-developing failures are common in capacity-related incidents.

Cross-References

  • Failure Pattern: Unbounded Resource Consumption; Monitoring Blind Spot; Blast Radius Expansion via Shared Infrastructure
  • Topic Packs: kafka-operations (retention policies, topic governance, disk management, consumer groups), observability (absent-metric alerting, Prometheus patterns), schema-evolution (Avro compatibility modes, forward/backward compatibility)
  • Runbook: runbook-kafka-disk-full.md, runbook-kafka-consumer-lag.md
  • Decision Tree: Producer errors on Kafka → check broker disk utilization → if disk full, identify largest topic by size → assess retention policy → delete oldest segments with kafka-delete-records to free space → identify consumer lag on large topic → fix consumer and redeploy