
Pattern: Missing Backpressure

ID: FP-020 · Family: Cascading Failure · Frequency: Common · Blast Radius: Multi-Service · Detection Difficulty: Moderate

The Shape

A producer generates work faster than a consumer can process it. Without backpressure (a mechanism for the consumer to signal "slow down" to the producer), the queue between them grows without bound. Eventually the queue exhausts memory or storage; the system crashes. The producer appears healthy because it's successfully enqueuing; the consumer appears healthy because it's processing at full speed; the queue is the silent failure point.
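The shape above reduces to simple arithmetic: if the producer outpaces the consumer by any fixed margin, queue depth grows linearly forever. A minimal Go simulation (the `simulateQueue` function and its rates are illustrative, not from any real system):

```go
package main

import "fmt"

// simulateQueue models an unbounded queue between a producer and a slower
// consumer. produceRate and consumeRate are items per tick; the returned
// slice is the queue depth after each tick.
func simulateQueue(produceRate, consumeRate, ticks int) []int {
	depths := make([]int, 0, ticks)
	depth := 0
	for i := 0; i < ticks; i++ {
		depth += produceRate // producer enqueues successfully every tick
		processed := consumeRate
		if processed > depth {
			processed = depth // consumer can't process more than is queued
		}
		depth -= processed
		depths = append(depths, depth)
	}
	return depths
}

func main() {
	// Producer outpaces consumer by 40 items/tick: depth grows linearly.
	fmt.Println(simulateQueue(100, 60, 5)) // [40 80 120 160 200]
}
```

Note that every individual enqueue and dequeue succeeds — which is exactly why both sides look healthy while the queue dies.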

How You'll See It

In Kubernetes

A Kafka consumer group is undersized: 2 consumers for 10 partitions, with 100k messages/min arriving. Consumer lag grows: 10k → 100k → 1M messages behind. Consumer pod memory grows as messages are buffered for processing; eventually the pods are OOMKilled (FP-004). Producers don't back off, because Kafka accepts their messages without error.

In Linux/Infrastructure

A log shipper (Filebeat) reads logs faster than it can write them to Elasticsearch. The internal memory queue fills to its limit (queue.mem.events: 4096), and new log events are dropped silently. df on the log host shows a full disk (Filebeat spills to disk when the memory queue is full).

In CI/CD

Build artifact uploads are queued locally when the artifact registry is slow. Queue grows to fill the build agent's disk. The build job crashes with "no space left on device."

In Networking

TCP provides backpressure via flow control: the receiver advertises how much buffer space it has (the receive window). A zero window (receiver buffer full) stops the sender. This is backpressure working. Application-level queues (message queues, in-process channels) have no such built-in mechanism, so backpressure must be implemented at the application level.

The Tell

Consumer lag (or queue depth) grows monotonically. Producer metrics show all operations successful (no errors). Consumer metrics show full CPU utilization (processing as fast as possible). Queue memory or disk grows without bound.

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| Consumer is slow (needs scaling) | Backpressure absent (producer needs throttling) | Scaling the consumer may not help if the producer can always outpace it |
| Message loss | Queue overflow causing drops | Queue depth metrics show a high-water mark; the drop counter increments at the queue ceiling |
| Memory leak | Unbounded queue growth | Memory growth tracks message arrival rate, not time |

The Fix (Generic)

  1. Immediate: Pause the producer; allow the consumer to drain the queue; monitor queue depth.
  2. Short-term: Add a bounded queue with explicit overflow handling (drop-oldest, drop-newest, or block producer); implement producer rate limiting.
  3. Long-term: Implement backpressure: have the consumer signal "slow down" to the producer (Reactive Streams request(n), application-level flow control modeled on TCP's, Kafka quotas, rate limiters that block the producer).
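Step 2's bounded queue with explicit overflow handling can be sketched as follows; the `offer` helper and `Policy` names are hypothetical, and the "block producer" variant is simply a blocking send on a bounded channel:

```go
package main

import "fmt"

// Policy decides what happens when a bounded queue is full.
type Policy int

const (
	DropNewest Policy = iota // reject the incoming item
	DropOldest               // evict the oldest queued item to make room
)

// offer pushes item onto q (bounded by capacity) under the given policy
// and reports whether any item was dropped.
func offer(q []int, capacity, item int, p Policy) ([]int, bool) {
	if len(q) < capacity {
		return append(q, item), false // room available: normal enqueue
	}
	switch p {
	case DropOldest:
		return append(q[1:], item), true // evict head, keep newest
	default: // DropNewest
		return q, true // queue unchanged, incoming item rejected
	}
}

func main() {
	q := []int{1, 2, 3}
	q1, dropped := offer(q, 3, 4, DropOldest)
	fmt.Println(q1, dropped) // [2 3 4] true
	q2, dropped2 := offer(q, 3, 4, DropNewest)
	fmt.Println(q2, dropped2) // [1 2 3] true
}
```

Whichever policy you pick, the crucial part is the returned `dropped` flag: overflow becomes an explicit, countable event instead of silent memory growth.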

Real-World Examples

  • Example 1: Event-driven pipeline: event producer at 50k/min; consumer at 30k/min. Kafka lag grew from 0 to 2M over 6 hours. Consumer OOMKilled from buffering. Event processing was 4 hours behind real time by the time engineers noticed.
  • Example 2: In-process channel with no buffering: an unbuffered chan Event. Under a traffic spike, goroutines blocked on send; the goroutine count grew to 500k and the process was OOMKilled.
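One common fix for Example 2's failure mode — offered as a sketch, not the actual patch from that incident — is a counting semaphore that caps in-flight handler goroutines, so spawners block at a fixed ceiling instead of piling up by the hundreds of thousands:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handleAll processes n events with at most maxInFlight handler goroutines
// running at once, using a counting semaphore (a bounded channel). It
// returns the peak number of concurrent handlers actually observed.
func handleAll(n, maxInFlight int) int64 {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < n; i++ {
		sem <- struct{}{} // acquire: blocks once maxInFlight are running
		wg.Add(1)
		go func() {
			defer wg.Done()
			cur := atomic.AddInt64(&inFlight, 1)
			for { // record the high-water mark
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			// ... handle the event here ...
			atomic.AddInt64(&inFlight, -1)
			<-sem // release
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak in-flight:", handleAll(1000, 8)) // never exceeds 8
}
```

The semaphore turns unbounded goroutine growth into backpressure: the event source slows down as soon as all handler slots are busy.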

War Story

The system worked fine in testing (10k events/min). In production (60k events/min peak), the Kafka consumer lag alert fired at hour 2 of a promotion. The consumer was working hard — CPU at 100%. But the producer (user actions) was 2× faster. We scaled consumers from 2 to 6 (matching partition count); that bought another 2 hours. Long-term fix: added a rate limiter on the event producer that checked consumer lag via Kafka Admin API and throttled event creation if lag exceeded 50k. Never saw unbounded lag again.
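The lag-aware limiter from the story can be sketched as a pure rate function; the `throttledRate` name, the linear ramp, and the maxLag cutoff are illustrative assumptions (the real limiter read lag via the Kafka Admin API, which is omitted here), though the 50k threshold echoes the one above:

```go
package main

import "fmt"

// throttledRate scales the producer's allowed rate by current consumer lag:
// full speed below lagThreshold, a linear ramp down above it, and a full
// stop at maxLag. baseRate and the returned rate are events/min.
func throttledRate(baseRate float64, lag, lagThreshold, maxLag int64) float64 {
	switch {
	case lag <= lagThreshold:
		return baseRate // healthy: no throttling
	case lag >= maxLag:
		return 0 // far behind: stop producing entirely
	default:
		// Linear ramp from full rate at the threshold to zero at maxLag.
		frac := float64(maxLag-lag) / float64(maxLag-lagThreshold)
		return baseRate * frac
	}
}

func main() {
	fmt.Println(throttledRate(60000, 10000, 50000, 200000))  // 60000 (no throttle)
	fmt.Println(throttledRate(60000, 125000, 50000, 200000)) // 30000 (halfway)
	fmt.Println(throttledRate(60000, 250000, 50000, 200000)) // 0 (stopped)
}
```

The key property is that the function is monotonic in lag: the further behind the consumer falls, the harder the producer is throttled, so lag can no longer grow without bound.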

Cross-References

  • Topic Packs: distributed-systems, k8s-ops
  • Related Patterns: FP-019 (no circuit breaker — same cascade, different trigger), FP-023 (thread pool exhaustion — backpressure failure in thread-based systems)