Systems Thinking Footguns

Mistakes where fixing one thing breaks another, or where perfectly reasonable decisions combine into system-wide failures.


1. Adding retries without considering system load

Service B returns errors 5% of the time. You add 3 retries. Service B was overloaded — that's why it was erroring. Each failed call now generates up to 4 requests, nearly quadrupling B's load. The error rate climbs from 5% to 30%. You add more retries. Service B dies completely.

Fix: Before adding retries, determine why the service is failing. If it's overloaded, retries make it worse. Add circuit breakers instead: stop calling the overloaded service, give it time to recover. If you must retry, use exponential backoff with jitter and a retry budget (max 10% of requests are retries).
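A minimal sketch of that retry discipline: exponential backoff with full jitter plus a retry budget. `RetryBudget`, `backoff_with_jitter`, and the 10% ratio are illustrative names and defaults, not any specific library's API.

```python
import random
import time

class RetryBudget:
    """Hypothetical helper: cap retries at a fraction of all requests."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Allow a retry only while retries stay under ratio * total requests.
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """Full jitter: sleep a random amount up to the exponential ceiling."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, budget, max_attempts=3):
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1 or not budget.can_retry():
                raise  # budget exhausted: surface the error, don't pile on
            budget.record_retry()
            time.sleep(backoff_with_jitter(attempt))
```

The jitter spreads retries out in time so clients don't hammer the recovering service in synchronized waves; the budget bounds the extra load retries can ever add.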

Under the hood: This is a positive feedback loop — the systems thinking term for a vicious cycle. Load causes errors, errors cause retries, retries increase load. The circuit breaker is a negative feedback loop that breaks the cycle. Understanding feedback loops is the single most valuable mental model for distributed systems reliability.


2. Caching without planning for cache failure

You add a cache. It works great. Traffic grows 5x over a year — the database is fine because the cache handles everything. The cache goes down. 5x traffic hits the database directly. The database can't handle 5x its original load. Total outage — worse than before you added the cache.

Fix: Your database must be sized to survive cache failure, even if that means limiting cache-miss throughput with rate limiting. Test cache failure regularly. Implement cache warming on restart. Use a cache-aside pattern where the application can degrade gracefully (slower, not dead) without the cache.
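One way that degradation could look in code: a cache-aside read that treats a cache outage as a miss and rate-limits misses so the database never sees more load than it can survive. `TokenBucket` and `get_user` are hypothetical helpers, not a particular framework's API.

```python
import time

class TokenBucket:
    """Minimal token bucket used to cap database load on cache misses."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def get_user(user_id, cache, load_from_db, miss_limiter):
    """Cache-aside read that survives a dead cache by degrading, not dying."""
    try:
        value = cache.get(user_id)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache outage: fall through to the database instead of failing

    # Rate-limit misses so a cold or dead cache can't dump full traffic
    # onto a database that was never sized for it.
    if not miss_limiter.acquire():
        raise RuntimeError("503: shedding load, DB at cache-miss capacity")

    value = load_from_db(user_id)
    try:
        cache.set(user_id, value)
    except ConnectionError:
        pass  # best-effort write-back
    return value
```

The key property: with the cache down, the system gets slower and sheds some traffic, but the database stays within its known capacity.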

War story: Facebook's 2021 global outage was exacerbated by a cache stampede: when DNS came back, millions of devices simultaneously reconnected, overwhelming backend services whose caches were cold. The recovery took longer than the initial failure because the system couldn't handle a full cold-start at peak traffic.


3. Setting all timeouts to the same value

Every service in your stack has a 30-second timeout. When the deepest service is slow, every layer waits in lockstep: the outermost caller holds its connection for the full 30 seconds, and so does every intermediate service. Connections pile up at every layer. A single slow endpoint ties up resources across the entire system for 30 seconds per request.

Fix: Timeouts must form a hierarchy. The outermost service should have the longest timeout, and each inner service should have a shorter one. Outer: 10s, Middle: 5s, Inner: 2s. This ensures inner services fail fast, releasing resources, instead of holding them until the outer timeout fires.

Remember: The timeout hierarchy rule: each layer's timeout must be less than its caller's. If Service A (timeout 10s) calls B (timeout 10s) which calls C (timeout 10s), A gives up at 10 seconds while B and C keep working on a response nobody will read, and B has no budget left to retry or fall back. The formula: caller timeout > callee timeout + network overhead + retry time.
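The hierarchy can be enforced mechanically with deadline propagation: the outermost service picks one deadline, and every hop derives its own, strictly smaller timeout from it. A sketch, with the 0.05s per-hop network overhead as an assumed constant:

```python
import time

OVERHEAD = 0.05  # assumed per-hop network overhead, illustrative

def remaining_budget(deadline):
    """Time left before the caller's deadline, minus network overhead."""
    return (deadline - time.monotonic()) - OVERHEAD

def outer(total_timeout=10.0):
    # The outermost service owns the deadline; inner layers inherit it.
    deadline = time.monotonic() + total_timeout
    return middle(deadline)

def middle(deadline):
    timeout = remaining_budget(deadline)
    if timeout <= 0:
        raise TimeoutError("budget spent; fail fast instead of piling up")
    return inner(deadline, parent_timeout=timeout)

def inner(deadline, parent_timeout):
    timeout = remaining_budget(deadline)
    if timeout <= 0:
        raise TimeoutError("budget spent; fail fast instead of piling up")
    # Invariant: each layer's timeout is strictly below its caller's.
    assert timeout < parent_timeout
    return timeout
```

Because each layer subtracts overhead before passing the deadline on, inner services always time out before their callers do, so no layer works on a request its caller has already abandoned.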


4. Scaling up instead of shedding load

Your system is overloaded. You add more instances. Scaling takes 3 minutes. During those 3 minutes, the existing instances are drowning. The new instances come up cold (empty caches, fresh connections), so at first they make things worse, not better.

Fix: The first response to overload should be load shedding, not scaling. Return 503 for non-critical traffic immediately (takes effect in seconds). Then scale up. Load shedding buys you time. Scaling buys you capacity. You need the time before the capacity arrives.
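A sketch of what shedding might look like as a request wrapper; `is_critical` and `overloaded` are assumed callbacks supplied by the application, where overload is typically inferred from in-flight requests, queue depth, or CPU:

```python
def shed_load(handler, is_critical, overloaded):
    """Wrap a handler: reject non-critical traffic with 503 while overloaded."""
    def wrapped(request):
        if overloaded() and not is_critical(request):
            # Shedding takes effect in seconds; scaling takes minutes.
            return {"status": 503, "body": "retry later"}
        return handler(request)
    return wrapped
```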


5. Fixing the symptom instead of the cause

CPU usage is high. You add more CPU. It's high again next week. You add more. The actual cause is a database query doing a full table scan because an index was dropped during a migration. You've been buying hardware to compensate for a missing index.

Fix: Before adding resources, always ask: "Why is this resource being consumed?" Use profiling, not provisioning, as the first response. A flame graph costs nothing. More servers cost money every month, and they're treating the symptom while the cause remains.


6. Tightening coupling to improve performance

Service A makes 10 HTTP calls to Service B per request. You "optimize" by giving Service A direct database access to B's data. Performance improves immediately. Six months later, Service B changes its schema. Service A breaks because it was reading B's database directly. Every change to B now requires coordinating with A.

Fix: Tight coupling trades long-term reliability for short-term performance. Instead, batch the 10 calls into 1 (API optimization), add a cache, or use async patterns. Never bypass a service's API to access its database — that coupling will cause more outages than the latency you saved.
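The batching alternative can be as small as this; `fetch_one` and `fetch_many` stand in for calls to Service B's real API, and the batch endpoint is assumed to exist (or be worth adding):

```python
# Before: one round trip per item, 10 network calls per request.
def get_items_chatty(ids, fetch_one):
    return [fetch_one(item_id) for item_id in ids]

# After: one round trip total, via a batch endpoint on B's API
# (e.g. something like GET /items?ids=1,2,...), never B's database.
def get_items_batched(ids, fetch_many):
    by_id = fetch_many(ids)
    return [by_id[item_id] for item_id in ids]
```

You keep most of the latency win, but the contract stays at B's API boundary, so B remains free to change its schema.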


7. Ignoring queue depth until it's too late

Your message queue grows slowly — 100 messages behind, then 500, then 2,000. Nobody alerts on queue depth because "it'll catch up." At 50,000 messages, the consumer OOMs trying to process the backlog. Messages are lost. Even after recovery, the queue has hours of stale messages that are no longer relevant.

Fix: Alert on queue depth AND queue growth rate. A growing queue means arrival rate exceeds processing rate — it will never catch up on its own. Set a max queue size with dead-letter handling. If the queue exceeds a threshold, investigate why consumers can't keep up. Sometimes the right answer is to drop stale messages rather than process a 6-hour backlog.
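A sketch of the two alerts, computed from periodic (timestamp, depth) samples; the thresholds are illustrative:

```python
def queue_alerts(samples, depth_limit=10_000, growth_limit=0.0):
    """Flag both absolute queue depth and net growth rate.

    `samples` is a list of (timestamp_seconds, depth) pairs. A positive
    growth rate means arrival rate exceeds processing rate: the queue
    will never catch up on its own.
    """
    alerts = []
    (t0, d0), (t1, d1) = samples[-2], samples[-1]
    rate = (d1 - d0) / (t1 - t0)  # net messages per second
    if d1 > depth_limit:
        alerts.append("depth")
    if rate > growth_limit:
        alerts.append("growing")
    return alerts
```

Note the growth alert fires long before the depth alert does: a queue at 2,000 messages and climbing is a worse sign than a queue at 9,000 and draining.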


8. Making the system "more resilient" by adding more components

Your system has reliability issues. You add a cache layer, a message queue, a circuit breaker library, a service mesh, and a CDN. Each component adds its own failure modes. Now you have 5 additional things that can break, each with its own configuration, monitoring, and operational expertise requirements. The system is more complex but not more reliable.

Fix: Before adding a component for resilience, ask: "Does this reduce the total number of failure modes, or increase it?" A circuit breaker reduces failure modes (prevents cascades). A service mesh adds failure modes (sidecar crashes, control plane outages) while providing observability. The trade-off may be worth it, but make the trade-off consciously.


9. Optimizing for the average case instead of the tail

Your P50 latency is 50ms. Average throughput is 500 RPS. You capacity-plan for averages. At 3am, a batch job runs and adds 200 RPS. At the same time, P99 latency spikes to 2 seconds. Connection pools (sized for average concurrency) exhaust. The system degrades for everyone because you planned for the mean, not the tail.

Fix: Capacity-plan for P99, not P50. Size connection pools for peak concurrency (Little's Law at P99 latency), not average. Account for correlated traffic (batch jobs, cron, marketing events). Reserve 30% headroom above measured peak. The average lies to you — the tail is where outages live.
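The Little's Law arithmetic from the example, as a sketch (`pool_size` is a hypothetical helper; the numbers and the 30% headroom mirror the text):

```python
import math

def pool_size(peak_rps, p99_latency_s, headroom=0.3):
    """Little's Law: concurrent requests L = arrival rate * latency.

    Sized for peak traffic and P99 latency, plus headroom above peak.
    """
    concurrent = peak_rps * p99_latency_s
    return math.ceil(concurrent * (1 + headroom))

# Average-case plan: 500 RPS at 50 ms P50 suggests only ~25 connections.
# Tail-aware plan: 700 RPS (batch job included) at 2 s P99 needs 1400,
# plus 30% headroom: 1820. Plan for the second number, not the first.
```

The two plans differ by almost two orders of magnitude, which is exactly why a pool sized from averages exhausts the moment the tail and the batch job coincide.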


10. Believing your architecture diagram reflects reality

Your architecture diagram shows clean boxes with arrows. Service A calls B. B calls C. Neat and simple. In reality, A also retries on B with 3 attempts, B has a background job that calls A back, C writes to a queue that triggers a lambda that calls A, and everyone talks to the same Redis instance. Your diagram shows the design. The system's actual behavior includes the retries, the crons, the side effects, and the shared resources.

Fix: Map the actual system, not the intended system. Include retry paths, cron jobs, background workers, shared resources, and async callbacks. Use distributed tracing to discover dependencies you didn't know existed. The architecture diagram should make you uncomfortable — if it looks clean, it's incomplete.

Gotcha: Shared resources are hidden coupling. Two services that "don't depend on each other" but share a Redis instance, a Kafka cluster, or a NAT gateway are coupled through that shared resource. When the shared resource degrades, both fail simultaneously — and the architecture diagram shows them as independent.