Systems Thinking¶
10 cards — 🟢 3 easy | 🟡 4 medium | 🔴 3 hard
🟢 Easy (3)¶
1. What is the difference between a negative (balancing) and positive (reinforcing) feedback loop in infrastructure?
Show answer
A negative feedback loop maintains stability (e.g., autoscaler adds pods when request rate rises, then removes pods when it drops). A positive feedback loop amplifies change until something breaks (e.g., retry storm: timeouts cause retries, retries increase load, more timeouts, more retries, total failure).

Remember: "Systems thinking = see the forest, not just the trees." Focus on relationships and feedback loops, not isolated components.
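The two loop types can be sketched as a toy simulation. All function names, constants, and scaling rules below are invented for illustration — this is not a real autoscaler or retry library:

```python
# Balancing (negative) loop: an error signal pushes the system back toward
# a target, so the simulation converges to a stable state.
def balancing_autoscaler(total_load, target=100, pods=2, steps=20):
    """Adjust pod count until per-pod load settles near the target."""
    for _ in range(steps):
        per_pod = total_load / pods
        if per_pod > target:
            pods += 1                       # scale out: pushes per-pod load down
        elif per_pod < target * 0.5 and pods > 1:
            pods -= 1                       # scale in when underutilized
    return pods

# Reinforcing (positive) loop: failed requests are retried, and the retry
# traffic itself adds load, so the simulation grows without bound.
def retry_storm(base_rps, failure_rate=0.5, retries=3, steps=5):
    """Each round, retry traffic stacks on top of new traffic."""
    load = base_rps
    for _ in range(steps):
        failed = load * failure_rate
        load = base_rps + failed * retries  # amplification: more load -> more retries
    return load
```

With 600 RPS total, the balancing loop settles at a fixed pod count; the retry storm grows every round because `failure_rate * retries > 1` makes the loop gain exceed one.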
2. How does component thinking differ from systems thinking when diagnosing issues?
Show answer
Component thinking: "The database is slow" and "The API has high latency" — two separate problems. Systems thinking: "The API retries on database timeouts, increasing database load, making it slower, causing more retries" — one feedback loop, one problem, fix the loop.

Remember: "Feedback loops: positive = amplifying, negative = stabilizing." A thermostat is a negative feedback loop (stabilizes temperature). Viral growth is positive (amplifying).
Gotcha: "Positive" doesn't mean good — it means self-reinforcing.
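The API-retries-on-database-timeouts loop from this card can be modeled in a few lines. The latency formula and thresholds are made-up numbers chosen only to show the tipping point, not measurements from any real system:

```python
# Coupled loop: database latency grows with load; requests that exceed the
# timeout are retried, and the retries add load back onto the database.
def api_db_loop(base_rps, timeout_ms=200, retries=1, steps=10):
    """Return per-round DB latency; diverges once retries start firing."""
    load = base_rps
    history = []
    for _ in range(steps):
        latency_ms = 50 + load * 0.5           # toy model: latency rises with load
        timed_out = load if latency_ms > timeout_ms else 0
        load = base_rps + timed_out * retries  # retried traffic stacks on new traffic
        history.append(latency_ms)
    return history
```

Below the timeout threshold the loop is inert and latency holds steady; once latency crosses the timeout, every request is retried and each round's latency is higher than the last — the single feedback loop the card describes.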
3. What is the difference between tight and loose coupling in infrastructure?
Show answer
Tight coupling: Service A directly calls Service B synchronously — A breaks when B breaks. Loose coupling: Service A puts a message on a queue, Service B reads when ready — A survives B's failure. The queue absorbs the shock.

🟡 Medium (4)¶
1. Why can adding retries to fix a 10% error rate actually cause a total outage?
Show answer
If Service B is failing because it's overloaded, adding 3 retries to Service A multiplies the load on B by up to 4x (the original attempt plus 3 retries). The extra load pushes B's failure rate well past the original 10%; more failures trigger more retries, which add still more load, until B crashes completely. The fix was correct in component thinking but catastrophic in systems thinking. Use a retry budget instead: only retry if the total retry rate is below 10% of requests.

2. What is emergent behavior and how does the "thundering herd" illustrate it?
Show answer
Emergent behavior is system-level behavior no individual component was designed to produce. Thundering herd: each server individually does "when cache is empty, fetch from database" — correct for one server. But 100 servers simultaneously discover the cache is empty, fire 100 identical queries, and collapse the database. No single server did anything wrong; the failure emerged from correct individual behaviors at scale.

3. Explain Little's Law and why a small latency increase can cause a total outage.
Show answer
L = λW (concurrent requests = arrival rate × average latency). At 100 RPS and 200ms latency, L = 20. If latency doubles to 400ms, L = 40 — you need twice the connection pool. At 2 seconds latency, L = 200 — a pool of 50 is exhausted, requests queue, latency increases further, creating a positive feedback loop that kills the system.

4. Describe the typical cascade that takes a system from a single database lock to total user-facing failure.
Show answer
(1) Database runs a long query with a table lock. (2) API connections queue. (3) Connection pool exhausts. (4) API returns 503s. (5) Load balancer marks the API unhealthy. (6) Traffic shifts to remaining instances. (7) They get 2x traffic and also exhaust. (8) All API instances down. (9) Users refresh, adding more traffic. This can take under 60 seconds.

🔴 Hard (3)¶
1. How do circuit breakers prevent cascading failures, and what are their three states?
Show answer
Circuit breakers sit between client and service. Three states: CLOSED (normal, requests pass through), OPEN (errors exceeded threshold, requests fail immediately without calling the backend — prevents pile-up), HALF-OPEN (after cooldown, lets one test request through; if it succeeds, closes the circuit; if it fails, stays open). This prevents cascading failures by failing fast instead of amplifying load.

2. Why do systems become nonlinear near capacity, and what is the critical insight about utilization?
Show answer
Queueing theory shows that latency is not linear with load. Going from 70% to 80% utilization might add 10ms. From 80% to 90% adds 100ms. From 90% to 95% adds 500ms. Most production systems live at 70-80% capacity, where small load increases cause disproportionate latency spikes. This is why capacity planning must leave headroom.

3. What is "fix-induced failure" and what six questions should you ask before implementing a fix?