Systems Thinking for Engineers - Street-Level Ops¶
What experienced SREs know about system behavior that textbooks explain in theory but production teaches in pain.
Quick Diagnosis Commands¶
```shell
# When you suspect a feedback loop (cascading failure in progress):

# Check retry rates (are retries amplifying load?)
# In Prometheus/Grafana, query:
#   rate(http_client_requests_total{status=~"5.."}[5m])
#   vs rate(http_client_requests_total[5m])
# If retry rate > 20% of total requests, you have a retry storm

# Check connection pool saturation
ss -s                              # Socket summary (total connections)
ss -tn state established | wc -l   # Active connections
ss -tn state time-wait | wc -l     # Connection churn

# Check queue depths (are queues growing unbounded?)
rabbitmqctl list_queues name messages consumers   # RabbitMQ
redis-cli llen <queue-name>                       # Redis

# Check thread pool exhaustion
jstack <pid> | grep -c "WAITING\|BLOCKED\|TIMED_WAITING"   # Java
grep Threads /proc/<pid>/status                            # Generic

# System-wide saturation signals
vmstat 1 5       # CPU, memory, I/O wait
iostat -x 1 5    # Disk saturation
sar -n DEV 1 5   # Network saturation
```
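The retry-rate check above can be scripted. A minimal sketch, assuming you can pull a retried-request count and a total-request count out of your metrics system (the numbers below are made-up examples):

```shell
# Turn raw counters into a retry-storm verdict.
# Inputs are example numbers; feed in values from your metrics query.
retry_pct() { awk -v r="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 100 * r / t }'; }

pct=$(retry_pct 450 1500)        # 450 retried requests out of 1500 total
echo "retries: ${pct}% of requests"
if [ "$pct" -gt 20 ]; then echo "RETRY STORM LIKELY"; fi
```

The 20% threshold mirrors the rule of thumb above; tune it to your traffic.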
Gotcha: The Cache That Made Things Worse¶
You added a Redis cache in front of your database to reduce load. Latency improved. Six months later, Redis goes down. All traffic hits the database simultaneously. The database handled this traffic before the cache existed, but now it can't — because the cache hid the fact that traffic grew 5x. The database was never scaled because the cache masked the growth.
Fix: Cache failures must be survivable. Test cache failure regularly (chaos engineering).
Remember: A cache does not reduce load; it hides load. The underlying system must be sized to survive a cache-miss storm, or the cache becomes a single point of failure that is worse than having no cache at all. Track the ratio of cache-hit traffic to direct-database traffic, and size the database for a defined fraction of the full cache-miss load (ideally 100% of steady-state traffic). Use circuit breakers so that a cache failure degrades gracefully (stale data, slower responses) instead of crashing the database.
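The sizing math is worth making explicit. At hit ratio h the database sees only (1 − h) of traffic, so a dead cache multiplies database load by 1 / (1 − h). A quick sketch:

```shell
# At hit ratio h the database sees (1 - h) of traffic,
# so a dead cache multiplies database load by 1 / (1 - h).
db_amplification() { awk -v h="$1" 'BEGIN { printf "%.0f\n", 1 / (1 - h) }'; }

db_amplification 0.80   # cache down: 5x database load
db_amplification 0.95   # cache down: 20x database load
```

Note how a "better" cache (higher hit ratio) makes the failure mode worse: at 95% hit ratio the database has been seeing 5% of traffic and suddenly gets 20x its normal load.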
Gotcha: The Autoscaler That Oscillated¶
Your HPA scales pods based on CPU utilization. Target: 50%. Traffic spike → CPU hits 70% → HPA adds 5 pods → CPU drops to 30% → HPA removes 3 pods → CPU hits 60% → HPA adds 2 pods → repeat. Your pod count oscillates every 2 minutes. Each scale event disrupts in-flight requests.
Fix: This is a classic feedback loop stability problem. The system is oscillating because the response (scaling) is too fast relative to the measurement (CPU average). Tune the HPA:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
    policies:
      - type: Percent
        value: 10                     # Remove max 10% of pods at a time
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 50                     # Add max 50% more pods at a time
        periodSeconds: 60
```
Gotcha: The Timeout Chain Nobody Mapped¶
Service A (timeout: 30s) calls Service B (timeout: 30s) calls Service C (timeout: 30s). Under normal conditions, each call takes 50ms. Nobody thinks about timeouts until Service C gets slow. Now Service B waits 30s for C, and Service A waits the full 30s for B. A single slow endpoint hangs every request through the chain for 30 seconds, and a single retry at any layer doubles that. Meanwhile, connections stack up at every layer.
Fix: Map your timeout chain and enforce a hierarchy:
Service A timeout: 10s (outermost — longest)
└── Service B timeout: 5s (inner — shorter)
└── Service C timeout: 2s (innermost — shortest)
Rule: each layer's timeout must be LESS than its caller's timeout.
Otherwise the caller times out while the callee is still working,
wasting resources on a request that's already dead.
> **Under the hood:** When a caller times out before its callee, the callee completes the work but the response is discarded — this is called "wasted work." Under high load, wasted work compounds: the system burns CPU on requests that are already dead upstream, reducing capacity for live requests and accelerating the cascade.
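One way to keep the hierarchy consistent is to derive every layer's timeout from the edge budget instead of configuring each service independently. A minimal sketch (the halving ratio is an assumption; any ratio below 1 preserves caller > callee):

```shell
# Derive a consistent timeout hierarchy from the edge timeout.
# Each layer gets half its caller's budget (ratio is an assumption).
timeout_chain() {
  budget=$1; shift
  for svc in "$@"; do
    echo "Service $svc timeout: ${budget}s"
    budget=$((budget / 2))
  done
}

timeout_chain 10 A B C
```

This reproduces the 10s / 5s / 2s hierarchy above, and adding a fourth layer automatically stays inside the budget.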
Gotcha: Load Balancer Healthy But Service Degraded¶
Your health check returns 200. The load balancer says all instances are healthy. But users experience 5-second response times because the service is overwhelmed — it's accepting new connections (healthy) but processing them slowly (degraded). The health check doesn't test actual request processing.
Fix: Implement a meaningful readiness check, not just a liveness check:
Liveness:  "Is the process running?"                            → restart if no
Readiness: "Can this instance handle new requests effectively?" → remove from LB if no
A good readiness check verifies:
├── Database connection pool has available connections
├── Response time for a test query is < threshold
├── Queue depth is < threshold
├── Memory usage is < 90%
└── Dependent services are reachable
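In Kubernetes terms, that split maps onto two separate probes. A hypothetical config fragment (the paths and port are assumptions; the /ready handler is assumed to run the checks listed above):

```yaml
# Hypothetical probe config: /ready is assumed to run the checks above
# (pool headroom, test query latency, queue depth, memory).
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 3   # 3 failed checks => removed from the Service LB
```

Keep the liveness check cheap and dumb; a liveness probe that fails under load turns "degraded" into "restart loop."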
Gotcha: The Fix That Shifted the Bottleneck¶
Database was the bottleneck. You added read replicas. Database is no longer the bottleneck. Now the network between API and database is the bottleneck. You optimize the network. Now the API CPU is the bottleneck. Every system has exactly one bottleneck. Fixing it reveals the next one.
Fix: This is normal — it's called the Theory of Constraints. Don't be surprised when fixing one bottleneck reveals another. The goal isn't to eliminate all bottlenecks (impossible). The goal is to ensure the bottleneck is in the place you choose, at a capacity you can control. Before optimizing, ask: "If I fix this, where does the bottleneck move?"
One-liner: Every system has exactly one bottleneck. Fix it, and you promote the second-worst component to the new bottleneck. This is not a failure of engineering — it is the fundamental nature of systems. Plan for it.
Pattern: The Cascading Failure Circuit Breaker¶
When you see a cascade forming in real time:
Minute 0: Service C latency increases from 50ms to 500ms
Minute 1: Service B connection pool filling up (70%)
Minute 2: Service B connection pool exhausted (100%)
Service B starts returning 503s
Minute 3: Service A retries → 3x load on B → B crashes
Minute 4: All downstream services affected
BREAK THE CASCADE:
Option 1: Shed load at the edge
# Rate limit at the load balancer
# Reject excess requests with 429 before they enter the system
Option 2: Open the circuit breaker
# Service A stops calling B entirely
# Returns degraded response (cached data, default values, error)
# B gets breathing room to recover
Option 3: Reduce the blast radius
# If C is the root cause, isolate C
# Service B should timeout on C quickly (2s, not 30s)
# Service B returns partial results without C's data
Option 4: Drain the queues
# If queues are full, the system is processing yesterday's load
# Sometimes you have to drop queued messages to recover
# Serve current users, not the backlog
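Option 2 is usually implemented in a library (Resilience4j, Envoy, etc.), but the core state machine is small enough to sketch in shell. The names and thresholds here are assumptions (5 consecutive failures opens the circuit for 30 seconds):

```shell
# Minimal circuit-breaker sketch. After 5 consecutive failures the
# circuit opens for 30s and calls fail fast with a degraded response.
FAILS=0
OPEN_UNTIL=0

call_b() {
  now=$(date +%s)
  if [ "$now" -lt "$OPEN_UNTIL" ]; then
    echo "circuit open: serving degraded response"
    return 1
  fi
  if "$@"; then
    FAILS=0                           # success resets the failure count
  else
    FAILS=$((FAILS + 1))
    if [ "$FAILS" -ge 5 ]; then
      OPEN_UNTIL=$((now + 30))        # open: B gets breathing room
      echo "circuit opened, failing fast for 30s"
    fi
    return 1
  fi
}
```

While open, callers get an instant degraded answer instead of a 30-second hang, and B receives zero traffic while it recovers.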
Pattern: Identifying Feedback Loops in Your Architecture¶
Draw your system diagram. For every arrow, ask: "Is there a return path?"
Step 1: Map dependencies
A → B → C → Database
Step 2: Add retry/timeout behaviors
A →(3 attempts)→ B →(3 attempts)→ C →(3 attempts)→ Database
Step 3: Calculate amplification
1 user request can fan out into up to 3 × 3 × 3 = 27 database queries
Step 4: Add indirect feedback
Database slow → C slow → B retries → Database gets 3x load
→ Database slower → C slower → B retries more → Database crashes
Step 5: Identify breaking points
Circuit breaker between A and B: limits cascade to A's layer
Rate limit on database: protects the root resource
Timeout hierarchy: A=10s, B=5s, C=2s: prevents stack-up
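The amplification in Step 3 generalizes: with N attempts per layer across D layers, one user request can fan out into N^D queries at the root. A one-liner to plug your own numbers into:

```shell
# Retry amplification: N attempts per layer over D layers => N^D root queries.
fanout() { awk -v n="$1" -v d="$2" 'BEGIN { print n ^ d }'; }

fanout 3 3   # 3 layers, 3 attempts each
fanout 4 3   # one extra attempt per layer more than doubles the damage
```

This is why "just add one more retry" is never a local decision: the cost is exponential in chain depth.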
Pattern: Capacity Planning with Little's Law¶
Use Little's Law to reason about capacity before you need it:
Current state:
Throughput: 500 RPS
Latency: 100ms (mean — Little's Law works on averages, not percentiles)
Concurrency: 500 × 0.1 = 50 concurrent requests
Connection pool: 100 (50% utilized — comfortable)
What happens at 2x traffic?
Throughput: 1000 RPS
If latency stays 100ms: concurrency = 100 (pool at 100%)
But latency won't stay 100ms at higher load...
Realistic latency at 2x load: ~150ms (queuing effects)
Actual concurrency: 1000 × 0.15 = 150
Connection pool: EXHAUSTED (150 > 100)
Result: requests queue, latency spikes, system degrades
Pre-emptive fix:
Increase connection pool to 200
OR add capacity (more instances) to keep latency at 100ms
OR shed load (rate limiting) to stay at 500 RPS
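The arithmetic above is simple enough to keep in a helper so you can sanity-check pool sizing during planning. A sketch:

```shell
# Little's Law: in-flight requests L = throughput (RPS) x latency (seconds).
concurrency() { awk -v rps="$1" -v lat="$2" 'BEGIN { printf "%.0f\n", rps * lat }'; }

concurrency 500 0.10    # today: 50 in-flight, pool of 100 is comfortable
concurrency 1000 0.15   # 2x traffic at queue-inflated latency: pool exhausted
```

The key habit: never multiply throughput by today's latency. Latency rises with load, and the product rises faster than the traffic itself.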
Pattern: Asking "And Then What?"¶
Practice this for every proposed change:
Proposed: "Let's add a 30-second cache TTL to reduce database load"
And then what?
├── Cache hit ratio improves → database load drops 60% ✓
├── But: users see stale data for up to 30 seconds
│ └── And then: user updates a record, refreshes, sees old data
│ └── And then: user reports a "bug," support tickets increase
├── But: cache invalidation on write is now needed
│ └── And then: we need a pub/sub system for cache invalidation
│ └── And then: another dependency that can fail
├── But: cache thundering herd when TTL expires
│ └── And then: all instances fetch from DB simultaneously
│ └── And then: database spike every 30 seconds
│ └── Fix: stagger TTLs with jitter
└── But: cold start after deployment (all caches empty)
└── And then: deployment causes a database load spike
└── Fix: cache warming on startup
The cache is still the right call — but asking "and then what?" surfaces five additional concerns you can address proactively instead of reactively.
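The thundering-herd fix above ("stagger TTLs with jitter") is a one-liner. A sketch, assuming a 30-second base TTL with up to 20% random spread (bash-specific `$RANDOM`):

```shell
# Add random jitter so cached keys don't all expire (and refill from
# the database) at the same instant. 0-20% spread is an assumption.
jittered_ttl() {
  base=$1
  echo $(( base + RANDOM % (base / 5 + 1) ))
}

ttl=$(jittered_ttl 30)   # somewhere in 30..36 seconds
# e.g. redis-cli set mykey myvalue EX "$ttl"
```

The spread turns one synchronized database spike every 30 seconds into a smear of small refills.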
Gotcha: "And then what?" is the cheapest form of chaos engineering. It costs zero infrastructure, takes five minutes, and catches failure modes that would otherwise surface at 3 AM. Apply it to every design review, every production change, every incident mitigation.
Emergency: Cascading Failure in Progress¶
Everything is going down. Services are timing out. Alerts are firing everywhere. The system is in a positive feedback loop of failure.
1. STOP RETRIES
- Enable circuit breakers on all upstream services
- If no circuit breakers: manually reduce traffic at the edge
- Rate limit or block at the load balancer
2. IDENTIFY THE ROOT
- What failed FIRST? (check alert timeline)
- Often it's the deepest dependency (database, DNS, storage)
- Everything else is a symptom of the root failure
3. SHED LOAD
- Return 503 at the edge for non-critical traffic
- Disable non-essential features (analytics, recommendations)
- Serve cached/static responses where possible
4. FIX THE ROOT
- Scale the bottleneck resource
- Restart the failed component
- Failover to backup
5. RECOVER GRADUALLY
- Don't restore all traffic at once
- The system needs time to warm caches, fill pools, stabilize
- Increase traffic in steps: 25% → 50% → 75% → 100%
- Watch metrics at each step before proceeding
6. POST-INCIDENT
- Map the full cascade chain
- Identify every missing circuit breaker
- Add timeout hierarchies
- Plan chaos experiments to test the fixes
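Step 5's ramp is worth scripting in advance so nobody improvises it at 3 AM. A sketch where `set_edge_weight` is a hypothetical placeholder for your load balancer's traffic-split command:

```shell
# Gradual recovery: restore edge traffic in steps, pausing to watch
# metrics. set_edge_weight is a hypothetical placeholder command.
set_edge_weight() { echo "edge traffic restored to $1%"; }

for pct in 25 50 75 100; do
  set_edge_weight "$pct"
  # sleep 300   # in production: wait and watch error rate + latency first
done
```

If error rate or latency regresses at any step, drop back to the previous weight rather than pushing through.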
Emergency: The System Is Oscillating¶
Auto-scaler is fighting with itself. Pod count goes up and down. Response times are unstable. Alerts fire and resolve in cycles.
1. Stabilize immediately:
- Set HPA to a fixed replica count (disable auto-scaling temporarily)
- kubectl scale deployment/<name> --replicas=<current-peak-count>
2. Let the system settle (5-10 minutes)
3. Identify the oscillation cause:
- Metric lag: scaling based on a metric that's too delayed
- Thrashing: scale-up/down thresholds too close together
- External cycle: traffic pattern with regular spikes
- Dependent oscillation: scaling one service changes load on another
4. Fix the feedback loop:
- Add stabilization windows (scaleDown: 5min)
- Increase the gap between scale-up and scale-down thresholds
- Use multiple metrics (CPU + RPS) to make scaling decisions
- Scale asymmetrically: fast up, slow down
5. Re-enable autoscaling with the new parameters
- Monitor for 24 hours before declaring it fixed