Mental Model Library
Thinking frameworks for production systems — the models that experienced engineers use to reason about behavior, diagnose problems, and make decisions under pressure.
These are not tool-specific skills. They transfer across any stack.
Index
| Category | Model | One-Liner | When to Use |
|---|---|---|---|
| Debugging | USE Method | Check Utilization, Saturation, Errors for every resource | First response to performance or capacity issues |
| Debugging | RED Method | Measure Rate, Errors, Duration for every service | Diagnosing microservice or API problems |
| Debugging | Five Whys | Ask "why" five times to reach root cause | Postmortems, recurring incidents, systemic bugs |
| Debugging | Bisect | Binary search for the breaking change | Finding which deploy, config, or commit caused a regression |
| Debugging | Differential Diagnosis | Enumerate and eliminate hypotheses systematically | Any ambiguous failure with multiple possible causes |
| Debugging | Correlation vs Causation | The deploy preceded the outage — but did it cause it? | Establishing causation under time pressure |
| System Behavior | Little's Law | L = λW — concurrency equals arrival rate × latency | Sizing thread pools, explaining latency spikes, capacity planning |
| System Behavior | Amdahl's Law | Speedup is bounded by the serial fraction | Evaluating parallelism investments, finding bottlenecks |
| System Behavior | CAP Theorem | Consistency, Availability, Partition Tolerance: when a partition occurs, choose consistency or availability | Choosing databases, understanding distributed system trade-offs |
| System Behavior | PACELC | Extends CAP: during partitions choose A/C, else choose L/C | More nuanced DB selection and replication design |
| System Behavior | Queueing Theory | Queue length explodes non-linearly as utilization approaches 100% | Capacity planning, autoscaling targets, thread pool sizing |
| System Behavior | Swiss Cheese Model | Failures need aligned holes through multiple defense layers | Understanding why incidents happen despite multiple safeguards |
| System Behavior | Blast Radius | How far does a single failure propagate? | Change management, permissions design, staged rollouts |
| System Behavior | Failure Domains | What fails together should fail together — and nothing else | Availability zone design, rack layout, network segmentation |
| System Behavior | Graceful Degradation | Shed load to preserve core function | Designing for overload; distinguishing must-have from nice-to-have |
| Operational Reasoning | OODA Loop | Observe, Orient, Decide, Act — tight loops shorten MTTR | Incident response cadence, on-call discipline |
| Operational Reasoning | Blameless Postmortem | Treat incidents as system failures, not human failures | Post-incident review; building learning organizations |
| Operational Reasoning | Error Budget | SLO math defines how much unreliability you can afford | Release vs reliability trade-off decisions |
| Operational Reasoning | Toil vs Automation ROI | Manual repetitive work that scales linearly is toil — automate it | Deciding when to automate vs accept operational cost |
| Operational Reasoning | Runbook-Driven Recovery | Codified response beats improvised heroics every time | Alert response design; reducing cognitive load under pressure |
| Operational Reasoning | Immutable Infrastructure | Replace, don't repair — eliminate configuration drift | Container and VM lifecycle, Terraform workflows |
| Operational Reasoning | Cattle vs Pets | Disposable numbered instances vs precious named servers | Cloud-native architecture mindset; disaster recovery |
| Operational Reasoning | Shift Left | Find problems in dev, not prod; a fix gets costlier the later the defect is found | CI/CD pipeline design, security and quality strategy |
| Architecture | 12-Factor App | The portable, scalable cloud-native application checklist | Designing or evaluating service deployability |
| Architecture | Strangler Fig | Incrementally migrate from legacy by building around it | Legacy modernization without big-bang rewrites |
| Architecture | Circuit Breaker | Fail fast to prevent cascade failures | Service-to-service calls; protecting downstream dependencies |
| Architecture | Bulkhead | Isolate resource pools so one consumer can't sink others | Multi-tenant systems; mixed-criticality service fan-out |
| Architecture | Sidecar Pattern | Attach functionality to a service without modifying it | Logging, metrics, proxy, auth — cross-cutting concerns |
| Architecture | Event Sourcing | Append-only event log as the source of truth | Audit trails, temporal queries, CQRS systems |
| Architecture | Idempotency | Retrying has the same effect as running once | Any at-least-once delivery system; automation, APIs, Terraform |
| Human Factors | Normalization of Deviance | Small violations become normalized when consequences are delayed | Explaining why known risks go unaddressed; cultural audits |
| Human Factors | Alert Fatigue | Too many alerts → all alerts ignored | Alerting strategy; explaining pager numbness |
| Human Factors | Hindsight Bias | The incident looks obvious only after you know the outcome | Postmortem facilitation; operator blame prevention |
| Human Factors | Automation Complacency | Trusting automation without verification degrades situational awareness | Reviewing automated pipelines; runbook testing |
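Two of the one-liners in the index reduce to arithmetic that is easy to check by hand. The sketch below is illustrative only: the function names `amdahl_speedup` and `error_budget_minutes` are hypothetical, and the 30-day window is an assumed default, not something this library prescribes.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only the parallel part scales with workers."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Amdahl's Law: 1 / (s + (1 - s) / N)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO allows over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    # The budget is simply the fraction of the window the SLO does not cover.
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    # 10% serial work on 16 workers, and a 99.9% SLO over 30 days.
    print(f"{amdahl_speedup(0.10, 16):.2f}x")          # 6.40x
    print(f"{error_budget_minutes(0.999):.1f} min")    # 43.2 min
```

With a 10% serial fraction the speedup can never reach 10x, however many workers are added, which is the practical point of the Amdahl's Law entry.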
How to Use This Library
In an incident
Reach for debugging models first (USE, RED, Differential Diagnosis), then system behavior models to explain why the system behaved that way.
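As a rough illustration of the RED lens, the sketch below summarizes a window of request records into Rate, Errors, and Duration. The `Request` record, the `red_summary` function, and the choice of a nearest-rank p95 for Duration are all assumptions made for this example; a real service would normally read these from its metrics system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Return rate, error ratio, and p95 duration for one observation window."""
    if not requests:
        raise ValueError("need at least one request")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)

    # Nearest-rank p95: the smallest value with at least 95% of samples at or
    # below it. Integer arithmetic avoids float rounding at the boundary.
    rank = (95 * n + 99) // 100
    p95 = durations[rank - 1]

    return {
        "rate_per_s": n / window_seconds,
        "error_ratio": errors / n,
        "p95_ms": p95,
    }


if __name__ == "__main__":
    # 100 requests over 10 seconds; the five slowest ones failed.
    sample = [Request(duration_ms=float(i), ok=i <= 95) for i in range(1, 101)]
    print(red_summary(sample, window_seconds=10.0))
```

Comparing the same three numbers before and after a suspect change gives the hypothesis list for Differential Diagnosis a concrete starting point.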
In a postmortem
Use Blameless Postmortem as the frame. Apply Hindsight Bias and Normalization of Deviance lenses when reviewing contributing factors. Use Five Whys to iterate from each contributing factor toward its root cause.
In design review
Architecture models (Circuit Breaker, Bulkhead) and system behavior models (Blast Radius, Failure Domains) apply before you build. Queueing Theory and Little's Law give you numbers.
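To make "give you numbers" concrete, here is a minimal sketch of two such calculations. `required_concurrency` applies Little's Law (L = λW) directly. `mm1_mean_in_system` assumes the textbook M/M/1 queue (one server, Poisson arrivals, exponential service times), which is a simplifying assumption chosen for illustration and will not match every real system. Both function names are hypothetical.

```python
def required_concurrency(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x latency."""
    if arrival_rate_per_s < 0 or latency_s < 0:
        raise ValueError("inputs must be non-negative")
    return arrival_rate_per_s * latency_s


def mm1_mean_in_system(utilization: float) -> float:
    """Mean number of requests in an M/M/1 system at a given utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    # M/M/1 result: L = rho / (1 - rho). It diverges as rho approaches 1.
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    # 200 requests/s at 250 ms each keeps about 50 requests in flight.
    print(required_concurrency(200, 0.25))

    # The same queue at rising utilization: growth is far from linear.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.2f} -> {mm1_mean_in_system(rho):6.1f} in system")
```

Under these assumptions, going from 50% to 90% utilization raises the mean number in the system from about 1 to about 9, and 99% raises it to about 99. This is one reason autoscaling targets are usually set well below full utilization.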
In a 1:1 or career conversation
Human Factors models explain why smart people make systematic errors. Operational Reasoning models explain what separates junior from senior SRE thinking.
Categories
| Category | Count | Focus |
|---|---|---|
| Debugging & Diagnosis | 6 | Finding the cause of problems |
| System Behavior | 9 | Understanding how systems act under load and failure |
| Operational Reasoning | 8 | How to run and improve systems over time |
| Architecture & Design | 7 | Structural patterns for resilience and maintainability |
| Human Factors | 4 | How people interact with complex systems |
Connections Map
The models are not independent — they form a web:
    USE Method ──────────────────── RED Method
        │                               │
        │    (infra vs service lens)    │
        └───────────┬───────────────────┘
                    │
    Queueing Theory ──── Little's Law ──── Amdahl's Law
                    │
    Graceful Degradation ── Circuit Breaker ── Bulkhead
                    │
    Failure Domains ──── Blast Radius ──── Swiss Cheese Model
                    │
        Normalization of Deviance
                    │
        Alert Fatigue ── Hindsight Bias
                    │
    Blameless Postmortem ── Five Whys
                    │
        OODA Loop ── Error Budget
Cross-References
Case studies that exercise multiple models:
- node-pressure-evictions: USE Method, Queueing Theory, Graceful Degradation, Cattle vs Pets
- disk-full-root-services-down: Swiss Cheese Model, Normalization of Deviance, Runbook-Driven Recovery
- firmware-update-boot-loop: Swiss Cheese Model, Idempotency, Blameless Postmortem, Automation Complacency
- coredns-timeout-pod-dns: RED Method, Circuit Breaker, CAP Theorem
- firewall-shadow-rule: Differential Diagnosis, Correlation vs Causation, Hindsight Bias