Mental Model Library

Thinking frameworks for production systems — the models that experienced engineers use to reason about behavior, diagnose problems, and make decisions under pressure.

These are not tool-specific skills. They transfer across any stack.

Index

Category	Model	One-Liner	When to Use
Debugging	USE Method	Check Utilization, Saturation, Errors for every resource	First response to performance or capacity issues
Debugging	RED Method	Measure Rate, Errors, Duration for every service	Diagnosing microservice or API problems
Debugging	Five Whys	Ask "why" five times to reach root cause	Postmortems, recurring incidents, systemic bugs
Debugging	Bisect	Binary search for the breaking change	Finding which deploy, config, or commit caused a regression
Debugging	Differential Diagnosis	Enumerate and eliminate hypotheses systematically	Any ambiguous failure with multiple possible causes
Debugging	Correlation vs Causation	The deploy preceded the outage — but did it cause it?	Establishing causation under time pressure
System Behavior	Little's Law	L = λW — concurrency equals arrival rate × latency	Sizing thread pools, explaining latency spikes, capacity planning
System Behavior	Amdahl's Law	Speedup is bounded by the serial fraction	Evaluating parallelism investments, finding bottlenecks
System Behavior	CAP Theorem	During a network partition, choose between Consistency and Availability	Choosing databases, understanding distributed system trade-offs
System Behavior	PACELC	Extends CAP: during partitions choose A/C, else choose L/C	More nuanced DB selection and replication design
System Behavior	Queueing Theory	Queue length explodes non-linearly as utilization approaches 100%	Capacity planning, autoscaling targets, thread pool sizing
System Behavior	Swiss Cheese Model	Failures need aligned holes through multiple defense layers	Understanding why incidents happen despite multiple safeguards
System Behavior	Blast Radius	How far does a single failure propagate?	Change management, permissions design, staged rollouts
System Behavior	Failure Domains	Components that share a fate should fail together, and nothing outside that group should fail with them	Availability zone design, rack layout, network segmentation
System Behavior	Graceful Degradation	Shed load to preserve core function	Designing for overload; distinguishing must-have from nice-to-have
Operational Reasoning	OODA Loop	Observe, Orient, Decide, Act — tight loops shorten MTTR	Incident response cadence, on-call discipline
Operational Reasoning	Blameless Postmortem	Treat incidents as system failures, not human failures	Post-incident review; building learning organizations
Operational Reasoning	Error Budget	SLO math defines how much unreliability you can afford	Release vs reliability trade-off decisions
Operational Reasoning	Toil vs Automation ROI	Manual repetitive work that scales linearly is toil — automate it	Deciding when to automate vs accept operational cost
Operational Reasoning	Runbook-Driven Recovery	Codified response beats improvised heroics every time	Alert response design; reducing cognitive load under pressure
Operational Reasoning	Immutable Infrastructure	Replace, don't repair — eliminate configuration drift	Container and VM lifecycle, Terraform workflows
Operational Reasoning	Cattle vs Pets	Disposable numbered instances vs precious named servers	Cloud-native architecture mindset; disaster recovery
Operational Reasoning	Shift Left	Find problems in dev, not prod; a fix costs more the later the problem is found	CI/CD pipeline design, security and quality strategy
Architecture	12-Factor App	The portable, scalable cloud-native application checklist	Designing or evaluating service deployability
Architecture	Strangler Fig	Incrementally migrate from legacy by building around it	Legacy modernization without big-bang rewrites
Architecture	Circuit Breaker	Fail fast to prevent cascade failures	Service-to-service calls; protecting downstream dependencies
Architecture	Bulkhead	Isolate resource pools so one consumer can't sink others	Multi-tenant systems; mixed-criticality service fan-out
Architecture	Sidecar Pattern	Attach functionality to a service without modifying it	Logging, metrics, proxy, auth — cross-cutting concerns
Architecture	Event Sourcing	Append-only event log as the source of truth	Audit trails, temporal queries, CQRS systems
Architecture	Idempotency	Repeating an operation has the same effect as doing it once, so retries are safe	Any at-least-once delivery system; automation, APIs, Terraform
Human Factors	Normalization of Deviance	Small violations become normalized when consequences are delayed	Explaining why known risks go unaddressed; cultural audits
Human Factors	Alert Fatigue	Too many alerts → all alerts ignored	Alerting strategy; explaining pager numbness
Human Factors	Hindsight Bias	The incident looks obvious only after you know the outcome	Postmortem facilitation; operator blame prevention
Human Factors	Automation Complacency	Trusting automation without verification degrades situational awareness	Reviewing automated pipelines; runbook testing
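
Two of the one-liners in the index reduce to short arithmetic. The sketch below is illustrative only: the function names, the 30-day window, and the sample inputs are assumptions rather than part of this library, and the error budget is expressed as minutes of full unavailability under a time-based SLO.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: overall speedup when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based SLO tolerates in the window.

    The 30-day default window is an assumption for illustration.
    """
    return (1.0 - slo) * window_days * 24 * 60


# 90% parallel work on 10 workers: the serial 10% caps the gain.
print(round(amdahl_speedup(0.9, 10), 2))      # 5.26

# A 99.9% SLO over 30 days leaves about 43 minutes of budget.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

However many workers are added, the first example can never exceed a 10x speedup, because the serial fraction is 0.1.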

How to Use This Library

In an incident

Reach for debugging models first (USE, RED, Differential Diagnosis), then system behavior models to explain why the system behaved that way.
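
As a rough illustration of the RED Method, the sketch below summarizes Rate, Errors, and Duration for one service from a list of request records. The `Request` type, the choice to count only 5xx responses as errors, and the nearest-rank p95 are assumptions made for this example; in practice these numbers usually come from a metrics system rather than raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code (hypothetical record shape)


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors, Duration for one service over one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95 using integer math: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# 19 successful requests plus one slow failure, observed over 10 seconds.
sample = [Request(10.0 * i, 200) for i in range(1, 20)] + [Request(900.0, 503)]
print(red_summary(sample, window_seconds=10.0))
```

Reading the three numbers side by side is the point: a rate drop with flat errors suggests an upstream problem, while rising duration with flat rate points at the service or its dependencies.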

In a postmortem

Use Blameless Postmortem as the frame. Apply the Hindsight Bias and Normalization of Deviance lenses when reviewing contributing factors. Use Five Whys to work from each contributing factor back toward a root cause.

In design review

Architecture models (Circuit Breaker, Bulkhead) and system behavior models (Blast Radius, Failure Domains) apply before you build. Queueing Theory and Little's Law give you numbers to size against.
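
A minimal sketch of the kind of numbers these two models produce. Little's Law holds for any stable system; the queue formula assumes an M/M/1 model (random arrivals, one server), which is a simplification chosen here for illustration and will not match every real service. Function names and sample values are hypothetical.

```python
def littles_law_concurrency(arrival_rate_per_s: float, latency_s: float) -> float:
    """L = λW: average number of requests in flight."""
    return arrival_rate_per_s * latency_s


def mm1_mean_in_system(utilization: float) -> float:
    """Mean number of requests in an M/M/1 system: ρ / (1 - ρ).

    Assumes a single server with random (Poisson) arrivals.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


# 200 requests/s at 250 ms average latency needs about 50 concurrent slots.
print(littles_law_concurrency(200, 0.25))  # 50.0

# Requests in the system grow sharply as utilization approaches 100%.
for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {rho:.2f} -> about {mm1_mean_in_system(rho):.0f} in system")
```

Under these assumptions, moving from 90% to 99% utilization takes the average from about 9 requests in the system to about 99, which is why autoscaling targets are usually set well below full utilization.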

In a 1:1 or career conversation

Human Factors models explain why smart people make systematic errors. Operational Reasoning models explain what separates junior from senior SRE thinking.

Categories

Category	Count	Focus
Debugging & Diagnosis	6	Finding the cause of problems
System Behavior	9	Understanding how systems act under load and failure
Operational Reasoning	8	How to run and improve systems over time
Architecture & Design	7	Structural patterns for resilience and maintainability
Human Factors	4	How people interact with complex systems

Connections Map

The models are not independent — they form a web:

 USE Method ──────────────────── RED Method
     │                               │
     │  (infra vs service lens)       │
     └───────────┬───────────────────┘
                 │
          Queueing Theory ──── Little's Law ──── Amdahl's Law
                 │
          Graceful Degradation ── Circuit Breaker ── Bulkhead
                 │
          Failure Domains ──── Blast Radius ──── Swiss Cheese Model
                                                       │
                                              Normalization of Deviance
                                                       │
                                              Alert Fatigue ── Hindsight Bias
                                                       │
                                              Blameless Postmortem ── Five Whys
                                                       │
                                              OODA Loop ── Error Budget

Cross-References

Case studies that exercise multiple models:

- node-pressure-evictions: USE Method, Queueing Theory, Graceful Degradation, Cattle vs Pets
- disk-full-root-services-down: Swiss Cheese Model, Normalization of Deviance, Runbook-Driven Recovery
- firmware-update-boot-loop: Swiss Cheese Model, Idempotency, Blameless Postmortem, Automation Complacency
- coredns-timeout-pod-dns: RED Method, Circuit Breaker, CAP Theorem
- firewall-shadow-rule: Differential Diagnosis, Correlation vs Causation, Hindsight Bias