
Mental Model Library

Thinking frameworks for production systems — the models that experienced engineers use to reason about behavior, diagnose problems, and make decisions under pressure.

These are not tool-specific skills. They transfer across any stack.


Index

| Category | Model | One-Liner | When to Use |
| --- | --- | --- | --- |
| Debugging | USE Method | Check Utilization, Saturation, Errors for every resource | First response to performance or capacity issues |
| Debugging | RED Method | Measure Rate, Errors, Duration for every service | Diagnosing microservice or API problems |
| Debugging | Five Whys | Ask "why" five times to reach root cause | Postmortems, recurring incidents, systemic bugs |
| Debugging | Bisect | Binary search for the breaking change | Finding which deploy, config, or commit caused a regression |
| Debugging | Differential Diagnosis | Enumerate and eliminate hypotheses systematically | Any ambiguous failure with multiple possible causes |
| Debugging | Correlation vs Causation | The deploy preceded the outage, but did it cause it? | Establishing causation under time pressure |
| System Behavior | Little's Law | L = λW: concurrency equals arrival rate × latency | Sizing thread pools, explaining latency spikes, capacity planning |
| System Behavior | Amdahl's Law | Speedup is bounded by the serial fraction | Evaluating parallelism investments, finding bottlenecks |
| System Behavior | CAP Theorem | Under a network partition, choose consistency or availability | Choosing databases, understanding distributed system trade-offs |
| System Behavior | PACELC | Extends CAP: if Partitioned trade A vs C, Else trade Latency vs Consistency | More nuanced DB selection and replication design |
| System Behavior | Queueing Theory | Queue length explodes non-linearly as utilization approaches 100% | Capacity planning, autoscaling targets, thread pool sizing |
| System Behavior | Swiss Cheese Model | Failures need aligned holes through multiple defense layers | Understanding why incidents happen despite multiple safeguards |
| System Behavior | Blast Radius | How far does a single failure propagate? | Change management, permissions design, staged rollouts |
| System Behavior | Failure Domains | What fails together should fail together, and nothing else | Availability zone design, rack layout, network segmentation |
| System Behavior | Graceful Degradation | Shed load to preserve core function | Designing for overload; distinguishing must-have from nice-to-have |
| Operational Reasoning | OODA Loop | Observe, Orient, Decide, Act: tight loops shorten MTTR | Incident response cadence, on-call discipline |
| Operational Reasoning | Blameless Postmortem | Treat incidents as system failures, not human failures | Post-incident review; building learning organizations |
| Operational Reasoning | Error Budget | SLO math defines how much unreliability you can afford | Release vs reliability trade-off decisions |
| Operational Reasoning | Toil vs Automation ROI | Manual repetitive work that scales linearly is toil; automate it | Deciding when to automate vs accept operational cost |
| Operational Reasoning | Runbook-Driven Recovery | Codified response beats improvised heroics every time | Alert response design; reducing cognitive load under pressure |
| Operational Reasoning | Immutable Infrastructure | Replace, don't repair: eliminate configuration drift | Container and VM lifecycle, Terraform workflows |
| Operational Reasoning | Cattle vs Pets | Disposable numbered instances vs precious named servers | Cloud-native architecture mindset; disaster recovery |
| Operational Reasoning | Shift Left | Find problems in dev, not prod; cost grows exponentially the later a defect is found | CI/CD pipeline design, security and quality strategy |
| Architecture | 12-Factor App | The portable, scalable cloud-native application checklist | Designing or evaluating service deployability |
| Architecture | Strangler Fig | Incrementally migrate from legacy by building around it | Legacy modernization without big-bang rewrites |
| Architecture | Circuit Breaker | Fail fast to prevent cascade failures | Service-to-service calls; protecting downstream dependencies |
| Architecture | Bulkhead | Isolate resource pools so one consumer can't sink others | Multi-tenant systems; mixed-criticality service fan-out |
| Architecture | Sidecar Pattern | Attach functionality to a service without modifying it | Logging, metrics, proxy, auth: cross-cutting concerns |
| Architecture | Event Sourcing | Append-only event log as the source of truth | Audit trails, temporal queries, CQRS systems |
| Architecture | Idempotency | Safe to retry without side effects | Any at-least-once delivery system; automation, APIs, Terraform |
| Human Factors | Normalization of Deviance | Small violations become normalized when consequences are delayed | Explaining why known risks go unaddressed; cultural audits |
| Human Factors | Alert Fatigue | Too many alerts → all alerts ignored | Alerting strategy; explaining pager numbness |
| Human Factors | Hindsight Bias | The incident looks obvious only after you know the outcome | Postmortem facilitation; operator blame prevention |
| Human Factors | Automation Complacency | Trusting automation without verification degrades situational awareness | Reviewing automated pipelines; runbook testing |
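Several of these models reduce to a line or two of arithmetic. The Error Budget entry, for example, is just SLO math: a 99.9% availability target over 30 days leaves about 43 minutes of permitted downtime. A minimal sketch; the SLO value and window are illustrative, not a recommendation:

```python
# Error budget: how much unreliability a given SLO allows over a window.
# The 99.9% target and 30-day window below are illustrative assumptions.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes of downtime budget per 30 days")  # → 43.2
```

If the budget for the window is already spent, the model says to trade feature velocity for reliability work until it recovers.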

How to Use This Library

In an incident

Reach for debugging models first (USE, RED, Differential Diagnosis), then system behavior models to explain why the system behaved that way.
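The RED Method is concrete enough to sketch in a few lines: given a window of request records, compute Rate, Errors, and Duration. The `(status, latency_ms)` record format and the sample values here are illustrative assumptions:

```python
from statistics import quantiles

# RED Method sketch: Rate, Errors, Duration for one service over one window.
# The (status_code, latency_ms) tuples and 60 s window are illustrative.
requests = [(200, 12.0), (200, 15.0), (500, 240.0), (200, 11.0), (503, 310.0)]
window_seconds = 60

rate = len(requests) / window_seconds                              # Rate: req/s
error_ratio = sum(1 for s, _ in requests if s >= 500) / len(requests)  # Errors
latencies = [ms for _, ms in requests]
p95 = quantiles(latencies, n=20, method="inclusive")[-1]           # Duration: ~p95

print(f"rate={rate:.2f}/s  errors={error_ratio:.0%}  p95={p95:.0f}ms")
```

In practice these three numbers come from your metrics backend per service, but the definitions are exactly this simple.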

In a postmortem

Use Blameless Postmortem as the frame. Apply the Hindsight Bias and Normalization of Deviance lenses when reviewing contributing factors, and iterate toward root causes with Five Whys.

In design review

Architecture models (Circuit Breaker, Bulkhead, Blast Radius, Failure Domains) apply before you build. Queueing Theory and Little's Law give you numbers.
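Both laws really are one-liners. Little's Law turns arrival rate and latency into in-flight concurrency; the M/M/1 residence-time formula T = S / (1 − ρ) shows why latency blows up as utilization approaches 1. A sketch with illustrative traffic numbers:

```python
# Little's Law: L = λW. At 200 req/s (λ) with 50 ms mean latency (W),
# about 10 requests are in flight at any instant. Numbers are illustrative.
arrival_rate = 200        # requests per second (λ)
latency = 0.050           # seconds per request (W)
in_flight = arrival_rate * latency   # L

# M/M/1 queueing: mean residence time T = S / (1 - ρ) for mean service
# time S and utilization ρ. Note the non-linear blow-up as ρ → 1.
service_time = 0.050
for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    t = service_time / (1 - utilization)
    print(f"ρ={utilization:.2f}  mean residence time ≈ {t * 1000:6.0f} ms")
```

This is why autoscaling targets are set well below 100% utilization: the last few points of headroom are where the queue lives.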

In a 1:1 or career conversation

Human Factors models explain why smart people make systematic errors. Operational Reasoning models explain what separates junior from senior SRE thinking.


Categories

| Category | Count | Focus |
| --- | --- | --- |
| Debugging & Diagnosis | 6 | Finding the cause of problems |
| System Behavior | 9 | Understanding how systems act under load and failure |
| Operational Reasoning | 8 | How to run and improve systems over time |
| Architecture & Design | 7 | Structural patterns for resilience and maintainability |
| Human Factors | 4 | How people interact with complex systems |

Connections Map

The models are not independent — they form a web:

    USE Method ─────────── RED Method          (infra vs service lens)
    Queueing Theory ────── Little's Law ────── Amdahl's Law
    Graceful Degradation ─ Circuit Breaker ─── Bulkhead
    Failure Domains ────── Blast Radius ────── Swiss Cheese Model ── Normalization of Deviance
    Alert Fatigue ──────── Hindsight Bias
    Blameless Postmortem ─ Five Whys
    OODA Loop ──────────── Error Budget
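The Graceful Degradation ── Circuit Breaker ── Bulkhead cluster shares one mechanism: fail fast rather than queue behind a sick dependency. A minimal circuit breaker sketch; the class name, threshold, and cooldown are illustrative assumptions, and production code would use a hardened library instead:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, then probe.

    Threshold and cooldown values are illustrative assumptions.
    """

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds to wait before a probe call
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Open: fail fast instead of tying up a thread on a sick dep.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: cooldown elapsed, let one probe call through.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            raise
        self.failures = 0            # any success closes the circuit
        return result
```

Bulkheads complement this by giving each dependency its own pool, so an open circuit on one does not starve the others.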

Cross-References

Case studies that exercise multiple models:

- node-pressure-evictions: USE Method, Queueing Theory, Graceful Degradation, Cattle vs Pets
- disk-full-root-services-down: Swiss Cheese Model, Normalization of Deviance, Runbook-Driven Recovery
- firmware-update-boot-loop: Swiss Cheese Model, Idempotency, Blameless Postmortem, Automation Complacency
- coredns-timeout-pod-dns: RED Method, Circuit Breaker, CAP Theorem
- firewall-shadow-rule: Differential Diagnosis, Correlation vs Causation, Hindsight Bias