Decision Trees
Flowcharts for common operational decisions. Start at the top, follow the branches, arrive at the right action.
Designed for: incident response, architecture decisions, operational judgment calls, and security response. Pull one up during an incident, pin one to the wall, or work through one during an on-call rotation.
How to read: Each node is a check. Its Yes/No branches lead to the next check, to a terminal action (✅ do this), or to an escalation (⚠️ get help).
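The node types described above map onto a small data structure: checks are internal nodes with one branch per answer, and actions and escalations are leaves. The following is a minimal sketch in Python, assuming every check is a strict yes/no question. The class names `Check`, `Action`, `Escalate` and the `walk` function are hypothetical, and so are the questions in the example tree, which is not one of the trees listed on this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Action:
    """Terminal node: do this."""
    text: str


@dataclass(frozen=True)
class Escalate:
    """Terminal node: get help."""
    text: str


@dataclass(frozen=True)
class Check:
    """Internal node: a yes/no question with one branch per answer."""
    question: str
    yes: Node
    no: Node


Node = Union[Check, Action, Escalate]


def walk(node: Node, answer: Callable[[str], bool]) -> tuple[list[str], Node]:
    """Start at the top and follow Yes/No branches until a terminal node.

    Returns the questions asked (the diagnostic path) and the terminal node.
    """
    path: list[str] = []
    while isinstance(node, Check):
        path.append(node.question)
        node = node.yes if answer(node.question) else node.no
    return path, node


# Hypothetical tree for illustration only; not one of the trees on this page.
tree = Check(
    "Did the problem start right after a deploy?",
    yes=Action("Roll back the deploy"),
    no=Check(
        "Is a single pod affected?",
        yes=Action("Inspect that pod's logs and events"),
        no=Escalate("Page the service owner"),
    ),
)

# Hypothetical answers supplied by the on-call engineer.
answers = {
    "Did the problem start right after a deploy?": False,
    "Is a single pod affected?": False,
}
path, result = walk(tree, answers.__getitem__)
print(path)    # ['Did the problem start right after a deploy?', 'Is a single pod affected?']
print(result)  # Escalate(text='Page the service owner')
```

Keeping terminal actions and escalations as distinct types makes the "get help" outcome explicit, so a caller can tell the two kinds of leaf apart without parsing text.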
Incident Triage
Start here when something is broken and you need to find the cause fast.
| Tree | Starting Question |
|---|---|
| 5xx Errors | "My service is returning 5xx errors" |
| Latency Spike | "Response latency has increased" |
| Pod Won't Start | "A pod is stuck and won't start" |
| Node NotReady | "A Kubernetes node is in NotReady state" |
| Alert Fired | "An alert fired — is this real?" |
| Disk Filling Up | "Disk usage is high or growing" |
| Memory Usage High | "Memory usage is high — why?" |
| Deployment Stuck | "A deployment is stuck and won't roll out" |
Architecture Decisions
Start here when you're choosing a design approach and want to avoid cargo-culting the wrong pattern.
| Tree | Starting Question |
|---|---|
| Service Mesh | "Do we need a service mesh?" |
| Managed vs Self-Hosted | "Should we use managed or self-host?" |
| Monolith vs Microservices | "Should we decompose into microservices?" |
| Which Database | "What type of database should we use?" |
| Sync vs Async | "Should service A call service B synchronously?" |
| Where to Run | "Where should this workload run?" |
Operational Decisions
Start here when you need to make a judgment call under time pressure.
| Tree | Starting Question |
|---|---|
| Roll Back or Fix Forward | "Should I roll back or fix forward?" |
| Should I Page | "Should I wake someone up for this?" |
| Should I Automate | "Should I automate this manual task?" |
| Config Change | "How do I handle this config change safely?" |
| Cert Expiring | "A certificate is expiring — what do I do?" |
| Scale or Optimize | "Should I scale up or optimize first?" |
Security Response
Start here when you've found a security issue and need to decide how to act.
| Tree | Starting Question |
|---|---|
| Found a Vulnerability | "I found a security vulnerability" |
| Secret Exposed | "A secret was accidentally exposed" |
| Suspicious Activity | "Something looks like a security incident" |
| Container Running as Root | "A container is running as root — risk?" |
| Dependency CVE | "A dependency has a published CVE" |
Cross-References
These trees complement other training content:
- Runbooks (runbooks/): trees route you to the right runbook; the runbook tells you how to carry out the fix.
- Topic Packs (topics/): trees reference topic packs for deeper reading.
- Case Studies (case-studies/): trees encode the diagnostic paths used in incident case studies.