Cross-Domain Incident Case Studies¶
20 complex incidents that each span exactly 3 domains. Each incident starts with a misleading symptom in Domain A, requires investigation in Domain B to find the root cause, and needs remediation in Domain C.
Incident Matrix¶
Domain Coverage¶
| Domain | Appearances | Incidents |
|---|---|---|
| kubernetes_ops | 10 | 1, 2, 4, 6, 7, 9, 10, 11, 14, 15, 16, 17, 18, 20 |
| networking | 7 | 1, 5, 7, 8, 11, 12, 14, 19, 20 |
| observability | 7 | 2, 3, 5, 7, 10, 13, 14, 18 |
| devops_tooling | 7 | 2, 3, 4, 6, 9, 13, 15, 19, 20 |
| security | 5 | 1, 5, 6, 11, 15, 17 |
| linux_ops | 6 | 3, 8, 9, 10, 12, 16, 18, 19 |
| cloud | 3 | 8, 13, 17 |
| datacenter_ops | 3 | 4, 12, 16 |
How to Use¶
Each incident directory contains 5 files:
symptoms.md— Start here. Read the initial alert and observable symptoms. Try to form a hypothesis before continuing.questions.md— 5 diagnostic questions to test your reasoning. Answer these before reading the investigation.investigation.md— The full investigation path: Domain A dead end, the pivot clue, and Domain B root cause.remediation.md— The fix in Domain C, verification across all 3 domains, and prevention measures.grading.md— Rubric for self-assessment and prerequisite topic packs.
Difficulty Levels¶
- L2 (12 incidents): Requires solid fundamentals in all 3 domains. A mid-level engineer with 2-3 years experience should be able to solve these with some guidance.
- L3 (8 incidents): Requires deep knowledge and experience with cross-domain failure modes. Senior/staff-level troubleshooting.
Pages that link here¶
- Case Studies
- Production Readiness Review: Study Plans
- Symptoms: API Latency Spike, BGP Route Leak, Fix Is Network ACL
- Symptoms: Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning
- Symptoms: Ansible Playbook Hangs, SSH Agent Forwarding Broken, Root Cause Is Firewall Rule
- Symptoms: Backup Job Failing, iSCSI Target Unreachable, Fix Is VLAN Config
- Symptoms: CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC
- Symptoms: Canary Deploy Looks Healthy, Actually Routing to Wrong Backend, Ingress Misconfigured
- Symptoms: Container Image Vuln Scanner False Positive, Blocks Deploy Pipeline
- Symptoms: DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager
- Symptoms: Database Replication Lag, Root Cause Is RAID Degradation
- Symptoms: Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation
- Symptoms: Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention
- Symptoms: Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy
- Symptoms: HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config