Cross-Domain Incident Case Studies

20 complex incidents that each span exactly 3 domains. Each incident starts with a misleading symptom in Domain A, requires investigation in Domain B to find the root cause, and needs remediation in Domain C.

Incident Matrix

| # | Incident | Domain A (Symptom) | Domain B (Root Cause) | Domain C (Fix) | Level |
|---|----------|--------------------|-----------------------|----------------|-------|
| 1 | DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager | networking | security | kubernetes_ops | L2 |
| 2 | Pod OOMKilled, Memory Leak Is in Sidecar, Fix Is Helm Values | kubernetes_ops | observability | devops_tooling | L2 |
| 3 | Disk Full Alert, Cause Is Runaway Logs, Fix Is Loki Retention | linux_ops | observability | devops_tooling | L2 |
| 4 | Node NotReady, NIC Firmware Bug, Fix Is Ansible Playbook | kubernetes_ops | datacenter_ops | devops_tooling | L3 |
| 5 | API Latency Spike, BGP Route Leak, Fix Is Network ACL | observability | networking | security | L3 |
| 6 | Deployment Stuck, ImagePull Auth Failure, Fix Is Vault Secret Rotation | kubernetes_ops | security | devops_tooling | L2 |
| 7 | Grafana Dashboard Empty, Prometheus Scrape Blocked by NetworkPolicy | observability | kubernetes_ops | networking | L2 |
| 8 | SSH Timeout, MTU Mismatch, Fix Is Terraform Variable | linux_ops | networking | cloud | L2 |
| 9 | CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC | devops_tooling | linux_ops | kubernetes_ops | L2 |
| 10 | HPA Flapping, Metrics Server Clock Skew, Fix Is NTP Config | kubernetes_ops | observability | linux_ops | L2 |
| 11 | Service Mesh 503s, Envoy Misconfigured, Root Cause Is RBAC Policy | networking | kubernetes_ops | security | L3 |
| 12 | Backup Job Failing, iSCSI Target Unreachable, Fix Is VLAN Config | linux_ops | datacenter_ops | networking | L3 |
| 13 | Terraform Apply Fails, State Lock Stuck, Root Cause Is DynamoDB Throttle | devops_tooling | cloud | observability | L2 |
| 14 | Alert Storm, Caused by Flapping Health Checks, Fix Is Probe Tuning | observability | networking | kubernetes_ops | L2 |
| 15 | Container Image Vuln Scanner False Positive, Blocks Deploy Pipeline | security | devops_tooling | kubernetes_ops | L2 |
| 16 | Database Replication Lag, Root Cause Is RAID Degradation | kubernetes_ops | linux_ops | datacenter_ops | L3 |
| 17 | User Auth Failing, OIDC Cert Expired, Fix Is Cloud KMS Rotation | security | kubernetes_ops | cloud | L3 |
| 18 | Job Queue Backlog, Worker Pod CPU Throttled, Fix Is cgroup Config | observability | kubernetes_ops | linux_ops | L2 |
| 19 | Ansible Playbook Hangs, SSH Agent Forwarding Broken, Root Cause Is Firewall Rule | devops_tooling | linux_ops | networking | L2 |
| 20 | Canary Deploy Looks Healthy, Actually Routing to Wrong Backend, Ingress Misconfigured | devops_tooling | networking | kubernetes_ops | L3 |

Domain Coverage

| Domain | Appearances | Incidents |
|--------|-------------|-----------|
| kubernetes_ops | 10 | 1, 2, 4, 6, 7, 9, 10, 11, 14, 15, 16, 17, 18, 20 |
| networking | 7 | 1, 5, 7, 8, 11, 12, 14, 19, 20 |
| observability | 7 | 2, 3, 5, 7, 10, 13, 14, 18 |
| devops_tooling | 7 | 2, 3, 4, 6, 9, 13, 15, 19, 20 |
| security | 5 | 1, 5, 6, 11, 15, 17 |
| linux_ops | 6 | 3, 8, 9, 10, 12, 16, 18, 19 |
| cloud | 3 | 8, 13, 17 |
| datacenter_ops | 3 | 4, 12, 16 |

How to Use

Each incident directory contains 5 files:

  1. symptoms.md — Start here. Read the initial alert and observable symptoms. Try to form a hypothesis before continuing.
  2. questions.md — 5 diagnostic questions to test your reasoning. Answer these before reading the investigation.
  3. investigation.md — The full investigation path: Domain A dead end, the pivot clue, and Domain B root cause.
  4. remediation.md — The fix in Domain C, verification across all 3 domains, and prevention measures.
  5. grading.md — Rubric for self-assessment and prerequisite topic packs.

Difficulty Levels

  • L2 (12 incidents): Requires solid fundamentals in all 3 domains. A mid-level engineer with 2-3 years of experience should be able to solve these with some guidance.
  • L3 (8 incidents): Requires deep knowledge and experience with cross-domain failure modes. Senior/staff-level troubleshooting.