Ops Archaeology: Reverse-Engineering Production Systems
Fifteen exercises that drop you into an unfamiliar system with no documentation.
Your only clues are four partial artifacts: CLI output, metrics, infrastructure code, and log lines.
Reconstruct what the system does, diagnose what's broken, and propose a fix.
How to Use
1. Read briefing.md and absorb the artifacts without jumping to conclusions
2. Set a timer for the estimated duration
3. Write down your reconstruction, diagnosis, and proposed fix
4. Check hints.md if you're stuck (progressive, timed)
5. Compare against answer_key.md when done
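For the timer step, the estimated duration follows from the exercise number. The sketch below is a minimal illustration, assuming the tier durations given under Difficulty Tiers (15, 25, and 40 minutes); the names `TIERS` and `time_budget` are hypothetical and not part of the exercise set.

```python
# Hypothetical helper: the names here are illustrative only.
# Each entry maps a range of exercise numbers to (tier, minutes per exercise).
TIERS = [
    (range(1, 6), "L1", 15),    # exercises 01-05
    (range(6, 11), "L2", 25),   # exercises 06-10
    (range(11, 16), "L3", 40),  # exercises 11-15
]


def time_budget(exercise: int) -> tuple[str, int]:
    """Return (tier, minutes) for an exercise number from 1 to 15."""
    for numbers, tier, minutes in TIERS:
        if exercise in numbers:
            return tier, minutes
    raise ValueError(f"exercise must be between 1 and 15, got {exercise}")


print(time_budget(7))  # prints ('L2', 25)
```

Keeping the tier boundaries in one table means a change to the estimated durations needs only a single edit.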
Difficulty Tiers
L1 — Single Service, Obvious Failure (15 min each)
| # | Exercise | Domain |
|---|----------|--------|
| 01 | Redis OOM CrashLoop | Kubernetes, Redis, Resource Management |
| 02 | Systemd Permission Denied | Linux, Systemd, Ansible, File Permissions |
| 03 | Docker Exec Format Error | Docker, Container Images, CI/CD |
| 04 | Postgres Replica Lag | Kubernetes, PostgreSQL, Terraform, Networking |
| 05 | Nginx 502 Backend | Kubernetes, Ingress, Helm, Networking |
L2 — Multi-Service, Subtle Failure (25 min each)
| # | Exercise | Domain |
|---|----------|--------|
| 06 | Service Mesh Silent Drops | Kubernetes, Service Mesh, Observability |
| 07 | Stale Image Tag Deployment | Kubernetes, ArgoCD, Container Registry, CI/CD |
| 08 | Job Wrong Database | Kubernetes, Jobs, ConfigMaps, Database Ops |
| 09 | Monitoring Gap | Prometheus, Alertmanager, Grafana, Observability |
| 10 | Intermittent DNS Failures | Kubernetes, CoreDNS, Networking |
L3 — Distributed System, Misleading Artifacts (40 min each)
| # | Exercise | Domain |
|---|----------|--------|
| 11 | DR Failover Silently Broken | Multi-Cluster, Route53, Database, Disaster Recovery |
| 12 | TLS Chain Incomplete | TLS/PKI, Cert-Manager, Ingress, API Clients |
| 13 | Quota Masquerading as Scheduling | Kubernetes, Resource Quotas, HPA, Scheduling |
| 14 | Split-Brain After Partition | etcd, Distributed Systems, Terraform State |
| 15 | Cascading Debug Log Failure | Logging, Disk I/O, Observability, Change Management |