Ops Archaeology: Reverse-Engineering Production Systems

Fifteen exercises, each dropping you into an unfamiliar system with no documentation. Your only clues are four partial artifacts: CLI output, metrics, infrastructure code, and log lines. Your task is to reconstruct what the system does, diagnose what's broken, and propose a fix.

How to Use

1. Read briefing.md and absorb the artifacts without jumping to conclusions.
2. Set a timer for the exercise's estimated duration.
3. Write down your reconstruction, diagnosis, and proposed fix.
4. Check hints.md if you're stuck (hints are progressive and timed).
5. Compare your answer against answer_key.md when done.

Difficulty Tiers

L1: Single Service, Obvious Failure (15 min each)

#	Exercise	Domain
01	Redis OOM CrashLoop	Kubernetes, Redis, Resource Management
02	Systemd Permission Denied	Linux, Systemd, Ansible, File Permissions
03	Docker Exec Format Error	Docker, Container Images, CI/CD
04	Postgres Replica Lag	Kubernetes, PostgreSQL, Terraform, Networking
05	Nginx 502 Backend	Kubernetes, Ingress, Helm, Networking

L2: Multi-Service, Subtle Failure (25 min each)

#	Exercise	Domain
06	Service Mesh Silent Drops	Kubernetes, Service Mesh, Observability
07	Stale Image Tag Deployment	Kubernetes, ArgoCD, Container Registry, CI/CD
08	Job Wrong Database	Kubernetes, Jobs, ConfigMaps, Database Ops
09	Monitoring Gap	Prometheus, Alertmanager, Grafana, Observability
10	Intermittent DNS Failures	Kubernetes, CoreDNS, Networking

L3: Distributed System, Misleading Artifacts (40 min each)

#	Exercise	Domain
11	DR Failover Silently Broken	Multi-Cluster, Route53, Database, Disaster Recovery
12	TLS Chain Incomplete	TLS/PKI, Cert-Manager, Ingress, API Clients
13	Quota Masquerading as Scheduling	Kubernetes, Resource Quotas, HPA, Scheduling
14	Split-Brain After Partition	etcd, Distributed Systems, Terraform State
15	Cascading Debug Log Failure	Logging, Disk I/O, Observability, Change Management