Ops Archaeology: Reverse-Engineering Production Systems

15 exercises where you're dropped into an unfamiliar system with no documentation. Your only clues are 4 partial artifacts: CLI output, metrics, infrastructure code, and log lines. Reconstruct what the system does, diagnose what's broken, and propose a fix.

How to Use

  1. Read briefing.md and absorb the artifacts without jumping to conclusions
  2. Set a timer for the estimated duration
  3. Write down your reconstruction, diagnosis, and proposed fix
  4. Check hints.md if you're stuck (progressive, timed)
  5. Compare against answer_key.md when done

Difficulty Tiers

L1 — Single Service, Obvious Failure (15 min each)

#  | Exercise | Domain
01 | Redis OOM CrashLoop | Kubernetes, Redis, Resource Management
02 | Systemd Permission Denied | Linux, Systemd, Ansible, File Permissions
03 | Docker Exec Format Error | Docker, Container Images, CI/CD
04 | Postgres Replica Lag | Kubernetes, PostgreSQL, Terraform, Networking
05 | Nginx 502 Backend | Kubernetes, Ingress, Helm, Networking

L2 — Multi-Service, Subtle Failure (25 min each)

#  | Exercise | Domain
06 | Service Mesh Silent Drops | Kubernetes, Service Mesh, Observability
07 | Stale Image Tag Deployment | Kubernetes, ArgoCD, Container Registry, CI/CD
08 | Job Wrong Database | Kubernetes, Jobs, ConfigMaps, Database Ops
09 | Monitoring Gap | Prometheus, Alertmanager, Grafana, Observability
10 | Intermittent DNS Failures | Kubernetes, CoreDNS, Networking

L3 — Distributed System, Misleading Artifacts (40 min each)

#  | Exercise | Domain
11 | DR Failover Silently Broken | Multi-Cluster, Route53, Database, Disaster Recovery
12 | TLS Chain Incomplete | TLS/PKI, Cert-Manager, Ingress, API Clients
13 | Quota Masquerading as Scheduling | Kubernetes, Resource Quotas, HPA, Scheduling
14 | Split-Brain After Partition | etcd, Distributed Systems, Terraform State
15 | Cascading Debug Log Failure | Logging, Disk I/O, Observability, Change Management
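
For step 2 of How to Use, the timer duration follows from the exercise number. The sketch below is a small illustrative helper, not part of the exercise set: the names `TIER_MINUTES`, `tier_for`, and `time_budget` are hypothetical, and it assumes the numbering stays as listed above (01-05 in L1, 06-10 in L2, 11-15 in L3).

```python
# Time budgets per tier, taken from the tier headings above.
TIER_MINUTES = {"L1": 15, "L2": 25, "L3": 40}


def tier_for(exercise: int) -> str:
    """Return the tier label for an exercise number (1-15).

    Assumes five exercises per tier, in order, as in the tables above.
    """
    if not 1 <= exercise <= 15:
        raise ValueError(f"exercise must be between 1 and 15, got {exercise}")
    return f"L{(exercise - 1) // 5 + 1}"


def time_budget(exercise: int) -> int:
    """Return the estimated duration in minutes for an exercise."""
    return TIER_MINUTES[tier_for(exercise)]


if __name__ == "__main__":
    for n in (1, 8, 15):
        print(f"Exercise {n:02d}: tier {tier_for(n)}, {time_budget(n)} min")
```

Run as a script, it prints the tier and time budget for exercises 01, 08, and 15; pass any exercise number to `time_budget` to get the duration to set on your timer.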