Ops Archaeology: The Job That Succeeded Wrong
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L2 | Estimated time: 25 min | Domains: Kubernetes, Jobs, ConfigMaps, Database Ops
Artifact 1: CLI Output
$ kubectl get jobs -n analytics --sort-by=.status.startTime
NAME                    COMPLETIONS   DURATION   AGE
daily-rollup-28364800   1/1           4m12s      25h
daily-rollup-28365600   1/1           4m08s      1h
$ kubectl get pods -n analytics -l job-name=daily-rollup-28365600
NAME                          READY   STATUS      RESTARTS   AGE
daily-rollup-28365600-v9g4r   0/1     Completed   0          1h
$ kubectl logs daily-rollup-28365600-v9g4r -n analytics | tail -8
2024-12-10T06:00:04Z INFO Connecting to database...
2024-12-10T06:00:04Z INFO Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse
2024-12-10T06:00:04Z INFO Starting daily rollup for 2024-12-09
2024-12-10T06:00:05Z INFO Processing table: order_summaries
2024-12-10T06:00:38Z INFO Rolled up 14,293 records into 847 summary rows
2024-12-10T06:02:11Z INFO Processing table: revenue_by_region
2024-12-10T06:02:44Z INFO Rolled up 14,293 records into 12 region summaries
2024-12-10T06:04:08Z INFO Daily rollup complete. Duration: 4m04s. Status: SUCCESS
Artifact 2: Metrics
# Dashboard: "Analytics Pipeline Health"
# Daily revenue total (from rollup output) — last 7 days
analytics_daily_revenue_usd{date="2024-12-03"} 2847291
analytics_daily_revenue_usd{date="2024-12-04"} 2913847
analytics_daily_revenue_usd{date="2024-12-05"} 2791003
analytics_daily_revenue_usd{date="2024-12-06"} 3104582
analytics_daily_revenue_usd{date="2024-12-07"} 3247109
analytics_daily_revenue_usd{date="2024-12-08"} 287412
analytics_daily_revenue_usd{date="2024-12-09"} 291034
# Record counts processed
analytics_rollup_input_records{date="2024-12-08"} 14293
analytics_rollup_input_records{date="2024-12-09"} 14293
analytics_rollup_input_records{date="2024-12-07"} 142918
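One way to make any shift in these numbers explicit is to compute day-over-day ratios. The sketch below copies the dashboard values into plain dictionaries and flags each day whose value falls below a chosen fraction of the previous day's. The names `flag_drops`, `daily_revenue_usd`, and `input_records` are illustrative and not part of the pipeline, and the 0.5 threshold is an arbitrary assumption.

```python
# Values copied by hand from the "Analytics Pipeline Health" dashboard above.
daily_revenue_usd = {
    "2024-12-03": 2847291,
    "2024-12-04": 2913847,
    "2024-12-05": 2791003,
    "2024-12-06": 3104582,
    "2024-12-07": 3247109,
    "2024-12-08": 287412,
    "2024-12-09": 291034,
}
input_records = {
    "2024-12-07": 142918,
    "2024-12-08": 14293,
    "2024-12-09": 14293,
}


def flag_drops(series, threshold=0.5):
    """Return (date, ratio) pairs where a day's value is below
    `threshold` times the previous day's value.

    `threshold` is an assumed cutoff, not a value used by the real system.
    """
    dates = sorted(series)  # ISO-8601 dates sort chronologically as strings
    drops = []
    for prev, curr in zip(dates, dates[1:]):
        ratio = series[curr] / series[prev]
        if ratio < threshold:
            drops.append((curr, round(ratio, 3)))
    return drops


revenue_drops = flag_drops(daily_revenue_usd)
record_drops = flag_drops(input_records)
print("revenue drops:", revenue_drops)
print("record drops: ", record_drops)
```

Comparing the flagged dates across both series against the timestamps in the other artifacts is a reasonable next step.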
Artifact 3: Infrastructure Code
# From: k8s/configmaps/analytics-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config
  namespace: analytics
data:
  DB_HOST: "analytics-db"
  DB_PORT: "5432"
  DB_NAME: "warehouse"
  DB_USER: "rollup_svc"
---
# From: k8s/configmaps/analytics-config-staging.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config
  namespace: analytics
data:
  DB_HOST: "analytics-db"
  DB_PORT: "5432"
  DB_NAME: "warehouse_staging"
  DB_USER: "rollup_svc"
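When two manifests look nearly identical, a mechanical comparison is less error-prone than reading them side by side. The sketch below transcribes both manifests into dictionaries, checks whether they target the same object (same `name` and `namespace`), and lists the `data` keys whose values differ. The helper names `diff_config` and `target_of` are hypothetical, and the dictionaries are hand-copied rather than parsed from the files.

```python
# Hand-copied from the two manifests above, keyed by their file paths.
manifests = {
    "k8s/configmaps/analytics-config.yaml": {
        "name": "analytics-config",
        "namespace": "analytics",
        "data": {
            "DB_HOST": "analytics-db",
            "DB_PORT": "5432",
            "DB_NAME": "warehouse",
            "DB_USER": "rollup_svc",
        },
    },
    "k8s/configmaps/analytics-config-staging.yaml": {
        "name": "analytics-config",
        "namespace": "analytics",
        "data": {
            "DB_HOST": "analytics-db",
            "DB_PORT": "5432",
            "DB_NAME": "warehouse_staging",
            "DB_USER": "rollup_svc",
        },
    },
}


def target_of(manifest):
    """Return the (namespace, name) pair that identifies the object in the cluster."""
    return (manifest["namespace"], manifest["name"])


def diff_config(a, b):
    """Return {key: (a_value, b_value)} for keys that differ or exist on one side only."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


base, staging = manifests.values()  # dicts preserve insertion order
same_target = target_of(base) == target_of(staging)
differences = diff_config(base["data"], staging["data"])

print("same cluster object:", same_target)
print("differing keys:", differences)
```

If both files resolve to the same cluster object, applying either one replaces whatever the other wrote, so it is worth confirming which file the deploy tooling applied most recently.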
Artifact 4: Log Lines
[2024-12-10T06:00:04Z] rollup-job | INFO Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse
[2024-12-08T05:59:58Z] k8s-deploy | ConfigMap analytics-config replaced in namespace analytics
[2024-12-10T06:04:08Z] rollup-job | INFO Daily rollup complete. Duration: 4m04s. Status: SUCCESS
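The log lines above are not listed in time order, so a first step is to sort them into a timeline. The sketch below is a minimal parser that assumes every line follows the `[timestamp] source | message` shape seen here; `parse_line` is an illustrative helper, not part of any existing tooling.

```python
from datetime import datetime

# The three log lines from Artifact 4, in the order they were presented.
log_lines = [
    "[2024-12-10T06:00:04Z] rollup-job | INFO Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse",
    "[2024-12-08T05:59:58Z] k8s-deploy | ConfigMap analytics-config replaced in namespace analytics",
    "[2024-12-10T06:04:08Z] rollup-job | INFO Daily rollup complete. Duration: 4m04s. Status: SUCCESS",
]


def parse_line(line):
    """Split '[timestamp] source | message' into (datetime, source, message)."""
    stamp, rest = line[1:].split("] ", 1)
    source, message = rest.split(" | ", 1)
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return when, source, message


# Tuples sort by their first element, so this orders the events by time.
timeline = sorted(parse_line(line) for line in log_lines)
for when, source, message in timeline:
    print(when.isoformat(), source, "|", message)
```

With the events in order, each one can be lined up against the dates in the metrics and the job start times in the CLI output.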
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?