
Ops Archaeology: The Job That Succeeded Wrong

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L2
Estimated time: 25 min
Domains: Kubernetes, Jobs, ConfigMaps, Database Ops


Artifact 1: CLI Output

$ kubectl get jobs -n analytics --sort-by=.status.startTime
NAME                           COMPLETIONS   DURATION   AGE
daily-rollup-28364800          1/1           4m12s      25h
daily-rollup-28365600          1/1           4m08s      1h

$ kubectl get pods -n analytics -l job-name=daily-rollup-28365600
NAME                           READY   STATUS      RESTARTS   AGE
daily-rollup-28365600-v9g4r    0/1     Completed   0          1h

$ kubectl logs daily-rollup-28365600-v9g4r -n analytics | tail -8
2024-12-10T06:00:04Z INFO  Connecting to database...
2024-12-10T06:00:04Z INFO  Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse
2024-12-10T06:00:04Z INFO  Starting daily rollup for 2024-12-09
2024-12-10T06:00:05Z INFO  Processing table: order_summaries
2024-12-10T06:00:38Z INFO  Rolled up 14,293 records into 847 summary rows
2024-12-10T06:02:11Z INFO  Processing table: revenue_by_region
2024-12-10T06:02:44Z INFO  Rolled up 14,293 records into 12 region summaries
2024-12-10T06:04:08Z INFO  Daily rollup complete. Duration: 4m04s. Status: SUCCESS

Artifact 2: Metrics

# Dashboard: "Analytics Pipeline Health"

# Daily revenue total (from rollup output) — last 7 days
analytics_daily_revenue_usd{date="2024-12-03"} 2847291
analytics_daily_revenue_usd{date="2024-12-04"} 2913847
analytics_daily_revenue_usd{date="2024-12-05"} 2791003
analytics_daily_revenue_usd{date="2024-12-06"} 3104582
analytics_daily_revenue_usd{date="2024-12-07"} 3247109
analytics_daily_revenue_usd{date="2024-12-08"} 287412
analytics_daily_revenue_usd{date="2024-12-09"} 291034

# Record counts processed
analytics_rollup_input_records{date="2024-12-08"} 14293
analytics_rollup_input_records{date="2024-12-09"} 14293
analytics_rollup_input_records{date="2024-12-07"} 142918
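
The metric values above are easier to compare as day-over-day ratios. The sketch below is one possible way to do that, not part of the original tooling: the values are copied from Artifact 2, while the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
# Values copied verbatim from Artifact 2.
revenue_usd = {
    "2024-12-03": 2847291,
    "2024-12-04": 2913847,
    "2024-12-05": 2791003,
    "2024-12-06": 3104582,
    "2024-12-07": 3247109,
    "2024-12-08": 287412,
    "2024-12-09": 291034,
}
input_records = {
    "2024-12-07": 142918,
    "2024-12-08": 14293,
    "2024-12-09": 14293,
}


def day_over_day_ratio(series):
    """Return {date: value / previous value} for consecutive dates in sorted order."""
    dates = sorted(series)  # ISO dates sort chronologically as strings
    return {cur: series[cur] / series[prev] for prev, cur in zip(dates, dates[1:])}


def flag_drops(series, threshold=0.5):
    """List dates whose value fell below `threshold` times the previous day's value.

    The 0.5 default is an arbitrary illustrative cutoff, not a documented SLO.
    """
    return [d for d, ratio in day_over_day_ratio(series).items() if ratio < threshold]


if __name__ == "__main__":
    print("revenue drops:", flag_drops(revenue_usd))
    print("input record drops:", flag_drops(input_records))
```

Running both series through the same check makes it easy to see whether revenue and input volume changed on the same day and by a similar factor.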

Artifact 3: Infrastructure Code

# From: k8s/configmaps/analytics-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config
  namespace: analytics
data:
  DB_HOST: "analytics-db"
  DB_PORT: "5432"
  DB_NAME: "warehouse"
  DB_USER: "rollup_svc"
---
# From: k8s/configmaps/analytics-config-staging.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config
  namespace: analytics
data:
  DB_HOST: "analytics-db"
  DB_PORT: "5432"
  DB_NAME: "warehouse_staging"
  DB_USER: "rollup_svc"
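
Kubernetes identifies a ConfigMap by its name and namespace, so two manifests that share both describe the same object, and the one applied last determines the live data. A minimal sketch of comparing the two manifests above follows. The dictionaries are hand-copied from Artifact 3, and the variable and function names are hypothetical. The artifacts alone do not say which file was applied most recently.

```python
# Hand-copied from k8s/configmaps/analytics-config.yaml
config_main = {
    "name": "analytics-config",
    "namespace": "analytics",
    "data": {
        "DB_HOST": "analytics-db",
        "DB_PORT": "5432",
        "DB_NAME": "warehouse",
        "DB_USER": "rollup_svc",
    },
}

# Hand-copied from k8s/configmaps/analytics-config-staging.yaml
config_staging = {
    "name": "analytics-config",
    "namespace": "analytics",
    "data": {
        "DB_HOST": "analytics-db",
        "DB_PORT": "5432",
        "DB_NAME": "warehouse_staging",
        "DB_USER": "rollup_svc",
    },
}


def same_object(a, b):
    """True when both manifests address the same ConfigMap (same name and namespace)."""
    return (a["name"], a["namespace"]) == (b["name"], b["namespace"])


def data_diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for every data key whose value differs."""
    keys = sorted(set(a["data"]) | set(b["data"]))
    return {
        k: (a["data"].get(k), b["data"].get(k))
        for k in keys
        if a["data"].get(k) != b["data"].get(k)
    }


if __name__ == "__main__":
    print("same object:", same_object(config_main, config_staging))
    print("differing keys:", data_diff(config_main, config_staging))
```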

Artifact 4: Log Lines

[2024-12-10T06:00:04Z] rollup-job  | INFO  Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse
[2024-12-08T05:59:58Z] k8s-deploy  | ConfigMap analytics-config replaced in namespace analytics
[2024-12-10T06:04:08Z] rollup-job  | INFO  Daily rollup complete. Duration: 4m04s. Status: SUCCESS
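
The log lines above are not listed in chronological order, which makes the sequence of events easy to misread. The sketch below shows one way to order them by timestamp. It assumes every line follows the `[timestamp] source | message` shape seen in Artifact 4, and the helper names are hypothetical.

```python
from datetime import datetime

# Copied verbatim from Artifact 4, in the order given there.
LOG_LINES = [
    "[2024-12-10T06:00:04Z] rollup-job  | INFO  Connected to PostgreSQL 15.4 at analytics-db:5432/warehouse",
    "[2024-12-08T05:59:58Z] k8s-deploy  | ConfigMap analytics-config replaced in namespace analytics",
    "[2024-12-10T06:04:08Z] rollup-job  | INFO  Daily rollup complete. Duration: 4m04s. Status: SUCCESS",
]


def parse_line(line):
    """Split one log line into (timestamp, source, message)."""
    stamp, rest = line.split("] ", 1)
    # The trailing "Z" is matched literally; the timestamps are treated as UTC.
    ts = datetime.strptime(stamp.lstrip("["), "%Y-%m-%dT%H:%M:%SZ")
    source, message = rest.split("|", 1)
    return ts, source.strip(), message.strip()


def chronological(lines):
    """Return the parsed entries sorted from oldest to newest."""
    return sorted((parse_line(line) for line in lines), key=lambda entry: entry[0])


if __name__ == "__main__":
    for ts, source, message in chronological(LOG_LINES):
        print(ts.isoformat(), source, message, sep="  ")
```

Reading the sorted output alongside the dates in Artifact 2 helps place each event relative to the change in the metrics.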

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?