Answer Key: The Job That Succeeded Wrong¶
The System¶
A daily analytics pipeline that aggregates e-commerce transaction data. A CronJob runs every morning at 06:00 UTC, reading raw orders from a PostgreSQL data warehouse and producing summary tables (revenue by region, order summaries).
```
[CronJob: daily-rollup] --> [Job Pod]
        |
        | reads ConfigMap: analytics-config
        |   DB_HOST: analytics-db
        |   DB_NAME: warehouse_staging   (should be: warehouse)
        v
[PostgreSQL: analytics-db]
       /                  \
[warehouse]          [warehouse_staging]
142K records/day     14K test records
$3M daily revenue    $290K test revenue
```
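The wiring above implies the Job consumes the ConfigMap at startup. One common pattern is injecting it as environment variables; the sketch below assumes that pattern, and the image name and envFrom wiring are illustrative, not taken from the incident:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-rollup
  namespace: analytics
spec:
  schedule: "0 6 * * *"          # 06:00 UTC daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rollup
              image: analytics/rollup:latest   # hypothetical image
              envFrom:
                - configMapRef:
                    name: analytics-config     # DB_HOST / DB_NAME come from here
```

Because the ConfigMap is resolved by name at pod creation, whatever object currently holds that name in the namespace wins, which is exactly how the staging values reached production.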
What's Broken¶
Root cause: On December 8, someone applied the staging ConfigMap (analytics-config-staging.yaml) to the production namespace. Both files define a ConfigMap named analytics-config in namespace analytics, but the staging version points to DB_NAME: warehouse_staging. The kubectl apply replaced the production ConfigMap with the staging one.
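The collision is easiest to see with the two manifests side by side. These are reconstructed from the incident description, so treat the exact layout as illustrative:

```yaml
# k8s/configmaps/analytics-config.yaml (production)
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config        # same name...
  namespace: analytics          # ...same namespace
data:
  DB_HOST: analytics-db
  DB_NAME: warehouse
---
# k8s/configmaps/analytics-config-staging.yaml (staging values, same identity)
apiVersion: v1
kind: ConfigMap
metadata:
  name: analytics-config        # collides with the production object
  namespace: analytics
data:
  DB_HOST: analytics-db
  DB_NAME: warehouse_staging    # this silently replaced the prod value on Dec 8
```

Since both manifests describe the same (name, namespace) identity, kubectl apply of either file overwrites the other with no warning.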
The job connects to the same database server (analytics-db:5432) but reads from the staging database (warehouse_staging) which has only ~14K test records — about 10x fewer than production. The rollup runs successfully because the schema is identical, but the output is wrong: revenue figures are 10x lower ($290K vs $3M) and record counts are 10x lower (14,293 vs 142,918).
The job log shows Connected to ... analytics-db:5432/warehouse, which looks correct. But this is misleading: the log line is likely truncating the database name (warehouse_staging cut down to warehouse) or printing a hardcoded label rather than the live connection string. The definitive evidence is in the metrics: identical record counts (14,293) on Dec 8 and 9, because the static staging data does not change, and revenue dropping from $3M to $290K.
Key clue: Revenue dropping 10x on Dec 8, same day the ConfigMap was replaced. Record counts are identical on Dec 8 and 9 (14,293) — static test data does not change day-to-day.
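The magnitude clue lends itself to automation: an alert on a sharp day-over-day revenue drop would have paged someone on Dec 8 instead of waiting for a human to notice. The rule below is a hypothetical sketch; only the metric name analytics_daily_revenue_usd comes from this runbook's verification query:

```yaml
groups:
  - name: analytics-rollup
    rules:
      - alert: DailyRevenueAnomalous
        # Fires when today's revenue is less than half of yesterday's.
        expr: analytics_daily_revenue_usd < 0.5 * (analytics_daily_revenue_usd offset 1d)
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "Daily revenue dropped more than 50% day-over-day"
```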
The Fix¶
Immediate (restore production config)¶
```shell
# Apply the correct production ConfigMap
kubectl apply -f k8s/configmaps/analytics-config.yaml

# Verify the ConfigMap contents
kubectl get configmap analytics-config -n analytics -o yaml

# Re-run the rollup for the affected dates
kubectl create job daily-rollup-rerun-1208 --from=cronjob/daily-rollup -n analytics
```
Permanent¶
- Rename the staging ConfigMap (e.g., to analytics-config-staging) so that applying the staging manifest can never overwrite the production object.
- Add guardrails: separate namespaces (or clusters) per environment, and a kubectl diff step in CI so a cross-environment apply is caught before it lands.
- Add a data quality check to the rollup job that fails the run when record counts or revenue fall far outside the recent baseline.
- Backfill the corrupted rollup output for the affected dates (Dec 8 onward) after the correct config is restored.
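A minimal sketch of the data quality gate, written as a shell function the rollup entrypoint could call after writing its summary tables. The thresholds are hypothetical (roughly half of the normal 142K records and $3M revenue), not values from the incident:

```shell
#!/bin/sh
# Hypothetical post-rollup sanity gate. Normal days see ~142K records and
# ~$3M revenue, so anything below half of that is treated as a broken
# input (for example, a staging database wired in by mistake).
MIN_RECORDS=70000
MIN_REVENUE_USD=1500000

check_rollup_output() {
  records="$1"
  revenue="$2"
  if [ "$records" -lt "$MIN_RECORDS" ]; then
    echo "FAIL: record count $records below floor $MIN_RECORDS" >&2
    return 1
  fi
  if [ "$revenue" -lt "$MIN_REVENUE_USD" ]; then
    echo "FAIL: revenue $revenue below floor $MIN_REVENUE_USD" >&2
    return 1
  fi
  echo "OK: $records records, \$$revenue revenue"
}
```

A failing gate turns a silently wrong run into a visibly failed Job, which existing Completed/Failed alerting already catches.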
Verification¶
```shell
# Check the ConfigMap is correct
kubectl get configmap analytics-config -n analytics -o jsonpath='{.data.DB_NAME}'
# Should output: warehouse

# Wait for the next scheduled run and check its logs
kubectl logs -n analytics -l job-name=daily-rollup-28366400 --tail=5

# Verify revenue is back in the normal range (URL quoted so the shell
# does not interpret the query string)
curl -s 'http://analytics-dashboard:9090/api/v1/query?query=analytics_daily_revenue_usd'
```
Artifact Decoder¶
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | Job shows Completed/SUCCESS with normal duration — nothing looks wrong | "1/1 Completed" in 4 minutes looks perfectly healthy |
| Metrics | Revenue 10x drop on Dec 8, identical record counts (14,293) on Dec 8 and 9 = static test data | The metric names and structure look normal; you must notice the magnitude change |
| IaC Snippet | Two ConfigMaps with the same name in the same namespace — applying the wrong one replaces the right one | Both configs look valid individually; the danger is in the naming collision |
| Log Lines | ConfigMap replaced on Dec 8 — correlates with the revenue drop | Job log says "warehouse" which is the prod database name — but the data proves otherwise |
Skills Demonstrated¶
- Recognizing data correctness issues vs execution failures (the job "succeeded" but produced wrong results)
- Interpreting metric magnitude changes as diagnostic signals
- Understanding Kubernetes ConfigMap replacement semantics
- Identifying namespace/naming collision risks in Kubernetes manifests
- Designing data quality gates for batch jobs