Answer Key: The Job That Succeeded Wrong

The System

A daily analytics pipeline that aggregates e-commerce transaction data. A CronJob runs every morning at 06:00 UTC, reading raw orders from a PostgreSQL data warehouse and producing summary tables (revenue by region, order summaries).

[CronJob: daily-rollup] --> [Job Pod]
                                |
                           reads ConfigMap: analytics-config
                                |
                           DB_HOST: analytics-db
                           DB_NAME: warehouse          (expected: production DB)
                                    warehouse_staging  (actual: staging DB)
                                |
                           [PostgreSQL: analytics-db]
                              /              \
                    [warehouse]          [warehouse_staging]
                    142K records/day      14K test records
                    $3M daily revenue     $290K test revenue

What's Broken

Root cause: On December 8, someone applied the staging ConfigMap (analytics-config-staging.yaml) to the production namespace. Both files define a ConfigMap named analytics-config in namespace analytics, but the staging version points to DB_NAME: warehouse_staging. The kubectl apply replaced the production ConfigMap with the staging one.

The job connects to the same database server (analytics-db:5432) but reads from the staging database (warehouse_staging) which has only ~14K test records — about 10x fewer than production. The rollup runs successfully because the schema is identical, but the output is wrong: revenue figures are 10x lower ($290K vs $3M) and record counts are 10x lower (14,293 vs 142,918).

The job log shows Connected to ... analytics-db:5432/warehouse, which looks correct. This is misleading: if the logger were echoing the ConfigMap value it would print warehouse_staging, so the line is either truncating the actual database name (warehouse is a prefix of warehouse_staging) or printing a hard-coded label rather than the database the connection actually uses. The definitive evidence is the metrics: identical record counts (14,293) on Dec 8 and 9 (staging data is static), and revenue dropping from $3M to $290K.

Key clue: Revenue dropping 10x on Dec 8, same day the ConfigMap was replaced. Record counts are identical on Dec 8 and 9 (14,293) — static test data does not change day-to-day.
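
The two clues above can be turned into an automated detector. A minimal sketch in pure Python, with hypothetical metric values standing in for whatever the metrics backend returns: it flags both a large day-over-day magnitude drop and identical consecutive record counts (the signature of static test data).

```python
# Hypothetical daily metric samples; in practice these would come from the
# metrics backend (e.g. a range query against the analytics dashboard).
history = {
    "2024-12-07": {"records": 142918, "revenue_usd": 3_010_000},
    "2024-12-08": {"records": 14293,  "revenue_usd": 290_000},
    "2024-12-09": {"records": 14293,  "revenue_usd": 290_000},
}

def anomalies(history):
    """Flag >5x day-over-day record drops and identical consecutive counts."""
    alerts = []
    days = sorted(history)
    for prev, cur in zip(days, days[1:]):
        p, c = history[prev], history[cur]
        if c["records"] * 5 < p["records"]:
            alerts.append(f"{cur}: record count dropped more than 5x "
                          f"({p['records']} -> {c['records']})")
        if c["records"] == p["records"]:
            alerts.append(f"{cur}: record count identical to {prev}; "
                          f"static test data?")
    return alerts

for alert in anomalies(history):
    print(alert)
```

With the sample values this prints two alerts: the Dec 8 magnitude drop and the Dec 8/9 identical counts, exactly the pair of signals that cracked the incident.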

The Fix

Immediate (restore production config)

# Apply the correct production ConfigMap
kubectl apply -f k8s/configmaps/analytics-config.yaml

# Verify the ConfigMap contents
kubectl get configmap analytics-config -n analytics -o yaml

# Re-run the rollup for the affected dates
kubectl create job daily-rollup-rerun-1208 --from=cronjob/daily-rollup -n analytics
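
The "verify the ConfigMap contents" step can be scripted so it fails loudly instead of relying on eyeballing YAML. A sketch that checks the output of kubectl get configmap ... -o json (the function name and the inline sample are illustrative, not part of the incident's tooling):

```python
import json

EXPECTED_DB_NAME = "warehouse"  # production database

def check_db_name(configmap_json: str, expected: str = EXPECTED_DB_NAME) -> bool:
    """Return True if the ConfigMap's DB_NAME matches the expected value.

    configmap_json is the output of:
      kubectl get configmap analytics-config -n analytics -o json
    """
    cm = json.loads(configmap_json)
    return cm.get("data", {}).get("DB_NAME") == expected

# Inline sample of what the overwritten (staging) ConfigMap looked like:
sample = '{"data": {"DB_HOST": "analytics-db", "DB_NAME": "warehouse_staging"}}'
print(check_db_name(sample))  # False: the staging value fails the check
```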

Permanent

  1. Isolate the staging ConfigMap so it cannot collide with production, either by renaming it or, better, by giving staging its own namespace:

    # analytics-config-staging.yaml should target a separate namespace:
    metadata:
      name: analytics-config
      namespace: analytics-staging   # staging namespace, never analytics
    

  2. Add guardrails:

    # Add a namespace label check to CI
    # Staging configs should target a staging namespace, not analytics
    

  3. Add a data quality check to the rollup job:

    # In the rollup script, add a sanity check:
    if record_count < expected_minimum:
        log.error(f"Record count {record_count} below threshold {expected_minimum}")
        sys.exit(1)
    

  4. Backfill the corrupted data:

    # Re-run for each affected date.
    # Note: kubectl create job rejects a command override combined with
    # --from ("cannot specify --from and command"), so the --date/--force
    # flags cannot be passed here; inject them by rendering the CronJob's
    # jobTemplate with the extra args and applying that instead.
    for date in 2024-12-08 2024-12-09; do
      kubectl create job daily-rollup-backfill-${date} \
        --from=cronjob/daily-rollup -n analytics
    done
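
The guardrail in item 2 can be sketched as a pre-apply CI check. This assumes manifests are rendered to JSON before the check runs, and uses a deliberately simple hypothetical rule: any file with "staging" in its name must not target the production analytics namespace.

```python
import json

PROD_NAMESPACE = "analytics"

def staging_guard(filename: str, manifest_json: str) -> list:
    """CI guard: staging manifests must not target the production namespace."""
    errors = []
    manifest = json.loads(manifest_json)
    ns = manifest.get("metadata", {}).get("namespace", "")
    if "staging" in filename and ns == PROD_NAMESPACE:
        errors.append(f"{filename}: staging manifest targets production "
                      f"namespace '{PROD_NAMESPACE}'")
    return errors

# The exact mistake from this incident:
bad = ('{"kind": "ConfigMap", "metadata": '
       '{"name": "analytics-config", "namespace": "analytics"}}')
print(staging_guard("analytics-config-staging.yaml", bad))
```

Wired into CI as a required check, this would have blocked the December 8 apply before it reached the cluster.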
    

Verification

# Check the ConfigMap is correct
kubectl get configmap analytics-config -n analytics -o jsonpath='{.data.DB_NAME}'
# Should output: warehouse

# Wait for next scheduled run and check metrics
kubectl logs -n analytics -l job-name=daily-rollup-28366400 --tail=5

# Verify revenue is back to normal range
curl -s 'http://analytics-dashboard:9090/api/v1/query?query=analytics_daily_revenue_usd'
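
The last verification step can also be asserted in code. A sketch that checks the latest sample from a Prometheus-style instant-query response; the normal-revenue band ($2.5M to $4M) is an assumed threshold, not a figure from the incident:

```python
import json

def revenue_in_range(response_json: str,
                     low: float = 2_500_000, high: float = 4_000_000) -> bool:
    """Check the latest analytics_daily_revenue_usd sample is in the normal band.

    Expects the JSON shape of a Prometheus instant query:
      {"data": {"result": [{"value": [<timestamp>, "<value>"]}]}}
    """
    results = json.loads(response_json)["data"]["result"]
    if not results:
        return False  # no data is also a failure
    value = float(results[0]["value"][1])
    return low <= value <= high

healthy = '{"data": {"result": [{"value": [1733900000, "3010000"]}]}}'
broken  = '{"data": {"result": [{"value": [1733900000, "290000"]}]}}'
print(revenue_in_range(healthy), revenue_in_range(broken))
```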

Artifact Decoder

Artifact    | What It Revealed                                                                                      | What Was Misleading
CLI Output  | Job shows Completed/SUCCESS with normal duration; nothing looks wrong                                 | "1/1 Completed" in 4 minutes looks perfectly healthy
Metrics     | Revenue 10x drop on Dec 8; identical record counts (14,293) on Dec 8 and 9 = static test data        | The metric names and structure look normal; you must notice the magnitude change
IaC Snippet | Two ConfigMaps with the same name in the same namespace; applying the wrong one replaces the right one | Both configs look valid individually; the danger is in the naming collision
Log Lines   | ConfigMap replaced on Dec 8, correlating with the revenue drop                                        | Job log says "warehouse", the prod database name, but the data proves otherwise

Skills Demonstrated

  • Recognizing data correctness issues vs execution failures (the job "succeeded" but produced wrong results)
  • Interpreting metric magnitude changes as diagnostic signals
  • Understanding Kubernetes ConfigMap replacement semantics
  • Identifying namespace/naming collision risks in Kubernetes manifests
  • Designing data quality gates for batch jobs

Prerequisite Topic Packs