Answer Key: The Job That Succeeded Wrong

The System

A daily analytics pipeline that aggregates e-commerce transaction data. A CronJob runs every morning at 06:00 UTC, reading raw orders from a PostgreSQL data warehouse and producing summary tables (revenue by region, order summaries).

[CronJob: daily-rollup] --> [Job Pod]
                                |
                           reads ConfigMap: analytics-config
                                |
                           DB_HOST: analytics-db
                           DB_NAME: warehouse          (expected: production DB)
                                    warehouse_staging  (actual: staging DB)
                                |
                           [PostgreSQL: analytics-db]
                              /              \
                    [warehouse]          [warehouse_staging]
                    142K records/day      14K test records
                    $3M daily revenue     $290K test revenue

What's Broken

Root cause: On December 8, someone applied the staging ConfigMap (analytics-config-staging.yaml) to the production namespace. Both files define a ConfigMap named analytics-config in namespace analytics, but the staging version points to DB_NAME: warehouse_staging. The kubectl apply replaced the production ConfigMap with the staging one.

The job connects to the same database server (analytics-db:5432) but reads from the staging database (warehouse_staging) which has only ~14K test records — about 10x fewer than production. The rollup runs successfully because the schema is identical, but the output is wrong: revenue figures are 10x lower ($290K vs $3M) and record counts are 10x lower (14,293 vs 142,918).

The job log shows Connected to ... analytics-db:5432/warehouse, which looks correct. This is misleading: if the logger were echoing the ConfigMap value it would print warehouse_staging, so the line is either truncating the actual database name (warehouse is a prefix of warehouse_staging) or printing a hard-coded label rather than the database the connection actually uses. The definitive evidence is the metrics: identical record counts (14,293) on Dec 8 and 9 (staging data is static), and revenue dropping from $3M to $290K.

Key clue: Revenue dropping 10x on Dec 8, same day the ConfigMap was replaced. Record counts are identical on Dec 8 and 9 (14,293) — static test data does not change day-to-day.
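
The two clues above can be turned into an automated detector. A minimal sketch in pure Python, with hypothetical metric values standing in for whatever the metrics backend returns: it flags both a large day-over-day magnitude drop and identical consecutive record counts (the signature of static test data).

```python
# Hypothetical daily metric samples; in practice these would come from the
# metrics backend (e.g. a range query against the analytics dashboard).
history = {
    "2024-12-07": {"records": 142918, "revenue_usd": 3_010_000},
    "2024-12-08": {"records": 14293,  "revenue_usd": 290_000},
    "2024-12-09": {"records": 14293,  "revenue_usd": 290_000},
}

def anomalies(history):
    """Flag >5x day-over-day record drops and identical consecutive counts."""
    alerts = []
    days = sorted(history)
    for prev, cur in zip(days, days[1:]):
        p, c = history[prev], history[cur]
        if c["records"] * 5 < p["records"]:
            alerts.append(f"{cur}: record count dropped more than 5x "
                          f"({p['records']} -> {c['records']})")
        if c["records"] == p["records"]:
            alerts.append(f"{cur}: record count identical to {prev}; "
                          f"static test data?")
    return alerts

for alert in anomalies(history):
    print(alert)
```

With the sample values this prints two alerts: the Dec 8 magnitude drop and the Dec 8/9 identical counts, exactly the pair of signals that cracked the incident.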

The Fix

Immediate (restore production config)

# Apply the correct production ConfigMap
kubectl apply -f k8s/configmaps/analytics-config.yaml

# Verify the ConfigMap contents
kubectl get configmap analytics-config -n analytics -o yaml

# Re-run the rollup for the affected dates
kubectl create job daily-rollup-rerun-1208 --from=cronjob/daily-rollup -n analytics
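
The "verify the ConfigMap contents" step can be scripted so it fails loudly instead of relying on eyeballing YAML. A sketch that checks the output of kubectl get configmap ... -o json (the function name and the inline sample are illustrative, not part of the incident's tooling):

```python
import json

EXPECTED_DB_NAME = "warehouse"  # production database

def check_db_name(configmap_json: str, expected: str = EXPECTED_DB_NAME) -> bool:
    """Return True if the ConfigMap's DB_NAME matches the expected value.

    configmap_json is the output of:
      kubectl get configmap analytics-config -n analytics -o json
    """
    cm = json.loads(configmap_json)
    return cm.get("data", {}).get("DB_NAME") == expected

# Inline sample of what the overwritten (staging) ConfigMap looked like:
sample = '{"data": {"DB_HOST": "analytics-db", "DB_NAME": "warehouse_staging"}}'
print(check_db_name(sample))  # False: the staging value fails the check
```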

Permanent

  1. Isolate the staging ConfigMap so it cannot collide with production, either by renaming it or, better, by giving staging its own namespace:

    # analytics-config-staging.yaml should target a separate namespace:
    metadata:
      name: analytics-config
      namespace: analytics-staging   # staging namespace, never analytics
    

  2. Add guardrails:

    # Add a namespace label check to CI
    # Staging configs should target a staging namespace, not analytics
    

  3. Add a data quality check to the rollup job:

    # In the rollup script, add a sanity check:
    if record_count < expected_minimum:
        log.error(f"Record count {record_count} below threshold {expected_minimum}")
        sys.exit(1)
    

  4. Backfill the corrupted data:

    # Re-run for each affected date.
    # Note: kubectl create job rejects a command override combined with
    # --from ("cannot specify --from and command"), so the --date/--force
    # flags cannot be passed here; inject them by rendering the CronJob's
    # jobTemplate with the extra args and applying that instead.
    for date in 2024-12-08 2024-12-09; do
      kubectl create job daily-rollup-backfill-${date} \
        --from=cronjob/daily-rollup -n analytics
    done
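
The guardrail in item 2 can be sketched as a pre-apply CI check. This assumes manifests are rendered to JSON before the check runs, and uses a deliberately simple hypothetical rule: any file with "staging" in its name must not target the production analytics namespace.

```python
import json

PROD_NAMESPACE = "analytics"

def staging_guard(filename: str, manifest_json: str) -> list:
    """CI guard: staging manifests must not target the production namespace."""
    errors = []
    manifest = json.loads(manifest_json)
    ns = manifest.get("metadata", {}).get("namespace", "")
    if "staging" in filename and ns == PROD_NAMESPACE:
        errors.append(f"{filename}: staging manifest targets production "
                      f"namespace '{PROD_NAMESPACE}'")
    return errors

# The exact mistake from this incident:
bad = ('{"kind": "ConfigMap", "metadata": '
       '{"name": "analytics-config", "namespace": "analytics"}}')
print(staging_guard("analytics-config-staging.yaml", bad))
```

Wired into CI as a required check, this would have blocked the December 8 apply before it reached the cluster.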
    

Verification

# Check the ConfigMap is correct
kubectl get configmap analytics-config -n analytics -o jsonpath='{.data.DB_NAME}'
# Should output: warehouse

# Wait for next scheduled run and check metrics
kubectl logs -n analytics -l job-name=daily-rollup-28366400 --tail=5

# Verify revenue is back to normal range
curl -s 'http://analytics-dashboard:9090/api/v1/query?query=analytics_daily_revenue_usd'
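
The last verification step can also be asserted in code. A sketch that checks the latest sample from a Prometheus-style instant-query response; the normal-revenue band ($2.5M to $4M) is an assumed threshold, not a figure from the incident:

```python
import json

def revenue_in_range(response_json: str,
                     low: float = 2_500_000, high: float = 4_000_000) -> bool:
    """Check the latest analytics_daily_revenue_usd sample is in the normal band.

    Expects the JSON shape of a Prometheus instant query:
      {"data": {"result": [{"value": [<timestamp>, "<value>"]}]}}
    """
    results = json.loads(response_json)["data"]["result"]
    if not results:
        return False  # no data is also a failure
    value = float(results[0]["value"][1])
    return low <= value <= high

healthy = '{"data": {"result": [{"value": [1733900000, "3010000"]}]}}'
broken  = '{"data": {"result": [{"value": [1733900000, "290000"]}]}}'
print(revenue_in_range(healthy), revenue_in_range(broken))
```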

Artifact Decoder

Artifact    | What It Revealed                                                                                      | What Was Misleading
CLI Output  | Job shows Completed/SUCCESS with normal duration; nothing looks wrong                                 | "1/1 Completed" in 4 minutes looks perfectly healthy
Metrics     | Revenue 10x drop on Dec 8; identical record counts (14,293) on Dec 8 and 9 = static test data        | The metric names and structure look normal; you must notice the magnitude change
IaC Snippet | Two ConfigMaps with the same name in the same namespace; applying the wrong one replaces the right one | Both configs look valid individually; the danger is in the naming collision
Log Lines   | ConfigMap replaced on Dec 8, correlating with the revenue drop                                        | Job log says "warehouse", the prod database name, but the data proves otherwise

Skills Demonstrated

  • Recognizing data correctness issues vs execution failures (the job "succeeded" but produced wrong results)
  • Interpreting metric magnitude changes as diagnostic signals
  • Understanding Kubernetes ConfigMap replacement semantics
  • Identifying namespace/naming collision risks in Kubernetes manifests
  • Designing data quality gates for batch jobs

Prerequisite Topic Packs