Postmortem: Helm Values Mismatch Routes Staging Traffic to Production Database¶
| Field | Value |
|---|---|
| ID | PM-007 |
| Date | 2025-04-22 |
| Severity | SEV-2 |
| Duration | 38m (detection to resolution) |
| Time to Detect | 23m |
| Time to Mitigate | 38m |
| Customer Impact | No direct customer-facing degradation; backend data integrity at risk for 23 minutes |
| Revenue Impact | None (direct); data recovery and audit cost estimated at $3,100 in engineering hours |
| Teams Involved | Platform Engineering, Data Engineering, Storefront Backend, Database Reliability |
| Postmortem Author | Kenji Watanabe |
| Postmortem Date | 2025-04-25 |
Executive Summary¶
On 2025-04-22, a Helm values refactoring effort accidentally introduced a production PostgreSQL connection string into the storefront-staging values file. When the staging environment's nightly integration test suite ran at 03:00 UTC, it executed destructive SQL operations — DELETE FROM and TRUNCATE TABLE — against live production tables. The issue was detected 23 minutes into the test run when a Database Reliability engineer noticed anomalous write activity in production monitoring dashboards. Two production tables (categories and cms_blocks) were truncated before the test suite was halted. Data was restored from a point-in-time snapshot within 15 minutes of detection. No customer-facing service experienced visible degradation during the incident.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 2025-04-21 17:30 | Fatima Reyes begins Helm values refactoring: consolidating duplicate connection strings across staging, QA, and production values files into a shared _base.yaml |
| 2025-04-21 18:12 | Fatima copy-pastes POSTGRES_DSN from values-production.yaml into values-staging.yaml as a placeholder, intending to replace it with the staging DSN before committing |
| 2025-04-21 18:45 | Fatima is pulled into a P1 customer escalation call; does not return to the values refactoring |
| 2025-04-21 19:30 | Fatima commits and pushes the refactoring branch, including values-staging.yaml with the production DSN still in place; commit message: "chore: consolidate helm values base" |
| 2025-04-21 19:35 | PR is auto-approved by the platform-eng CODEOWNERS bot (file is not in a protected path) and merged to main |
| 2025-04-21 20:00 | ArgoCD syncs storefront-staging with the updated Helm values; staging pods restart with production POSTGRES_DSN |
| 2025-04-21 20:05 | No smoke test validates the connection string target; staging pods report healthy |
| 2025-04-22 03:00 | Nightly integration test suite begins on storefront-staging; first tests populate fixture data |
| 2025-04-22 03:02 | Test suite begins "teardown and rebuild" phase; issues DELETE FROM categories WHERE id > 0 against production DB |
| 2025-04-22 03:04 | TRUNCATE TABLE categories CASCADE executes; 2,847 category records deleted from production |
| 2025-04-22 03:06 | TRUNCATE TABLE cms_blocks CASCADE executes; 412 CMS block records deleted from production |
| 2025-04-22 03:08 | Test suite moves to table coupons; begins DELETE FROM coupons WHERE created_at < NOW() |
| 2025-04-22 03:12 | Database Reliability on-call (Oluwaseun Adeyemi) sees spike in n_dead_tup and autovacuum triggers in Grafana; unusual for 03:12 UTC on a Tuesday |
| 2025-04-22 03:18 | Oluwaseun queries pg_stat_activity; sees connection from storefront-staging app user hitting production cluster |
| 2025-04-22 03:20 | Oluwaseun kills staging connection, pages Incident Commander Kenji Watanabe, and posts in #incidents |
| 2025-04-22 03:23 | Kenji joins; confirms values-staging.yaml contains production DSN; test suite halted via ArgoCD manual sync suspend |
| 2025-04-22 03:25 | Fatima paged; confirms the copy-paste oversight |
| 2025-04-22 03:28 | Database Reliability initiates point-in-time recovery (PITR) for categories and cms_blocks from 02:58 UTC snapshot |
| 2025-04-22 03:35 | values-staging.yaml patched with the correct staging DSN; pushed directly to main |
| 2025-04-22 03:38 | PITR restore for both tables complete and validated; row counts match pre-incident baseline |
| 2025-04-22 03:40 | ArgoCD sync re-enabled for staging; pods restart against staging DB |
| 2025-04-22 03:45 | All-clear; data integrity confirmed by spot-check queries from Data Engineering |
| 2025-04-22 04:00 | Post-incident data audit begins; coupons table examined, no rows deleted (test was stopped mid-DELETE) |
Impact¶
Customer Impact¶
No customer-facing service experienced visible degradation. The categories and cms_blocks tables back the product catalog and CMS rendering pipeline, but both services have a 5-minute in-memory cache. The 38-minute incident occurred at 03:00 UTC, during the lowest-traffic window of the week (< 200 concurrent users), and the cache masked the missing data. Had the incident occurred during peak hours, or had the missing data outlasted the 5-minute cache window, catalog pages would have returned empty results.
Internal Impact¶
Database Reliability spent 4 hours on PITR validation and data audit post-restore. Data Engineering spent 3 hours running integrity checks across all tables the test suite touched before it was halted. Platform Engineering spent 2 hours patching and re-validating the Helm values refactoring. Total: approximately 9 engineer-hours. The nightly staging test suite was not re-run; a full QA sign-off was deferred by one day, delaying a planned storefront feature release.
Data Impact¶
Two production tables were truncated: categories (2,847 rows) and cms_blocks (412 rows). Both were fully restored from PITR. The coupons table received a partial DELETE that was stopped before any rows were committed (the connection was killed during the DELETE execution; PostgreSQL rolled back the uncommitted transaction). Data Engineering confirmed no other tables were modified. Write-ahead log (WAL) audit confirmed no data left the production database cluster.
Root Cause¶
What Happened (Technical)¶
The storefront-staging Helm chart separates connection strings by environment using distinct values files: values-staging.yaml, values-production.yaml, and values-qa.yaml. During a refactoring effort to deduplicate shared configuration into a _base.yaml, Fatima needed to reference the production DSN format as a template. She copy-pasted the literal production connection string — postgresql://storefront_app:REDACTED@prod-pg-cluster.internal:5432/storefront — into values-staging.yaml as a placeholder. An interruption prevented her from replacing it before committing.
The production DSN uses a dedicated application user (storefront_app) that has SELECT, INSERT, UPDATE, and DELETE privileges on all storefront schema tables, and TRUNCATE privilege on tables the application manages (including categories and cms_blocks). The staging application user has the same privilege set against the staging cluster. There is no network policy or database-level restriction preventing the staging Kubernetes namespace from reaching the production PostgreSQL cluster; both are within the same VPC and the security group allows inbound 5432 from any internal CIDR.
The nightly integration test suite runs in the storefront-staging namespace and assumes it is operating against a staging database. Its setup phase calls a reset_db() fixture that issues TRUNCATE TABLE on a hardcoded list of 23 tables, alphabetically ordered, before re-populating them with fixture data. This fixture has no guard that validates the target database hostname against an allowlist before executing destructive operations. Tables are processed alphabetically: categories and cms_blocks appear before coupons, orders, and payments.
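The missing safeguard described above can be sketched as a pre-condition check inside the fixture. This is an illustrative reconstruction, not the actual test suite code: the names (`assert_staging_target`, `UnsafeTargetError`, the table list) are assumptions.

```python
from urllib.parse import urlparse

# Hypothetical subset of the 23 hardcoded tables the real fixture truncates.
TABLES = ["categories", "cms_blocks", "coupons"]


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive fixture targets a non-staging host."""


def assert_staging_target(dsn: str) -> None:
    # Guard: refuse to run destructive SQL unless the DSN hostname
    # matches a known staging pattern.
    host = urlparse(dsn).hostname or ""
    if "staging" not in host:
        raise UnsafeTargetError(
            f"refusing destructive reset against non-staging host: {host!r}"
        )


def reset_db(conn, dsn: str) -> None:
    # Sketch of the teardown-and-rebuild phase described above.
    assert_staging_target(dsn)  # the guard that was missing on 2025-04-22
    with conn.cursor() as cur:
        for table in sorted(TABLES):  # alphabetical, as in the incident
            cur.execute(f"TRUNCATE TABLE {table} CASCADE")
```

With this guard in place, the production DSN introduced by the refactoring would have raised `UnsafeTargetError` at 03:00 UTC, before any statement reached the production cluster.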
Contributing Factors¶
- Connection strings stored as plain text in values files: Production DSNs in `values-production.yaml` are stored in cleartext, not as references to Kubernetes Secrets or external secret manager values (e.g., AWS Secrets Manager via External Secrets Operator). This means that copy-pasting a values file block inadvertently copies a live, working credential rather than a symbolic reference that would fail to resolve in the wrong context.
- No network policy separating staging from the production database: The production PostgreSQL cluster accepts connections from the staging Kubernetes namespace because no NetworkPolicy or security group rule blocks cross-environment database traffic. An environment-isolated database network would have caused the connection to fail immediately, making the misconfiguration detectable before any data was modified.
- Staging test suite does not use transaction wrapping for destructive operations: The `reset_db()` fixture issues DDL (TRUNCATE) and DML (DELETE) without wrapping them in a transaction that could be rolled back if a pre-condition check fails. A guard of the form `assert 'staging' in db_host` before any destructive operation would have raised an exception immediately and prevented any data modification.
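The network isolation gap described above could be closed with a Kubernetes NetworkPolicy along the following lines. This is a sketch under stated assumptions: the namespace name, VPC CIDR, and production DB subnet are illustrative, not taken from the actual cluster configuration.

```yaml
# Sketch: restrict egress from the staging namespace so that the
# production database subnet is unreachable on any port, including 5432.
# NetworkPolicy is allowlist-based: once pods are selected for Egress,
# only the destinations listed below remain reachable.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-prod-db-egress
  namespace: storefront-staging        # assumed namespace name
spec:
  podSelector: {}                      # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8           # internal VPC range (assumed)
            except:
              - 10.20.0.0/16           # production DB subnet (assumed)
```

Note that NetworkPolicy enforcement requires a CNI plugin that supports it (e.g., Calico or Cilium); on a cluster without one, the policy is accepted but has no effect.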
What We Got Lucky About¶
- The test suite processes tables alphabetically. The incident was caught and stopped at the "C" tables (`categories`, `cms_blocks`). Tables that begin with "O" and "P" (`orders`, `payments`, `payment_methods`) were never reached. Truncating `orders` or `payments` would have been a SEV-1 data loss event with significantly longer recovery time and potential regulatory implications.
- The PITR backup taken at 02:58 UTC — just 2 minutes before the destructive operations began — was clean and complete. The restore was straightforward and took 10 minutes to execute and 5 minutes to validate. A longer backup interval or a corrupted snapshot would have escalated recovery complexity significantly.
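For context on the restore mechanics: a PostgreSQL point-in-time recovery to just before the first destructive statement is configured roughly as below. This is a sketch of standard PostgreSQL 12+ recovery settings, not the team's actual procedure (which lives in the PITR runbook); the archive path is an assumption, and in this incident the team restored only two tables, which typically means recovering to a scratch instance and copying the tables across rather than promoting the restored cluster in place.

```ini
# postgresql.conf recovery settings on the restored instance (PostgreSQL 12+),
# plus an empty recovery.signal file in the data directory.
restore_command = 'cp /pgbackup/wal/%f "%p"'      # assumed WAL archive path
recovery_target_time = '2025-04-22 02:58:00+00'   # last clean snapshot
recovery_target_action = 'promote'                # go read-write once reached
```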
Detection¶
How We Detected¶
Database Reliability engineer Oluwaseun Adeyemi was doing a routine 03:00 UTC shift handoff review of Grafana dashboards when he noticed n_dead_tup spiking on production tables and autovacuum triggering mid-cycle. This is anomalous at 03:12 UTC and prompted him to inspect pg_stat_activity, where he saw the storefront-staging application user connected to the production cluster. Detection was human and coincidental — no automated alert fired.
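The `pg_stat_activity` inspection described above can be approximated by a query like the following. It is a sketch: the database name is an assumption, and the columns shown are the standard `pg_stat_activity` fields relevant to spotting a cross-environment client.

```sql
-- List client connections to the production database, newest first.
-- A client_addr inside the staging pod CIDR is the red flag here.
SELECT pid,
       usename,
       client_addr,
       application_name,
       state,
       backend_start
FROM pg_stat_activity
WHERE datname = 'storefront'   -- assumed database name
ORDER BY backend_start DESC;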
Why We Didn't Detect Sooner¶
No alert exists for cross-environment database connections. The production PostgreSQL cluster does not log or alert on connections from application users that originate from unexpected source namespaces or IPs. ArgoCD sync did not flag the values change as dangerous because it has no awareness of what connection strings point to. The PR review was automated (CODEOWNERS bot) because values-staging.yaml is not a protected file. No human reviewed the diff.
Response¶
What Went Well¶
- Oluwaseun's decision to immediately kill the staging connection rather than investigate further was correct and prevented additional tables from being truncated. Acting first and investigating second saved at minimum 5–10 minutes of additional data destruction.
- PITR was well-practiced. The Database Reliability team had run a PITR drill 6 weeks prior; the restore procedure was documented and executed without hesitation. The 10-minute restore time is within the team's SLA.
- Data Engineering's post-incident audit was thorough and completed within 4 hours, providing confidence that no other tables were affected.
What Went Poorly¶
- The PR that introduced the bad values file was auto-merged without human review. A configuration file containing a production database DSN should require mandatory human review, regardless of which path it lives in.
- There was no automated detection. The 23-minute detection window was entirely dependent on an engineer happening to look at dashboards. An alert for "staging application user connecting to production cluster" would have fired within seconds.
- Communication lag: Fatima was not paged until 03:25 — 5 minutes after root cause was confirmed. Earlier notification would not have changed the outcome but would have accelerated the values file patch.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-007-01 | Migrate all production DSNs from plaintext in values files to External Secrets Operator references; staging values must not be able to resolve production secret paths | P0 | Platform Engineering | Open | 2025-05-09 |
| AI-007-02 | Add NetworkPolicy blocking egress from staging namespace to production database CIDR on port 5432 | P0 | Platform Engineering | Open | 2025-05-02 |
| AI-007-03 | Add pre-condition guard in reset_db() fixture: assert target DB hostname matches *-staging* pattern before executing any TRUNCATE or DELETE | P0 | Storefront Backend | Open | 2025-04-29 |
| AI-007-04 | Add CODEOWNERS rule requiring human approval for any change to files matching *values*.yaml that contains a hostname, DSN, or connection string | P1 | Platform Engineering | Open | 2025-04-29 |
| AI-007-05 | Create Grafana alert: fire when a connection to production PostgreSQL originates from a source IP outside the production CIDR range | P1 | Database Reliability | Open | 2025-05-06 |
| AI-007-06 | Run tabletop exercise: simulate staging-hits-prod database scenario in the next on-call training cycle | P2 | Incident Command | Open | 2025-05-30 |
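As a sketch of the path-matching half of AI-007-04, a GitHub CODEOWNERS rule might look like the following. The team handle is an assumption; note that CODEOWNERS matches file paths only, so the "contains a hostname, DSN, or connection string" condition would still need a separate CI content lint.

```
# .github/CODEOWNERS — require human platform-eng review on Helm values files
**/values*.yaml    @example-org/platform-eng-reviewers
**/values/*.yaml   @example-org/platform-eng-reviewers
```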
Lessons Learned¶
- Plaintext credentials in config files propagate silently: When a DSN is a symbolic reference to a secret store, copying the reference to the wrong file is immediately detectable (it won't resolve). When it's a literal connection string, the misconfiguration is invisible until the application successfully connects — to the wrong target.
- Alphabetical test ordering is not a safety net, but it saved us: The fact that critical tables (`orders`, `payments`) appear late in the alphabet is coincidence. Tests that execute destructive operations must validate their execution environment explicitly, not rely on being stopped before they reach sensitive data.
- Human detection is not a monitoring strategy: A 23-minute window of production data destruction was closed by an engineer glancing at a dashboard during a shift handoff. Cross-environment connection anomalies are machine-detectable within seconds and should generate automated alerts.
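The symbolic-reference pattern from the first lesson can be sketched with External Secrets Operator: the values file carries only a secret name, and copying that reference into staging fails to resolve because the staging SecretStore has no access to production paths. All names, namespaces, and Secrets Manager paths below are assumptions for illustration.

```yaml
# Sketch: the app reads POSTGRES_DSN from the generated Kubernetes Secret
# instead of a cleartext values entry; only the reference lives in Git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: storefront-db-dsn
  namespace: storefront-production     # assumed namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager          # per-environment store; the staging
    kind: SecretStore                  # store cannot read prod/ paths
  target:
    name: storefront-db-credentials    # Secret the pods mount or env-map
  data:
    - secretKey: POSTGRES_DSN
      remoteRef:
        key: prod/storefront/postgres-dsn   # assumed Secrets Manager path
```

Under this scheme, the 2025-04-21 copy-paste would have produced staging pods that fail to start with a missing-secret error, surfacing the mistake at 20:00 UTC instead of 03:12.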
Cross-References¶
- Failure Pattern: Configuration Leak — Staging Inherits Production Credentials; Destructive Test Without Environment Guard
- Topic Packs: Kubernetes Networking and NetworkPolicy; Helm Values Management; External Secrets Operator; Database PITR
- Runbook: `runbooks/database/pitr-restore.md`; `runbooks/deployments/helm-values-audit.md`
- Decision Tree: Unexpected Production Writes → Check `pg_stat_activity` Source → Identify Application User → Kill Connection → Scope Damage → PITR vs Row-Level Restore