Postmortem: Helm Values Mismatch Routes Staging Traffic to Production Database¶
| Field | Value |
|---|---|
| ID | PM-007 |
| Date | 2025-04-22 |
| Severity | SEV-2 |
| Duration | 38m (detection to resolution) |
| Time to Detect | 23m |
| Time to Mitigate | 38m |
| Customer Impact | No direct customer-facing degradation; backend data integrity at risk for 23 minutes |
| Revenue Impact | None (direct); data recovery and audit cost estimated at $3,100 in engineering hours |
| Teams Involved | Platform Engineering, Data Engineering, Storefront Backend, Database Reliability |
| Postmortem Author | Kenji Watanabe |
| Postmortem Date | 2025-04-25 |
Executive Summary¶
On 2025-04-22, a Helm values refactoring effort accidentally introduced a production PostgreSQL connection string into the storefront-staging values file. When the staging environment's nightly integration test suite ran at 03:00 UTC, it executed destructive SQL operations — DELETE FROM and TRUNCATE TABLE — against live production tables. The issue was detected 23 minutes into the test run when a Database Reliability engineer noticed anomalous write activity in production monitoring dashboards. Two production tables (categories and cms_blocks) were truncated before the test suite was halted. Data was restored from a point-in-time snapshot within 15 minutes of detection. No customer-facing service experienced visible degradation during the incident.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 2025-04-21 17:30 | Fatima Reyes begins Helm values refactoring: consolidating duplicate connection strings across staging, QA, and production values files into a shared _base.yaml |
| 2025-04-21 18:12 | Fatima copy-pastes POSTGRES_DSN from values-production.yaml into values-staging.yaml as a placeholder, intending to replace it with the staging DSN before committing |
| 2025-04-21 18:45 | Fatima is pulled into a P1 customer escalation call; does not return to the values refactoring |
| 2025-04-21 19:30 | Fatima commits and pushes the refactoring branch, including values-staging.yaml with the production DSN still in place; commit message: "chore: consolidate helm values base" |
| 2025-04-21 19:35 | PR is auto-approved by the platform-eng CODEOWNERS bot (file is not in a protected path) and merged to main |
| 2025-04-21 20:00 | ArgoCD syncs storefront-staging with the updated Helm values; staging pods restart with production POSTGRES_DSN |
| 2025-04-21 20:05 | No smoke test validates the connection string target; staging pods report healthy |
| 2025-04-22 03:00 | Nightly integration test suite begins on storefront-staging; first tests populate fixture data |
| 2025-04-22 03:02 | Test suite begins "teardown and rebuild" phase; issues DELETE FROM categories WHERE id > 0 against production DB |
| 2025-04-22 03:04 | TRUNCATE TABLE categories CASCADE executes; 2,847 category records deleted from production |
| 2025-04-22 03:06 | TRUNCATE TABLE cms_blocks CASCADE executes; 412 CMS block records deleted from production |
| 2025-04-22 03:08 | Test suite moves to table coupons; begins DELETE FROM coupons WHERE created_at < NOW() |
| 2025-04-22 03:12 | Database Reliability on-call (Oluwaseun Adeyemi) sees spike in n_dead_tup and autovacuum triggers in Grafana; unusual for 03:12 UTC on a Tuesday |
| 2025-04-22 03:18 | Oluwaseun queries pg_stat_activity; sees connection from storefront-staging app user hitting production cluster |
| 2025-04-22 03:20 | Oluwaseun kills staging connection, pages Incident Commander Kenji Watanabe, and posts in #incidents |
| 2025-04-22 03:23 | Kenji joins; confirms values-staging.yaml contains production DSN; test suite halted via ArgoCD manual sync suspend |
| 2025-04-22 03:25 | Fatima paged; confirms the copy-paste oversight |
| 2025-04-22 03:28 | Database Reliability initiates point-in-time recovery (PITR) for categories and cms_blocks from 02:58 UTC snapshot |
| 2025-04-22 03:35 | values-staging.yaml patched with the correct staging DSN; pushed directly to main |
| 2025-04-22 03:38 | PITR restore for both tables complete and validated; row counts match pre-incident baseline |
| 2025-04-22 03:40 | ArgoCD sync re-enabled for staging; pods restart against staging DB |
| 2025-04-22 03:45 | All-clear; data integrity confirmed by spot-check queries from Data Engineering |
| 2025-04-22 04:00 | Post-incident data audit begins; coupons table examined, no rows deleted (test was stopped mid-DELETE) |
Impact¶
Customer Impact¶
No customer-facing service experienced visible degradation. The categories and cms_blocks tables back the product catalog and CMS rendering pipeline, but both services have a 5-minute in-memory cache. The 38-minute incident occurred at 03:00 UTC, during the lowest-traffic window of the week (< 200 concurrent users), and the cache masked the missing data. Had the incident occurred during peak hours, or had the missing data outlasted the 5-minute cache window, catalog pages would have returned empty results.
Internal Impact¶
Database Reliability spent 4 hours on PITR validation and data audit post-restore. Data Engineering spent 3 hours running integrity checks across all tables the test suite touched before it was halted. Platform Engineering spent 2 hours patching and re-validating the Helm values refactoring. Total: approximately 9 engineer-hours. The nightly staging test suite was not re-run; a full QA sign-off was deferred by one day, delaying a planned storefront feature release.
Data Impact¶
Two production tables were truncated: categories (2,847 rows) and cms_blocks (412 rows). Both were fully restored from PITR. The coupons table received a partial DELETE that was stopped before any rows were committed (the connection was killed during the DELETE execution; PostgreSQL rolled back the uncommitted transaction). Data Engineering confirmed no other tables were modified. Write-ahead log (WAL) audit confirmed no data left the production database cluster.
Root Cause¶
What Happened (Technical)¶
The storefront-staging Helm chart separates connection strings by environment using distinct values files: values-staging.yaml, values-production.yaml, and values-qa.yaml. During a refactoring effort to deduplicate shared configuration into a _base.yaml, Fatima needed to reference the production DSN format as a template. She copy-pasted the literal production connection string — postgresql://storefront_app:REDACTED@prod-pg-cluster.internal:5432/storefront — into values-staging.yaml as a placeholder. An interruption prevented her from replacing it before committing.
The production DSN uses a dedicated application user (storefront_app) that has SELECT, INSERT, UPDATE, and DELETE privileges on all storefront schema tables, and TRUNCATE privilege on tables the application manages (including categories and cms_blocks). The staging application user has the same privilege set against the staging cluster. There is no network policy or database-level restriction preventing the staging Kubernetes namespace from reaching the production PostgreSQL cluster; both are within the same VPC and the security group allows inbound 5432 from any internal CIDR.
The nightly integration test suite runs in the storefront-staging namespace and assumes it is operating against a staging database. Its setup phase calls a reset_db() fixture that issues TRUNCATE TABLE on a hardcoded list of 23 tables, alphabetically ordered, before re-populating them with fixture data. This fixture has no guard that validates the target database hostname against an allowlist before executing destructive operations. Tables are processed alphabetically: categories and cms_blocks appear before coupons, orders, and payments.
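The missing safeguard described above can be sketched as a pre-condition check inside the fixture. This is an illustrative reconstruction, not the actual test suite code: the names (`assert_staging_target`, `UnsafeTargetError`, the table list) are assumptions.

```python
from urllib.parse import urlparse

# Hypothetical subset of the 23 hardcoded tables the real fixture truncates.
TABLES = ["categories", "cms_blocks", "coupons"]


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive fixture targets a non-staging host."""


def assert_staging_target(dsn: str) -> None:
    # Guard: refuse to run destructive SQL unless the DSN hostname
    # matches a known staging pattern.
    host = urlparse(dsn).hostname or ""
    if "staging" not in host:
        raise UnsafeTargetError(
            f"refusing destructive reset against non-staging host: {host!r}"
        )


def reset_db(conn, dsn: str) -> None:
    # Sketch of the teardown-and-rebuild phase described above.
    assert_staging_target(dsn)  # the guard that was missing on 2025-04-22
    with conn.cursor() as cur:
        for table in sorted(TABLES):  # alphabetical, as in the incident
            cur.execute(f"TRUNCATE TABLE {table} CASCADE")
```

With this guard in place, the production DSN introduced by the refactoring would have raised `UnsafeTargetError` at 03:00 UTC, before any statement reached the production cluster.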
Contributing Factors¶
- Connection strings stored as plain text in values files: Production DSNs in `values-production.yaml` are stored in cleartext, not as references to Kubernetes Secrets or external secret manager values (e.g., AWS Secrets Manager via External Secrets Operator). This means that copy-pasting a values file block inadvertently copies a live, working credential rather than a symbolic reference that would fail to resolve in the wrong context.
- No network policy separating staging from the production database: The production PostgreSQL cluster accepts connections from the staging Kubernetes namespace because no NetworkPolicy or security group rule blocks cross-environment database traffic. An environment-isolated database network would have caused the connection to fail immediately, making the misconfiguration detectable before any data was modified.
- Staging test suite does not use transaction wrapping for destructive operations: The `reset_db()` fixture issues DDL (TRUNCATE) and DML (DELETE) without wrapping them in a transaction that could be rolled back if a pre-condition check fails. A guard of the form `assert 'staging' in db_host` before any destructive operation would have raised an exception immediately and prevented any data modification.
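The network isolation gap described above could be closed with a Kubernetes NetworkPolicy along the following lines. This is a sketch under stated assumptions: the namespace name, VPC CIDR, and production DB subnet are illustrative, not taken from the actual cluster configuration.

```yaml
# Sketch: restrict egress from the staging namespace so that the
# production database subnet is unreachable on any port, including 5432.
# NetworkPolicy is allowlist-based: once pods are selected for Egress,
# only the destinations listed below remain reachable.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-prod-db-egress
  namespace: storefront-staging        # assumed namespace name
spec:
  podSelector: {}                      # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8           # internal VPC range (assumed)
            except:
              - 10.20.0.0/16           # production DB subnet (assumed)
```

Note that NetworkPolicy enforcement requires a CNI plugin that supports it (e.g., Calico or Cilium); on a cluster without one, the policy is accepted but has no effect.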
What We Got Lucky About¶
- The test suite processes tables alphabetically. The incident was caught and stopped at the "C" tables (`categories`, `cms_blocks`). Tables that begin with "O" and "P" (`orders`, `payments`, `payment_methods`) were never reached. Truncating `orders` or `payments` would have been a SEV-1 data loss event with significantly longer recovery time and potential regulatory implications.
- The PITR backup taken at 02:58 UTC — just 2 minutes before the destructive operations began — was clean and complete. The restore was straightforward and took 10 minutes to execute and 5 minutes to validate. A longer backup interval or a corrupted snapshot would have escalated recovery complexity significantly.
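For context on the restore mechanics: a PostgreSQL point-in-time recovery to just before the first destructive statement is configured roughly as below. This is a sketch of standard PostgreSQL 12+ recovery settings, not the team's actual procedure (which lives in the PITR runbook); the archive path is an assumption, and in this incident the team restored only two tables, which typically means recovering to a scratch instance and copying the tables across rather than promoting the restored cluster in place.

```ini
# postgresql.conf recovery settings on the restored instance (PostgreSQL 12+),
# plus an empty recovery.signal file in the data directory.
restore_command = 'cp /pgbackup/wal/%f "%p"'      # assumed WAL archive path
recovery_target_time = '2025-04-22 02:58:00+00'   # last clean snapshot
recovery_target_action = 'promote'                # go read-write once reached
```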
Detection¶
How We Detected¶
Database Reliability engineer Oluwaseun Adeyemi was doing a routine 03:00 UTC shift handoff review of Grafana dashboards when he noticed n_dead_tup spiking on production tables and autovacuum triggering mid-cycle. This is anomalous at 03:12 UTC and prompted him to inspect pg_stat_activity, where he saw the storefront-staging application user connected to the production cluster. Detection was human and coincidental — no automated alert fired.
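The `pg_stat_activity` inspection described above can be approximated by a query like the following. It is a sketch: the database name is an assumption, and the columns shown are the standard `pg_stat_activity` fields relevant to spotting a cross-environment client.

```sql
-- List client connections to the production database, newest first.
-- A client_addr inside the staging pod CIDR is the red flag here.
SELECT pid,
       usename,
       client_addr,
       application_name,
       state,
       backend_start
FROM pg_stat_activity
WHERE datname = 'storefront'   -- assumed database name
ORDER BY backend_start DESC;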
Why We Didn't Detect Sooner¶
No alert exists for cross-environment database connections. The production PostgreSQL cluster does not log or alert on connections from application users that originate from unexpected source namespaces or IPs. ArgoCD sync did not flag the values change as dangerous because it has no awareness of what connection strings point to. The PR review was automated (CODEOWNERS bot) because values-staging.yaml is not a protected file. No human reviewed the diff.
Response¶
What Went Well¶
- Oluwaseun's decision to immediately kill the staging connection rather than investigate further was correct and prevented additional tables from being truncated. Acting first and investigating second saved at minimum 5–10 minutes of additional data destruction.
- PITR was well-practiced. The Database Reliability team had run a PITR drill 6 weeks prior; the restore procedure was documented and executed without hesitation. The 10-minute restore time is within the team's SLA.
- Data Engineering's post-incident audit was thorough and completed within 4 hours, providing confidence that no other tables were affected.
What Went Poorly¶
- The PR that introduced the bad values file was auto-merged without human review. A configuration file containing a production database DSN should require mandatory human review, regardless of which path it lives in.
- There was no automated detection. The 23-minute detection window was entirely dependent on an engineer happening to look at dashboards. An alert for "staging application user connecting to production cluster" would have fired within seconds.
- Communication lag: Fatima was not paged until 03:25 — 5 minutes after root cause was confirmed. Earlier notification would not have changed the outcome but would have accelerated the values file patch.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-007-01 | Migrate all production DSNs from plaintext in values files to External Secrets Operator references; staging values must not be able to resolve production secret paths | P0 | Platform Engineering | Open | 2025-05-09 |
| AI-007-02 | Add NetworkPolicy blocking egress from staging namespace to production database CIDR on port 5432 | P0 | Platform Engineering | Open | 2025-05-02 |
| AI-007-03 | Add pre-condition guard in reset_db() fixture: assert target DB hostname matches *-staging* pattern before executing any TRUNCATE or DELETE | P0 | Storefront Backend | Open | 2025-04-29 |
| AI-007-04 | Add CODEOWNERS rule requiring human approval for any change to files matching *values*.yaml that contains a hostname, DSN, or connection string | P1 | Platform Engineering | Open | 2025-04-29 |
| AI-007-05 | Create Grafana alert: fire when a connection to production PostgreSQL originates from a source IP outside the production CIDR range | P1 | Database Reliability | Open | 2025-05-06 |
| AI-007-06 | Run tabletop exercise: simulate staging-hits-prod database scenario in the next on-call training cycle | P2 | Incident Command | Open | 2025-05-30 |
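As a sketch of the path-matching half of AI-007-04, a GitHub CODEOWNERS rule might look like the following. The team handle is an assumption; note that CODEOWNERS matches file paths only, so the "contains a hostname, DSN, or connection string" condition would still need a separate CI content lint.

```
# .github/CODEOWNERS — require human platform-eng review on Helm values files
**/values*.yaml    @example-org/platform-eng-reviewers
**/values/*.yaml   @example-org/platform-eng-reviewers
```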
Lessons Learned¶
- Plaintext credentials in config files propagate silently: When a DSN is a symbolic reference to a secret store, copying the reference to the wrong file is immediately detectable (it won't resolve). When it's a literal connection string, the misconfiguration is invisible until the application successfully connects — to the wrong target.
- Alphabetical test ordering is not a safety net, but it saved us: The fact that critical tables (`orders`, `payments`) appear late in the alphabet is coincidence. Tests that execute destructive operations must validate their execution environment explicitly, not rely on being stopped before they reach sensitive data.
- Human detection is not a monitoring strategy: A 23-minute window of production data destruction was closed by an engineer glancing at a dashboard during a shift handoff. Cross-environment connection anomalies are machine-detectable within seconds and should generate automated alerts.
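The symbolic-reference pattern from the first lesson can be sketched with External Secrets Operator: the values file carries only a secret name, and copying that reference into staging fails to resolve because the staging SecretStore has no access to production paths. All names, namespaces, and Secrets Manager paths below are assumptions for illustration.

```yaml
# Sketch: the app reads POSTGRES_DSN from the generated Kubernetes Secret
# instead of a cleartext values entry; only the reference lives in Git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: storefront-db-dsn
  namespace: storefront-production     # assumed namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager          # per-environment store; the staging
    kind: SecretStore                  # store cannot read prod/ paths
  target:
    name: storefront-db-credentials    # Secret the pods mount or env-map
  data:
    - secretKey: POSTGRES_DSN
      remoteRef:
        key: prod/storefront/postgres-dsn   # assumed Secrets Manager path
```

Under this scheme, the 2025-04-21 copy-paste would have produced staging pods that fail to start with a missing-secret error, surfacing the mistake at 20:00 UTC instead of 03:12.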
Cross-References¶
- Failure Pattern: Configuration Leak — Staging Inherits Production Credentials; Destructive Test Without Environment Guard
- Topic Packs: Kubernetes Networking and NetworkPolicy; Helm Values Management; External Secrets Operator; Database PITR
- Runbook: `runbooks/database/pitr-restore.md`; `runbooks/deployments/helm-values-audit.md`
- Decision Tree: Unexpected Production Writes → Check `pg_stat_activity` Source → Identify Application User → Kill Connection → Scope Damage → PITR vs Row-Level Restore