
Postmortem: Debug Build Deployed to Production via Copy-Paste Error

Field Value
ID PM-006
Date 2025-03-11
Severity SEV-2
Duration 47m (detection to resolution)
Time to Detect 18m
Time to Mitigate 47m
Customer Impact ~4,200 active users experienced 3x latency increase (avg 1.8s → 5.6s); checkout flow degraded for 29 minutes
Revenue Impact ~$8,400 estimated (abandoned carts during latency window, 14:22–14:51 UTC)
Teams Involved Platform Engineering, Payments Backend, Security, Incident Command
Postmortem Author Danielle Howarth
Postmortem Date 2025-03-13

Executive Summary

On 2025-03-11, an engineer on the Platform Engineering team copy-pasted a Helm deployment command from an earlier debug session into a production deploy, inadvertently setting env.LOG_LEVEL=debug and env.TRACE_ALL=true on the payments-api service. Verbose logging caused /var/log on the affected nodes to fill within 30 minutes, driving I/O wait that tripled request latency across the payments checkout flow. Security review confirmed that PII, including user email addresses and session tokens, appeared in the debug log output. The incident was resolved before the logs were shipped to the centralized logging cluster, limiting the PII exposure blast radius to local disk.

Timeline (All times UTC)

Time Event
13:54 Tariq Osei begins debug session on staging payments-api, uses --set env.LOG_LEVEL=debug --set env.TRACE_ALL=true to chase a session expiry bug
14:02 Debug session concludes; Tariq moves on to production deploy for a separate, unrelated config change (connection pool size)
14:07 Tariq copy-pastes deploy command from terminal history; does not notice debug flags are still present
14:08 helm upgrade payments-api ./charts/payments-api --set config.pool_size=20 --set env.LOG_LEVEL=debug --set env.TRACE_ALL=true runs successfully in production
14:09 Rolling deploy completes; all 6 replicas running with debug logging enabled
14:11 Log volume on payments-api pods increases 40x; JSON log lines now include full request/response bodies, session tokens, and user email addresses
14:18 Synthetic monitoring detects P99 latency breach on /api/v2/checkout (threshold: 3s; observed: 4.1s); PagerDuty alert fires
14:20 On-call engineer Priya Nambiar acknowledges alert, begins investigation
14:22 Priya opens Datadog dashboard; sees CPU normal, memory normal, disk I/O wait spiking to 94% on payments nodes
14:26 Priya SSHs to payments-node-04, runs df -h; /var/log at 81% capacity, growing visibly
14:28 Priya checks kubectl describe pod payments-api-5f8b7d-xxz; notices LOG_LEVEL=debug in environment section
14:29 Priya pages Tariq and posts in #incidents: "payments-api is running with debug flags in prod, disk filling, investigating rollback"
14:31 Tariq joins call; confirms the copy-paste error; Incident Commander Danielle Howarth joins
14:33 Decision made: roll back Helm release rather than patch forward to avoid another deploy error
14:36 Security team (Marcus Delacroix) notified of potential PII in logs; begins log content review
14:38 Rollback command issued: helm rollback payments-api 1
14:41 All 6 replicas redeployed; LOG_LEVEL returns to info, TRACE_ALL absent
14:44 Disk I/O wait drops; P99 latency returns to baseline (~1.8s)
14:51 Marcus confirms: debug logs are on local disk only; centralized log shipper (Fluentd) has a 25-minute lag and had not yet ingested the debug log volume
14:55 Node disk logs purged via retention policy enforcement (logrotate -f); PII exposure window closed
15:10 All-clear declared; incident downgraded to post-incident review

Impact

Customer Impact

Approximately 4,200 concurrent users in the checkout funnel experienced latency increases of 3x (avg P50: 1.8s → 5.6s, P99: 3.1s → 9.8s) for a 29-minute window (14:22–14:51 UTC). Cart abandonment metrics show a 34% spike relative to the trailing average for Tuesdays. No checkout failures (HTTP 5xx) occurred; users experienced slowness rather than errors.

Internal Impact

Incident response consumed approximately 6 engineer-hours across Platform Engineering (Tariq, Priya), Security (Marcus), and Incident Command (Danielle). The planned 15:00 UTC deployment freeze for the quarterly billing release was pushed to the following day, requiring coordination with the Finance Engineering team.

Data Impact

PII (user email addresses, session tokens) was written to debug logs on 6 production Kubernetes nodes. The exposure was contained to local /var/log on those nodes and was purged before centralized shipping occurred. No evidence of external access to log files. Security team assessed this as a near-miss requiring a formal PII Incident Report per the company's data handling policy, even though exfiltration did not occur.

Root Cause

What Happened (Technical)

The payments-api Helm chart accepts arbitrary environment variables via --set env.* overrides. This is a common pattern in the organization for one-off config adjustments. There is no allowlist or denylist enforced on which environment variables can be set at deploy time, and no diff review is required for helm upgrade commands run manually from a developer's workstation.

When Tariq ran the production deploy, his shell history contained the full debug-session command from 13:54. The two commands were visually similar except for the --set config.pool_size=20 flag, and the debug flags appeared at the end where they were easy to miss in a long command line.
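One lightweight mitigation for this failure mode is a deploy wrapper that refuses to run any production command containing known debug flags. The sketch below is illustrative only (the function name and deny-list are assumptions, not the team's actual tooling):

```shell
#!/usr/bin/env bash
# Deny-list of flags that must never reach a production deploy.
# Illustrative; extend to match the chart's actual values surface.
FORBIDDEN_FLAGS=("env.LOG_LEVEL=debug" "env.TRACE_ALL=true")

# check_deploy_cmd CMD... -> returns 1 (and prints the offending flag)
# if the command line contains any forbidden flag, 0 otherwise.
check_deploy_cmd() {
  local cmd="$*"
  local flag
  for flag in "${FORBIDDEN_FLAGS[@]}"; do
    if [[ "$cmd" == *"$flag"* ]]; then
      echo "refusing to deploy: found forbidden flag '$flag'" >&2
      return 1
    fi
  done
  return 0
}
```

A wrapper would run `check_deploy_cmd "$@" && helm "$@"`, so a pasted command passes through the check before Helm ever sees it.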

The TRACE_ALL=true flag in payments-api enables structured logging of the full HTTP request and response, including headers (Authorization, Cookie) and body fields. Combined with LOG_LEVEL=debug, every transaction through the payments flow was written verbosely to stdout, which the container runtime directed to /var/log/containers/ on the host. At peak checkout traffic (~700 requests/s per replica), each log line averaged 4 KB. This produced approximately 2.8 MB/s of log data per replica, or ~16.8 MB/s across all 6 replicas, overwhelming the normal log rotation schedule.
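The back-of-envelope throughput math can be checked directly. This assumes the ~700 rps figure is per replica, which is the only reading consistent with the stated per-replica rate:

```shell
#!/usr/bin/env bash
# Log throughput estimate: requests/s per replica * KB per log line.
RPS_PER_REPLICA=700
LINE_KB=4
REPLICAS=6

PER_REPLICA_KBS=$(( RPS_PER_REPLICA * LINE_KB ))   # 2800 KB/s, i.e. ~2.8 MB/s
TOTAL_KBS=$(( PER_REPLICA_KBS * REPLICAS ))        # 16800 KB/s, i.e. ~16.8 MB/s

echo "per replica: ${PER_REPLICA_KBS} KB/s"
echo "all replicas: ${TOTAL_KBS} KB/s"
```

At ~16.8 MB/s aggregate, roughly 1 GB of log data lands on the nodes every minute, which is why /var/log filled faster than any hourly rotation schedule could cope with.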

Disk I/O saturation on the node caused write queues to back up. The payments-api application uses synchronous structured logging (zerolog with the default synchronous writer). With I/O wait at 94%, log writes began blocking request handler goroutines. Because the application has neither a log write timeout nor an async log buffer, those goroutines stalled waiting for log flushes, which manifested as the 3x request latency increase.

The PII exposure occurred because TRACE_ALL=true was designed for a sandboxed debug environment where no real user data flows. The flag's behavior was not documented in the Helm chart values file with a security warning, and no code review blocked its use in production.

Contributing Factors

  1. No environment-specific deploy protection in CI: Manual helm upgrade commands executed from developer workstations are not gated through the CI pipeline. Had the deploy gone through CI, the pipeline's environment variable diff check would have flagged LOG_LEVEL changing from info to debug. The bypass exists intentionally for "hotfix" scenarios but is used far more broadly in practice.

  2. Debug flags not gated behind feature flags or environment guards: TRACE_ALL and LOG_LEVEL=debug are plain environment variables with no runtime guard that prevents them from functioning in production namespaces. A guard checking ENV != production or requiring an additional secret to activate trace logging would have been a safety net.

  3. Log volume alerts set too high: The disk utilization alert threshold for payment nodes was 90% (PagerDuty) with a 5-minute evaluation window. By the time it would have fired, /var/log would have been completely full. The latency alert (synthetic monitoring) caught the incident at 81% disk usage, but only incidentally — it was not a log volume alert at all.
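The runtime guard described in factor 2 could take the form of an entrypoint wrapper that neutralizes trace flags outside sandboxed environments. This is a sketch under the assumption that the deployment environment is exposed via an ENV variable; the wrapper itself is hypothetical, the flag names follow the incident:

```shell
#!/usr/bin/env bash
# guard_trace_flags: neutralize debug/trace settings in production.
# Reads ENV, LOG_LEVEL, TRACE_ALL and exports sanitized values; the
# container entrypoint would call this before exec'ing the server.
guard_trace_flags() {
  if [[ "${ENV:-}" == "production" ]]; then
    if [[ "${TRACE_ALL:-false}" == "true" ]]; then
      echo "WARN: TRACE_ALL=true ignored in production" >&2
      export TRACE_ALL=false
    fi
    if [[ "${LOG_LEVEL:-info}" == "debug" ]]; then
      echo "WARN: LOG_LEVEL=debug downgraded to info in production" >&2
      export LOG_LEVEL=info
    fi
  fi
}
```

With this in place, a stray --set can no longer change logging behavior in the production namespace, while staging and sandbox environments keep full debug capability.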

What We Got Lucky About

  1. The centralized logging cluster (Elasticsearch via Fluentd) has a known 20–30 minute ingestion lag during peak hours. The debug logs were on local disk for only 25 minutes before the rollback and purge. Had the lag been shorter, PII would have been ingested and indexed in a system with broader access.
  2. The logrotate retention policy for /var/log/containers/ runs hourly and was due in 5 minutes at the time of purge. Even without manual intervention, the most sensitive lines would have rotated off. The manual purge simply ensured certainty.

Detection

How We Detected

Synthetic monitoring (Checkly) runs a simulated checkout flow every 60 seconds from three external regions. The P99 latency threshold (3s) was breached at 14:17; the alert fired at 14:18 after one confirmation interval. This was the primary detection mechanism.

Why We Didn't Detect Sooner

The deploy itself completed successfully with exit code 0 and no Helm diff review. There is no post-deploy smoke test that validates environment variable contents. The 10-minute gap between deploy (14:08) and the alert firing (14:18; acknowledged 14:20) reflects the time needed for all 6 replicas to complete their rolling restarts, accumulate log volume, and build disk I/O contention to the point of a visible latency breach. Earlier detection would have required either a deploy-time diff alert or a log volume alert with a lower threshold.
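A post-deploy smoke test of this kind does not need cluster access to be unit-tested: the check can operate on a dump of the pod environment (for example, the output of `kubectl exec <pod> -- env`). The function below is an illustrative sketch, not existing tooling:

```shell
#!/usr/bin/env bash
# verify_pod_env ENV_DUMP: fail if a production pod's environment
# contains debug logging settings. ENV_DUMP is the newline-separated
# KEY=VALUE output of `env` inside the container.
verify_pod_env() {
  local dump="$1"
  if grep -q '^LOG_LEVEL=debug$' <<<"$dump"; then
    echo "smoke test FAILED: LOG_LEVEL=debug in production pod" >&2
    return 1
  fi
  if grep -q '^TRACE_ALL=true$' <<<"$dump"; then
    echo "smoke test FAILED: TRACE_ALL=true in production pod" >&2
    return 1
  fi
  return 0
}
```

Run as the final step of the deploy pipeline, this would have flagged the bad release at 14:08 instead of waiting for latency symptoms at 14:18.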

Response

What Went Well

  1. On-call engineer Priya correctly identified disk I/O as the proximate cause within 8 minutes of acknowledging the alert, without getting distracted by CPU or memory (which were both normal). The structured investigation checklist in the on-call runbook guided the right sequence.
  2. The decision to roll back rather than patch forward was made quickly (2 minutes from root cause identification). Past incidents have shown that patch-forward deploys under pressure introduce secondary errors.
  3. Security was notified within 8 minutes of confirming debug flags were active. This was fast enough to assess the Fluentd lag and confirm no ingestion had occurred.

What Went Poorly

  1. There was no validation step between identifying the faulty deploy and executing the rollback. Danielle had to verbally confirm with Tariq that the previous Helm revision was safe to roll back to — this added 3 minutes of unnecessary delay. A "last known good" marker on Helm releases would eliminate this uncertainty.
  2. The PII exposure was not classified as a formal incident until 14:51, after the technical mitigation was already complete. The Security team should be looped in earlier (at first suspicion of sensitive data in logs) rather than after root cause is confirmed.
  3. The Slack #incidents channel notification at 14:29 did not auto-page the Security on-call. Marcus joined because he happened to be reading Slack. The incident management tool (PagerDuty) has a "security escalation" policy that was not triggered.
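The "last known good" marker from item 1 could be as simple as a status recorded against each release revision after post-deploy checks pass; selecting the rollback target then becomes pure text processing. A sketch, where the "verified" status label and history format are assumptions rather than an existing convention:

```shell
#!/usr/bin/env bash
# last_known_good HISTORY: given release history lines of the form
# "<revision> <status>" (oldest first), print the highest revision
# whose status is "verified", i.e. one that passed post-deploy checks.
last_known_good() {
  local history="$1" rev status best=""
  while read -r rev status; do
    [[ "$status" == "verified" ]] && best="$rev"
  done <<<"$history"
  if [[ -z "$best" ]]; then
    echo "no verified revision found" >&2
    return 1
  fi
  echo "$best"
}
```

During the incident, this would have replaced the 3-minute verbal confirmation with a single deterministic lookup before running helm rollback.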

Action Items

ID Action Priority Owner Status Due Date
AI-006-01 Add CI pipeline requirement for all production helm upgrade commands; remove workstation bypass except for break-glass scenarios P0 Tariq Osei Open 2025-03-21
AI-006-02 Add environment guard in payments-api: if TRACE_ALL=true and ENV=production, log a warning and disable the flag P0 Payments Backend Open 2025-03-18
AI-006-03 Lower disk utilization alert threshold for log volumes to 60%, with 2-minute evaluation window P1 Platform Engineering Open 2025-03-17
AI-006-04 Add PII-in-logs auto-escalation to Security on-call when any production pod runs with LOG_LEVEL=debug or TRACE_ALL=true P1 Security / Platform Open 2025-03-28
AI-006-05 Document TRACE_ALL flag in Helm values file with a # SECURITY: do not set in production comment and link to PII policy P2 Payments Backend Open 2025-03-21
AI-006-06 Add "last known good" Helm revision annotation to all production releases post-deploy, surfaced in runbook P2 Platform Engineering Open 2025-04-04
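The documentation change in AI-006-05 could look like the following values.yaml fragment. This is a sketch: the key names follow the incident's flags, while the comment wording and structure are assumptions:

```yaml
env:
  LOG_LEVEL: info
  # SECURITY: do not set TRACE_ALL in production. When true, full
  # request/response bodies -- including Authorization/Cookie headers,
  # session tokens, and user email addresses -- are written to logs.
  # Review the PII data handling policy before enabling anywhere.
  TRACE_ALL: "false"
```

Placing the warning directly above the key means anyone running `helm show values` or editing an override file sees it at the exact moment they could cause the exposure.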

Lessons Learned

  1. Terminal history is a deploy hazard: Long-lived shell sessions that span debug and production work create a class of copy-paste errors that are visually indistinguishable from correct commands. Structured deploy tooling (not raw helm upgrade) with mandatory argument review eliminates this class entirely.
  2. Permissive flag surfaces in Helm charts need security annotations: A flag that writes PII to logs when enabled is a security control, not just a debug convenience. Its values file entry should be treated like a secret — annotated, audited, and blocked from production by default.
  3. Latency detection is an unreliable proxy for log volume problems: The incident was caught by a latency alert, not a log alert. Log volume should be a first-class signal with its own alerting, separate from application health metrics, because log I/O contention can cause latency symptoms that mislead investigators toward application bugs.

Cross-References

  • Failure Pattern: Human Error — Copy-Paste Deploy; Debug Configuration in Production
  • Topic Packs: CI/CD Pipeline Hardening; Helm Chart Security; PII in Logs; On-Call Runbook Design
  • Runbook: runbooks/deployments/helm-rollback.md; runbooks/security/pii-in-logs-response.md
  • Decision Tree: Deploy → Unexpected Behavior → Check Pod Environment Variables → Check Disk I/O → Roll Back vs Patch Forward