Postmortem: Go Dependency Update Silently Changes Default Timeout — Caught in Canary¶
| Field | Value |
|---|---|
| ID | PM-028 |
| Date | 2025-07-15 |
| Severity | Near-Miss |
| Duration | 0m (no customer impact) |
| Time to Detect | 8m |
| Time to Mitigate | 12m (auto-rollback) |
| Customer Impact | None |
| Revenue Impact | None |
| Teams Involved | Backend Engineering, Platform Engineering, SRE |
| Postmortem Author | Aleksei Voronov |
| Postmortem Date | 2025-07-18 |
Executive Summary¶
On 2025-07-15 at 19:42 UTC, a canary deployment of the order-service at 5% traffic began showing an elevated API error rate driven by HTTP client timeouts on calls to the upstream inventory-service. The root cause was a minor version bump of the httpclient-go library (v1.4.2 to v1.5.0) which silently changed the default HTTP client timeout from 30 seconds to 5 seconds. The inventory-service has a p99 response latency of approximately 8 seconds under normal load, placing it well above the new default. The canary's automated rollback threshold (error rate > 2% for 5 consecutive minutes) triggered at 19:50 UTC, rolling back the canary pods before any traffic was promoted further. Had the canary step been removed — as the Backend Engineering team had proposed 2 weeks earlier in the name of deployment velocity — this change would have reached 100% of production traffic and caused roughly 12% of all API requests to fail.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 14:30 | Dependency PR #4417 merged: bumps httpclient-go from v1.4.2 to v1.5.0; PR description notes "minor bump, no breaking changes per semver"; reviewers approve without reading upstream changelog |
| 18:55 | CI pipeline completes; Docker image order-service:v3.47.0 built and pushed to registry |
| 19:15 | Deployment pipeline begins: staging deployment of order-service:v3.47.0 completes; staging smoke tests pass (staging inventory-service stub has <100ms response time, so 5s timeout is never triggered) |
| 19:42 | Canary deployment starts: 5% of production traffic routed to pods running v3.47.0 |
| 19:44 | First timeout errors appear in canary pods' logs: context deadline exceeded on calls to inventory-service |
| 19:46 | Error rate in canary cohort reaches 8.3%; PagerDuty alert fires: "order-service canary error rate > 2%" (SRE on-call: Jemima Thornton) |
| 19:48 | Jemima acknowledges alert; reviews canary dashboard in Grafana; sees timeout errors correlating precisely with inventory-service calls |
| 19:50 | Automated rollback threshold sustained for 5 minutes; Argo Rollouts initiates canary abort and scale-down |
| 19:54 | Canary pods fully drained; 100% of traffic on stable v3.46.3; error rate returns to baseline 0.1% |
| 19:58 | Jemima opens investigation; diffs v3.47.0 vs v3.46.3 go.sum and identifies httpclient-go version change |
| 20:11 | Aleksei Voronov (Backend Engineering lead) reads httpclient-go v1.5.0 changelog: "Changed default timeout to 5s for improved safety defaults" — listed under "improvements," not "breaking changes" |
| 20:35 | Hotfix PR #4419 opened: pins httpclient-go to v1.4.2, adds explicit Timeout: 30*time.Second in client initialization |
| 21:10 | Hotfix merged and deployed through full pipeline; canary completes without errors; full rollout proceeds |
| 21:55 | Incident closed; postmortem scheduled |
Impact¶
Customer Impact¶
None — the canary carried only 5% of traffic, and the automated rollback restored stable behavior before any user-visible degradation could persist. The canary error rate was 8.3% for approximately 8 minutes, but this affected only 5% of traffic, and those requests received error responses rather than partial data. Users who experienced an error during this window would have seen a transient checkout failure and could retry successfully.
Internal Impact¶
- Jemima Thornton (SRE on-call): ~1.5 hours (alert response, investigation, monitoring rollout of hotfix)
- Aleksei Voronov (Backend Engineering): ~2 hours (root cause analysis, hotfix PR, postmortem authorship)
- Lena Martínez (Backend Engineering, dependency PR author): ~1 hour (hotfix review, postmortem participation)
- Platform Engineering review of canary configuration: ~1 hour
- Total: approximately 5.5 engineering-hours
Data Impact¶
None. Timed-out requests returned errors to clients; no partial writes or data corruption occurred. The inventory-service was unaffected; requests simply failed on the client side before the upstream responded.
What Would Have Happened¶
If the canary step had been removed from the order-service deployment pipeline — as proposed in Backend Engineering's "Deploy Faster" initiative (proposal documented in RFC-0038, discussed but not adopted) — order-service:v3.47.0 would have been promoted directly to 100% of production traffic after the staging smoke tests passed. Staging uses a stubbed inventory-service with sub-100ms response times; the 5-second timeout is never triggered there. The regression would have reached production silently.
Under full production traffic, the 5-second timeout would have caused failures on all inventory-service calls with response times above 5 seconds. The inventory-service p99 latency is 8.2 seconds (driven by a known slow N+1 query that is tracked in backend-backlog but not yet fixed). Approximately 12% of order-service requests touch the inventory check path and wait for the slow query. This would have produced an immediate 12% API error rate on order placement — the core revenue-generating flow. Based on order volume and average order value, the revenue impact was estimated at $45,000 per hour by the Finance team's on-call analyst.
The symptoms would have been ambiguous at first: timeout errors on calls from order-service to inventory-service look identical to inventory-service being overloaded or down. SRE would likely have spent 20-40 minutes investigating the inventory service before correlating the incident with the deployment. A deployment rollback would then have been straightforward, but the mean time to mitigate from first alert to rollback complete would likely have been 35-60 minutes — representing $26K-$45K in lost revenue during the incident window, plus customer trust impact on the checkout flow.
Root Cause¶
What Happened (Technical)¶
The httpclient-go library is a thin wrapper around Go's net/http package used across several Helix services to standardize middleware (logging, tracing, retry logic). Version v1.5.0 was released by its maintainers on 2025-07-10 with the following changelog entry:
> Improvements: Changed default timeout from 30s to 5s to prevent hung goroutines in high-concurrency environments. Existing code that relies on the 30s default should set `Timeout` explicitly.
This is a behaviorally breaking change despite being released as a minor version, violating semver conventions. The library's maintainers classified it as an "improvement" and did not bump the major version or add a deprecation notice in the previous minor release.
The dependency bump PR (#4417) was opened by Lena Martínez as part of a routine weekly dependency update pass. The PR description was auto-generated by Dependabot and stated "Minor version bump — no breaking changes." Lena and the two reviewers did not consult the upstream changelog. The order-service test suite, which runs in CI, exercises the httpclient-go client against mock servers with zero-latency responses; the 5-second timeout was never approached.
The staging environment's inventory-service is a stub that returns hardcoded responses in under 10 milliseconds. The 5-second timeout regression is structurally invisible to staging. This is a class of regression — behavior that changes only under realistic latency conditions — that staging environments routinely fail to catch.
Contributing Factors¶
- Semver minor bump treated as safe by convention: Both automated tooling (Dependabot) and human reviewers operated under the assumption that a `v1.x.y → v1.(x+1).0` bump is non-breaking. This assumption is frequently wrong in practice, especially for libraries that control infrastructure behavior like timeouts, retries, and connection pooling. No process existed to require changelog review for minor bumps.
- Staging uses fast stubs, not realistic upstream simulators: The `inventory-service` stub used in staging was built for functional correctness testing, not for performance envelope testing. It cannot reproduce the p99 latency characteristics of the real upstream. Any regression that manifests only above a latency threshold will pass staging cleanly.
- HTTP client timeout was implicit rather than explicit: The `order-service` code relied on the library's default timeout rather than setting `Timeout` explicitly in the client initialization. Relying on library defaults for critical parameters (timeouts, retry counts, buffer sizes) creates invisible coupling to library version behavior.
What We Got Lucky About¶
- The canary step was in place and was not removed. Two weeks before this incident, Backend Engineering proposed removing the canary step from the `order-service` deployment pipeline (RFC-0038) to reduce deployment cycle time by approximately 25 minutes. The proposal was tabled after SRE raised concerns. If it had been adopted, the regression would have gone straight to 100% of production traffic. This near-miss is direct evidence for why the canary step exists.
- The slow upstream service (inventory-service) had a latency that fell neatly between the old and new timeouts. The p99 of 8.2 seconds is above the new 5-second default but well below the old 30-second default. If the upstream had been faster (p99 < 5s), the regression would have been silent even in the canary. If it had been slower (p99 > 30s), the issue would already have been visible before this incident. The latency profile made the regression detectable.
Detection¶
How We Detected¶
The canary deployment's automated error rate monitoring in Argo Rollouts triggered a PagerDuty alert when the canary cohort's error rate exceeded 2% for 5 consecutive minutes. SRE on-call Jemima Thornton reviewed the Grafana dashboard, identified the timeout errors, and confirmed correlation with the inventory-service call path. Automated rollback was already in progress when Jemima began her investigation.
Why This Almost Wasn't Caught¶
The regression is structurally invisible to the staging test environment. If the canary step had been removed (as proposed), no automated gate would have caught this before full production rollout. The only other detection path would have been an engineer manually reading the httpclient-go v1.5.0 changelog during PR review — a step that was not part of the dependency update process.
Response¶
What Went Well¶
- The automated canary rollback worked exactly as designed. No human intervention was required to roll back the bad deployment; the system detected and remediated the regression autonomously within the canary window.
- Root cause identification was fast (17 minutes from rollback to confirmed root cause) because the error messages (`context deadline exceeded`, with a stack trace pointing to the `httpclient-go` call site) made the failure mode clear, and the recent dependency bump was an obvious candidate.
What Could Have Gone Better¶
- The dependency bump PR had two approvals and zero changelog review. For libraries that own infrastructure-level behavior, "minor bump" does not mean "safe to apply without reading the diff." The review process had no explicit requirement to consult the changelog.
- The `order-service` used an implicit timeout default. Explicit configuration of all critical client parameters, with a comment explaining the chosen value, would have made this class of regression impossible: the timeout would have been pinned in application code regardless of library defaults.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| PM028-01 | Add lint rule (via `go vet` custom analyzer or golangci-lint) that flags `httpclient-go.NewClient()` calls without an explicit `Timeout` field set | P0 | Aleksei Voronov | In Progress | 2025-07-22 |
| PM028-02 | Update dependency update PR template to require: "(a) link to upstream changelog, (b) confirmation that no behavior defaults changed" for all library bumps touching networking, timeouts, or connection handling | P1 | Lena Martínez | Open | 2025-07-25 |
| PM028-03 | Close RFC-0038 (canary removal proposal) with documented rationale citing PM-028 | P0 | Platform Engineering | Completed | 2025-07-16 |
| PM028-04 | Add a latency-injection test in staging that simulates `inventory-service` p95/p99 latencies; include in pre-deployment gate | P1 | SRE | Open | 2025-08-05 |
| PM028-05 | Audit all services for implicit use of the `httpclient-go` default timeout; pin explicit values in each | P1 | Backend Engineering | Open | 2025-07-30 |
| PM028-06 | File upstream issue with `httpclient-go` maintainers requesting a proper semver major bump for the timeout default change | P2 | Aleksei Voronov | Open | 2025-07-22 |
Lessons Learned¶
- "Minor version" does not mean "no behavior change." Library maintainers frequently release behavior-changing defaults as minor or patch versions. Dependency review processes must explicitly check changelogs for behavioral changes, not defer to semver labels, especially for libraries that control infrastructure-level behavior like timeouts and connection settings.
- Implicit defaults are invisible coupling. When application code relies on a library's default value for a critical parameter, it is silently coupled to that library's versioned behavior. Explicit configuration — even if it matches the current default — documents the intent and prevents silent regression on upgrade.
- Canary deployments catch a specific class of regression that staging cannot. Regressions that manifest only under realistic traffic patterns, upstream latency distributions, or data volumes are structurally invisible to staging. The canary step is not a formality; it is the only automated gate that exposes this class of bug under real conditions.
Cross-References¶
- Failure Pattern: Silent Behavioral Change in Dependency Upgrade / Staging Environment Blind Spot
- Topic Packs: Progressive Delivery, Canary Deployments, Dependency Management, Go HTTP Client Configuration
- Runbook: DEPLOY-RB-005 — Canary Rollback Procedure; DEPLOY-RB-009 — Dependency Regression Investigation
- Decision Tree: Deployment Triage → Canary Error Rate Spike → Correlated with recent dependency bump? → Yes → Rollback, diff go.sum, audit changelog