Postmortem: Stale Docker Base Image Ships Known CVE to Production
| Field | Value |
|---|---|
| ID | PM-020 |
| Date | 2025-10-29 |
| Severity | SEV-3 |
| Duration | 14 days of exposure; 41m from discovery to verified removal from production (12m from Dockerfile change to deploy) |
| Time to Detect | 14 days (detected by weekly Trivy scan in staging, not CI) |
| Time to Mitigate | 12m (base image update, rebuild, push to production) |
| Customer Impact | None — CVE was present but not exploitable in this application's code paths |
| Revenue Impact | None — compliance audit finding opened; no fines or SLA penalties |
| Teams Involved | Platform Engineering, Security Engineering, Backend Engineering (Catalog Squad), DevOps |
| Postmortem Author | Simone Bertolucci |
| Postmortem Date | 2025-11-03 |
Executive Summary
On 2025-10-15, catalog-service was built and deployed to production using a Docker base image pinned to python:3.11.4-slim. CVE-2025-12345, a critical remote code execution vulnerability in libexpat (the XML parsing library), was published on 2025-10-15, the same day as the deploy and about two hours before the build. The vulnerability was not caught in the build pipeline because the CI system had no container image scanning step. The weekly Trivy scan running in the staging environment detected the CVE on 2025-10-29, 14 days after the vulnerable image entered production. The base image was updated to python:3.11.9-slim, and the service was rebuilt and redeployed within 12 minutes of the Dockerfile change. No evidence of exploitation was found; application-layer analysis confirmed that catalog-service never calls the vulnerable libexpat XML parsing code path. A compliance audit finding was opened and subsequently closed after remediation was documented.
Timeline (All times UTC)
| Time | Event |
|---|---|
| 2025-10-15 09:02 | CVE-2025-12345 published in NVD: critical (CVSS 9.8) remote code execution via heap buffer overflow in libexpat < 2.5.0 when parsing specific malformed XML entity sequences |
| 2025-10-15 11:14 | catalog-service v3.7.0 built in CI using python:3.11.4-slim as base image (pinned in Dockerfile); libexpat 2.4.9 present in image — vulnerable |
| 2025-10-15 11:22 | CI build passes all checks (unit tests, integration tests, SAST scan on Python source); no container scanning step exists |
| 2025-10-15 11:45 | catalog-service v3.7.0 deployed to production via standard CD pipeline; vulnerable image now serving production traffic |
| 2025-10-16 – 2025-10-28 | No detection signal. The CI pipeline has no container scanning step that could flag CVE-2025-12345, and production container scanning does not exist. The weekly staging Trivy scan runs on Wednesdays. |
| 2025-10-29 03:00 | Weekly Trivy scan runs against all staging container images |
| 2025-10-29 03:47 | Trivy scan completes; generates report including: catalog-service:3.7.0 — CVE-2025-12345 (CRITICAL) — libexpat 2.4.9 → fixed in 2.5.0 |
| 2025-10-29 08:15 | Security engineer Nadia Petrossian reads the Trivy report during morning triage; sees CRITICAL CVE in production-mirrored image |
| 2025-10-29 08:22 | Nadia confirms catalog-service v3.7.0 is also running in production (same image digest); pages Security Engineering lead Kwame Asante |
| 2025-10-29 08:25 | Kwame opens SEV-3 incident; pages Backend Engineering (Catalog Squad) lead Orla Brennan and DevOps on-call Simone Bertolucci |
| 2025-10-29 08:31 | Orla pulls CVE details: vulnerability requires application to call XML_Parse() or XML_ParseBuffer() with attacker-controlled input |
| 2025-10-29 08:36 | Orla confirms catalog-service does not use XML parsing anywhere in its codebase; primary data formats are JSON and Protobuf — CVE is present but not exploitable |
| 2025-10-29 08:40 | Decision: treat as high-urgency but not actively exploitable; remediate immediately via base image update; no emergency rollback needed |
| 2025-10-29 08:42 | Simone updates Dockerfile in catalog-service repo: FROM python:3.11.4-slim → FROM python:3.11.9-slim; opens PR |
| 2025-10-29 08:47 | PR approved by Orla after confirming no dependency breaks; CI pipeline triggered |
| 2025-10-29 08:50 | CI build completes; Trivy scan on new image confirms CVE-2025-12345 no longer present (libexpat 2.5.0 in python:3.11.9-slim) |
| 2025-10-29 08:54 | catalog-service v3.7.1 deployed to production |
| 2025-10-29 08:56 | Production deployment verified healthy; old pods terminated; CVE no longer present in production |
| 2025-10-29 09:10 | Kwame opens compliance audit finding in Jira (required by security policy for CRITICAL CVEs in production); documents 14-day exposure window and non-exploitability analysis |
| 2025-10-29 09:30 | Incident declared resolved; postmortem scheduled for 2025-11-03 |
| 2025-11-03 14:00 | Postmortem meeting held |
Impact
Customer Impact
None. CVE-2025-12345 requires an attacker to supply malformed XML input to the libexpat parsing functions XML_Parse() or XML_ParseBuffer(). catalog-service does not call these functions — its entire data exchange layer uses JSON (via the requests library and FastAPI's built-in JSON serialization) and Protobuf (via grpcio). Neither of these libraries exercises the vulnerable libexpat code path. An attacker with network access to catalog-service had no reachable vector to trigger the CVE. This was confirmed by Orla's code review and a subsequent automated static analysis run (CodeQL) that found zero call sites for XML parsing functions in the service.
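The call-site check described above can be approximated with a short script. This is a minimal sketch, not the CodeQL query the team ran: it walks a Python source file's syntax tree and reports direct imports of modules that parse XML. The module list is an assumption (standard-library modules understood to parse XML through Expat) and is not exhaustive, and the file names and sample sources are hypothetical.

```python
import ast

# Assumed, non-exhaustive list of standard-library module roots that parse XML
# through Expat; extend it for any third-party parsers used in the codebase.
XML_MODULE_ROOTS = {"xml", "xmlrpc", "plistlib"}


def find_xml_imports(source: str, filename: str = "<memory>") -> list[tuple[str, int, str]]:
    """Return (filename, line, module) for each direct import of an XML module."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            if module.split(".")[0] in XML_MODULE_ROOTS:
                hits.append((filename, node.lineno, module))
    return sorted(hits)


# Hypothetical sources standing in for files from a service repository.
json_only = "import json\nfrom fastapi import FastAPI\n"
uses_xml = "import json\nfrom xml.etree import ElementTree\nimport plistlib\n"

print(find_xml_imports(json_only, "handlers.py"))  # []
print(find_xml_imports(uses_xml, "feeds.py"))
# [('feeds.py', 2, 'xml.etree'), ('feeds.py', 3, 'plistlib')]
```

A check like this only sees direct, static imports in the files it is given. It does not follow transitive dependencies or dynamic imports, so an empty result is supporting evidence rather than proof of non-exploitability.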
Internal Impact
- 14 days of a CRITICAL CVE present in a production container image (even if non-exploitable, this is a compliance and audit finding)
- Nadia Petrossian: ~2 hours of triage, impact assessment, and compliance documentation
- Kwame Asante: ~2 hours of incident coordination, security review, and audit finding management
- Orla Brennan: ~1.5 hours of code analysis and remediation
- Simone Bertolucci: ~1 hour for Dockerfile update, pipeline, and deployment
- Compliance audit finding opened in Jira; required documentation of exposure window, exploitability analysis, and remediation evidence — estimated 4 additional hours of documentation work
- Security Engineering team's sprint was partially disrupted; 2 planned security review tasks were deferred to the next sprint
Data Impact
None. No evidence of exploitation. Cloudflare WAF logs, application access logs, and network flow logs for the 14-day exposure window were reviewed by the Security Engineering team; no anomalous XML-formatted requests were directed at catalog-service endpoints during this period.
Root Cause
What Happened (Technical)
The catalog-service Dockerfile pins its base image to a specific tag: FROM python:3.11.4-slim. This is a deliberate reproducibility choice: pinning to a specific tag is meant to give every CI build the same base layer, preventing "it worked yesterday" failures caused by upstream image changes. However, python:3.11.4-slim is not an immutable reference, because the image maintainers can re-push the tag so that it points to a different image. More importantly, by pinning to a specific version rather than a rolling tag (like python:3.11-slim), the team accepted the responsibility of manually updating the pin when upstream releases contain security fixes. No corresponding update policy or review cadence was documented.
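The three kinds of reference discussed here (digest, specific tag, rolling tag) can be told apart mechanically. The following is a minimal sketch under stated assumptions: the category names and the rule that a tag starting with three numeric components counts as a specific-version pin are illustrative choices, and the digest used in the example is a placeholder, not a real image digest.

```python
import re

# Matches "FROM [--flag=value ...] <image-ref>" at the start of a Dockerfile line.
FROM_LINE = re.compile(r"^\s*FROM\s+(?:--\S+\s+)*(?P<ref>\S+)", re.IGNORECASE | re.MULTILINE)
# Assumption: a tag beginning with three numeric components (3.11.4-slim) is a patch pin.
PATCH_TAG = re.compile(r"^\d+\.\d+\.\d+")


def classify_base_image(ref: str) -> str:
    """Classify an image reference by how tightly it is pinned."""
    if "@sha256:" in ref:
        return "digest-pinned"  # immutable: names exact image content
    name, _, tag = ref.rpartition(":")
    if not name or "/" in tag:
        # No tag: either no colon at all, or the colon belongs to a registry port.
        return "untagged"
    return "patch-tag" if PATCH_TAG.match(tag) else "rolling-tag"


def base_images(dockerfile_text: str) -> list[tuple[str, str]]:
    """Return (reference, classification) for every FROM line."""
    return [(m.group("ref"), classify_base_image(m.group("ref")))
            for m in FROM_LINE.finditer(dockerfile_text)]


print(base_images("FROM python:3.11.4-slim\nWORKDIR /app\n"))
# [('python:3.11.4-slim', 'patch-tag')]
print(classify_base_image("python:3.11-slim"))  # rolling-tag
print(classify_base_image("python:3.11.9-slim@sha256:" + "0" * 64))  # digest-pinned
```

A patch-tag result is the case from this incident: reproducible in practice, but it needs an explicit update policy because nothing moves the pin automatically.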
CVE-2025-12345 was published on the same day as the catalog-service v3.7.0 build. Even if CI had been scanning images at build time, this specific build might have escaped detection if the scanner's vulnerability database had not yet picked up the new entry when the scan ran. However, the absence of any CI scanning meant there was zero opportunity to catch CVEs that were known at build time. The NVD entry for CVE-2025-12345 was live by 09:02 UTC and the build ran at 11:14 UTC, a window of just over 2 hours in which a scanner with a freshly updated database could have caught it.
The staging Trivy scan was configured to run weekly rather than per-deploy. This was a cost optimization decision made in Q2 2025, when the security team estimated that weekly scans were sufficient given the team's manual code review practices. The decision was recorded in a Jira ticket but never revisited when the team's deployment frequency increased from approximately 3 deploys per week to 15–20 deploys per week in Q3 2025. The mismatch between scan frequency and deploy frequency created a detection window of up to 7 days even for the staging scan, and the production environment had no scanning at all.
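The timing figures in this section can be reproduced with a little arithmetic. The sketch below uses only timestamps and deploy rates from this postmortem; the helper name deploys_per_scan is hypothetical, and it assumes deploys are spread evenly across the week.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
cve_published = datetime.strptime("2025-10-15 09:02", FMT)
image_built = datetime.strptime("2025-10-15 11:14", FMT)
deployed = datetime.strptime("2025-10-15 11:45", FMT)
scan_reported = datetime.strptime("2025-10-29 03:47", FMT)

publish_to_build = image_built - cve_published  # time the CVE was public before the build
exposure = scan_reported - deployed             # time in production before the scan flagged it

print(publish_to_build)  # 2:12:00
print(exposure)          # 13 days, 16:02:00 (reported as 14 days)


def deploys_per_scan(deploys_per_week: float, scan_interval: timedelta) -> float:
    """Average number of deploys that ship between two consecutive scans."""
    return deploys_per_week * (scan_interval / timedelta(weeks=1))


weekly, daily = timedelta(weeks=1), timedelta(days=1)
print(deploys_per_scan(3, weekly))             # 3.0   (Q2 cadence, weekly scan)
print(deploys_per_scan(20, weekly))            # 20.0  (Q3 upper bound, weekly scan)
print(round(deploys_per_scan(20, daily), 2))   # 2.86  (Q3 upper bound, daily scan)
```

The last three lines show why the weekly schedule stopped fitting: the same scan interval went from covering about 3 unscanned deploys to as many as 20.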
Contributing Factors
- No container image scanning in the CI pipeline: The CI pipeline for catalog-service (and most other services) performs Python SAST (Bandit), dependency vulnerability scanning (Safety, for Python packages only), and unit/integration tests. It does not scan the full container image, including OS-level packages like libexpat, against CVE databases. Container scanning was listed on the security team's Q3 roadmap but was not implemented before the incident.
- Base image pinning without an update policy: Pinning python:3.11.4-slim provides build reproducibility but creates a policy gap: someone must decide when to update the pin, how often to check for upstream security releases, and what triggers an unscheduled update. None of this was documented. The image had been at 3.11.4 for 4 months at the time of the incident.
- Staging Trivy scan frequency did not match deploy frequency: The weekly scan was sized for a deployment cadence that no longer applied. A per-deploy scan (triggered as a CI step) or a daily scan on the staging registry would have reduced the maximum detection window from 7 days to a day or less.
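The missing CI control from the first factor can be sketched as a small gate over a scanner report. This is an illustration under stated assumptions, not the pipeline's actual implementation: it assumes a report in Trivy's JSON output layout (a top-level Results list whose entries carry a Vulnerabilities list), the sample report is hand-written from the finding in the timeline rather than real scan output, and the function name and exceptions parameter are hypothetical.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict, exceptions: frozenset[str] = frozenset()) -> list[dict]:
    """Return CRITICAL/HIGH vulnerabilities that have no documented exception."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING and vuln.get("VulnerabilityID") not in exceptions:
                findings.append(vuln)
    return findings


# Hand-written sample modeled on the finding in the timeline; not real scanner output.
sample = json.loads("""
{"Results": [{"Target": "catalog-service:3.7.0",
  "Vulnerabilities": [{"VulnerabilityID": "CVE-2025-12345", "PkgName": "libexpat",
    "InstalledVersion": "2.4.9", "FixedVersion": "2.5.0", "Severity": "CRITICAL"}]}]}
""")

findings = blocking_findings(sample)
exit_code = 1 if findings else 0  # a CI step would pass this to sys.exit()
for v in findings:
    print(f'{v["VulnerabilityID"]} {v["Severity"]} {v["PkgName"]} '
          f'{v["InstalledVersion"]} -> {v["FixedVersion"]}')
print("exit code:", exit_code)
```

For the simple case, Trivy's own --severity and --exit-code options can fail a build without any wrapper. A wrapper like this is mainly useful when exceptions need to be tracked and reviewed alongside the gate.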
What We Got Lucky About
- catalog-service happens not to use XML parsing. The CVE was critical and remotely exploitable in the general case, but the specific code paths it required were never reached in this application. Had the service been an XMPP gateway, a SOAP API client, or an RSS feed processor, the CVE would have been actively exploitable and the incident would have been a SEV-1 data breach investigation rather than a SEV-3 compliance finding.
- The compliance audit finding was opened and closed within the same sprint. The security team's non-exploitability analysis (code review + CodeQL) was accepted by the compliance team as sufficient evidence that no customer data was at risk, avoiding a protracted investigation.
- The base image update from 3.11.4-slim to 3.11.9-slim introduced no breaking changes. Python patch releases within the same minor series are generally backwards compatible, but there is always a risk of dependency breakage. The CI test suite passed cleanly on the new base image, and no runtime issues emerged after the production deploy.
Detection
How We Detected
Detection was by the weekly Trivy scan running in the staging environment, which scanned container images that mirror the production registry. Trivy's database was up to date at the time of the scan and correctly identified libexpat 2.4.9 in python:3.11.4-slim as matching CVE-2025-12345. The scan ran at 03:00 UTC and was reviewed by Nadia at 08:15 UTC during morning triage.
Why We Didn't Detect Sooner
Three compounding factors delayed detection by 14 days: (1) there was no CI-integrated container scanning step that would have caught the CVE on the day of the build; (2) the staging Trivy scan ran weekly rather than per-deploy or daily; and (3) there was no production-environment scanning of any kind. The staging scan was the last (and only) line of detection, and its weekly frequency determined the maximum detection window.
Response
What Went Well
- The end-to-end remediation time from Nadia's identification to the vulnerable image being removed from production was 41 minutes — this is an excellent response time for a container image CVE and reflects well-practiced CD pipeline capabilities.
- Orla's exploitability analysis (confirming no XML parsing call sites) was fast and thorough. The team used both manual code review and an automated CodeQL query, providing two independent confirmations.
- The new base image (3.11.9-slim) was verified clean by Trivy before being deployed: a Trivy step was added ad hoc to the CI pipeline for this build, previewing action item AI-020-01.
What Went Poorly
- 14 days is an unacceptable detection window for a CRITICAL CVE. The absence of CI-level container scanning is the primary gap.
- The base image pinning policy (or lack thereof) was a known risk that had been discussed in security reviews but never acted on. The gap between "identified risk" and "implemented control" was 4+ months.
- The weekly staging scan schedule was not updated when deployment frequency rose roughly fivefold in Q3 2025 (from about 3 to 15–20 deploys per week). Operational context changes (higher deploy frequency) should trigger a review of detection controls.
Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-020-01 | Add Trivy container image scan as a required CI step for all services; fail the build on CRITICAL or HIGH severity CVEs with no documented exception | Critical | Simone Bertolucci | Open | 2025-11-14 |
| AI-020-02 | Implement automated base image update policy: Dependabot or Renovate configured for all Dockerfile base image pins; auto-PR on upstream patch releases; require merge within 7 days for security releases | High | Platform Engineering | Open | 2025-11-21 |
| AI-020-03 | Change staging Trivy scan frequency from weekly to daily; add production registry scan (daily) against current running image digests | High | Nadia Petrossian | Open | 2025-11-14 |
| AI-020-04 | Define and document base image update SLA policy: CRITICAL CVE → update within 24h; HIGH CVE → update within 7 days; MEDIUM/LOW → update within next scheduled sprint | High | Kwame Asante | Open | 2025-11-10 |
| AI-020-05 | Close the compliance audit finding with documented evidence: exploitability analysis, remediation timeline, and new scanning controls | Medium | Kwame Asante | In Progress | 2025-11-07 |
| AI-020-06 | Evaluate image digest pinning (FROM python:3.11.9-slim@sha256:<digest>) instead of tag pinning for stronger reproducibility guarantees that survive tag mutation | Low | Simone Bertolucci | Open | 2025-11-28 |
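The SLA proposed in AI-020-04 can be written down as a small lookup. This is a sketch only: the policy itself is still to be defined, the function name is hypothetical, and it assumes the clock starts when the scanner reports the finding, which the action item does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional

# Windows taken from AI-020-04; MEDIUM and LOW have no fixed window
# because they are scheduled into the next sprint.
SLA = {"CRITICAL": timedelta(hours=24), "HIGH": timedelta(days=7)}


def update_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the base image update deadline, or None for next-sprint severities."""
    window = SLA.get(severity.upper())
    return detected_at + window if window else None


detected = datetime(2025, 10, 29, 3, 47)  # Trivy report time from the timeline
print(update_deadline("CRITICAL", detected))  # 2025-10-30 03:47:00
print(update_deadline("HIGH", detected))      # 2025-11-05 03:47:00
print(update_deadline("MEDIUM", detected))    # None
```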
Lessons Learned
- Reproducibility and security are both constraints on base image pinning, and they require explicit policy to coexist: Pinning to a specific image tag is good for reproducibility. It becomes a liability if there is no corresponding policy for when and how often to update the pin. The pin and the update cadence are a package deal; one without the other creates a false sense of security.
- Scan frequency must track deploy frequency: A weekly scan was sized for a world where the team deployed 3 times per week. When deployment frequency increased at least fivefold, the detection window didn't automatically shrink. It only shrinks if the scan is triggered per deploy or if scan frequency is explicitly re-evaluated when operational context changes. Treat scan frequency as a living operational decision, not a one-time configuration.
- "Not exploitable in our context" is a mitigating factor, not a justification for slow response: The non-exploitability of CVE-2025-12345 in catalog-service was confirmed after careful analysis. It reduced the urgency from "stop everything" to "fix within the hour." However, this analysis took 7 minutes of expert review, and it is not something that can be assumed without checking. Every CRITICAL CVE in a production image must go through an exploitability assessment, and that assessment takes time. The lesson is to prevent CVEs from reaching production, not to rely on fortuitous non-exploitability.
Cross-References
- Failure Pattern: Process gap / stale dependency; detection frequency mismatch; known risk without implemented control
- Topic Packs: Container security, CVE management, CI/CD pipeline security, Docker base image lifecycle, SBOM and vulnerability scanning
- Runbook: runbooks/security/cve-in-container-image-response.md
- Decision Tree: Triage → CVE detected in container image → confirm image is in production → exploitability analysis (does the app call the vulnerable code path?) → if exploitable: SEV-2 escalation; if not exploitable: SEV-3; patch base image → rebuild → redeploy → rescan to confirm remediation → open compliance finding if CRITICAL exposure > 24h
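The decision tree above can also be read as a small function. This is a minimal sketch, assuming the image has already been confirmed as running in production and that the remediation steps apply to both severity branches; the function name and step labels are hypothetical, and the runbook remains the authoritative procedure.

```python
from datetime import timedelta

REMEDIATION = ["patch base image", "rebuild", "redeploy", "rescan to confirm remediation"]


def triage_cve(exploitable: bool, cve_severity: str, exposure: timedelta) -> tuple[str, list[str]]:
    """Map the exploitability analysis to an incident severity and follow-up steps.

    Assumes the vulnerable image is confirmed to be running in production.
    """
    incident = "SEV-2" if exploitable else "SEV-3"
    steps = list(REMEDIATION)
    if cve_severity == "CRITICAL" and exposure > timedelta(hours=24):
        steps.append("open compliance finding")
    return incident, steps


# This incident: not exploitable, CRITICAL, roughly 14 days of exposure.
incident, steps = triage_cve(False, "CRITICAL", timedelta(days=14))
print(incident)   # SEV-3
print(steps[-1])  # open compliance finding
```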