# Postmortem: Expired Wildcard TLS Certificate Causes Full API Gateway Outage
| Field | Value |
|---|---|
| ID | PM-002 |
| Date | 2025-07-09 |
| Severity | SEV-1 |
| Duration | 2h 49m (certificate expiry at 03:00 to full mobile recovery at 05:49) |
| Time to Detect | 6m |
| Time to Mitigate | 1h 7m to server-side fix (04:07); 2h 49m to full mobile recovery |
| Customer Impact | 100% of HTTPS API traffic returned TLS handshake errors from 03:00 to 04:09 UTC (1h 9m). Approximately 83,000 customers were unable to use web or mobile applications. Mobile clients experienced an additional degraded-recovery tail of up to 1h 40m after cert renewal due to TLS session caching. |
| Revenue Impact | ~$186,000 estimated (2.6h × avg $71k/h API-dependent transaction volume) |
| Teams Involved | Infrastructure Engineering, Platform Security, SRE, Mobile Engineering, Customer Success |
| Postmortem Author | Desiree Kamara (Senior SRE) |
| Postmortem Date | 2025-07-13 |
## Executive Summary
At 03:00 UTC on 2025-07-09, the wildcard TLS certificate for *.api.meridiancloud.io expired. No monitoring existed for certificate expiry. All inbound HTTPS traffic to the API gateway cluster returned TLS handshake failures, rendering the web application, mobile apps, and all third-party integrations completely non-functional. The certificate had been issued manually two years prior by an engineer who subsequently left the company, and no rotation reminder had been configured. The new certificate was issued and fully deployed by 04:07 UTC, 1h 7m after expiry, but mobile clients continued to experience errors for up to an additional 1h 40m due to OS-level TLS session caching. The incident occurred during low-traffic hours, which limited the peak customer count exposed.
## Timeline (All times UTC)
| Time | Event |
|---|---|
| 03:00 | Certificate *.api.meridiancloud.io expires (valid 2023-07-09 → 2025-07-09 03:00 UTC). No alert fires. No automated renewal. |
| 03:06 | Synthetic monitoring canary (Pingdom, HTTP check on api.meridiancloud.io/health) fails with SSL: CERTIFICATE_VERIFY_FAILED. PagerDuty fires: [CRITICAL] API health check failing. |
| 03:09 | On-call SRE (Kwame Asante) acknowledges alert. Runs curl -I https://api.meridiancloud.io/health. Receives: curl: (60) SSL certificate problem: certificate has expired. |
| 03:11 | Kwame checks Datadog — API gateway request volume drops to zero at 03:00 UTC. Pages Infrastructure Engineering on-call (Yuki Tanabe). SEV-1 declared. War room opened. |
| 03:14 | Yuki joins. Initial hypothesis: Nginx configuration change from evening deployment. Checks deployment log — last deployment was 19h ago. |
| 03:17 | Yuki runs openssl s_client -connect api.meridiancloud.io:443 2>&1 | grep -A5 'Certificate:'. Output confirms expiry date 2025-07-09. Root cause confirmed: expired cert. |
| 03:19 | Platform Security on-call (Fatima Al-Rashid) paged. Begins process for emergency certificate issuance via DigiCert portal. Discovers team does not have saved portal credentials — they were in the departed engineer's 1Password vault. |
| 03:24 | Fatima reaches DigiCert support via phone (emergency line). DigiCert confirms they can issue a validated wildcard cert in ~45 minutes via expedited DV. CSR generated. |
| 03:27 | Kwame drafts customer status page update. Leadership (VP Engineering, Naledi Dlamini) joins war room. |
| 03:31 | Status page updated: "We are aware of an issue affecting API access. Our team is actively working to resolve it." |
| 03:44 | DigiCert requests DNS TXT record for domain validation (_acme-challenge.meridiancloud.io). Yuki adds record to Route 53. Propagation begins. |
| 03:51 | DigiCert validation webhook fires. Certificate issued. Yuki downloads cert bundle (cert + chain + key). |
| 03:54 | Yuki updates Kubernetes TLS secret api-tls-cert in api-gateway namespace: kubectl create secret tls api-tls-cert --cert=wildcard.crt --key=wildcard.key --dry-run=client -o yaml | kubectl apply -f -. |
| 03:58 | Nginx ingress controller begins rolling pod restart to pick up new secret. 12 pods, ~30s each. |
| 04:07 | All Nginx ingress pods restarted. curl -I https://api.meridiancloud.io/health returns HTTP/2 200. Certificate confirmed valid through 2027-07-09. |
| 04:09 | Synthetic monitoring canary recovers. Datadog shows API request volume recovering. Web clients reconnecting successfully. |
| 04:09 | Status page updated: "API access has been restored. We are monitoring for full recovery." |
| 04:11 | Mobile Engineering on-call (Rodrigo Espinoza) reports iOS and Android users still receiving TLS errors despite cert renewal. |
| 04:13 | Investigation: iOS devices cache TLS sessions in the OS keychain. Cached session includes the expired cert's session ticket. Devices must establish a new TLS session before they see the renewed cert. |
| 04:18 | Mobile Engineering sends a push notification to all active mobile users: "Please close and reopen the app." Uptake is gradual — push delivery is not instantaneous. |
| 04:43 | Mobile error rates below 5%. Most devices have cycled TLS sessions. |
| 05:49 | Last mobile TLS errors resolve. Full recovery confirmed. Extended monitoring watch begins. |
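The certificate checks used at 03:17 (confirm expiry) and before the 03:54 deploy (confirm the key matches the cert) can be sketched against a throwaway self-signed certificate. All filenames and the CN below are stand-ins, not the production paths:

```shell
# Generate a short-lived self-signed wildcard cert to stand in for the renewed cert.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/wildcard.key -out /tmp/wildcard.crt \
  -days 1 -subj "/CN=*.api.example.com" 2>/dev/null

# 03:17-style check: print the notAfter date and fail if the cert is already expired.
openssl x509 -in /tmp/wildcard.crt -noout -enddate
openssl x509 -in /tmp/wildcard.crt -noout -checkend 0 && echo "cert is currently valid"

# Pre-deploy check: the cert's RSA modulus must match the private key's.
cert_mod=$(openssl x509 -in /tmp/wildcard.crt -noout -modulus | openssl md5)
key_mod=$(openssl rsa -in /tmp/wildcard.key -noout -modulus | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "key matches cert"
```

Against the live endpoint, the same `-enddate` check is fed from `openssl s_client -connect <host>:443` instead of a local file, as in the 03:17 timeline entry.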
## Impact
### Customer Impact
- 83,000 customers unable to access web or mobile applications from 03:00–05:49 UTC (2h 49m peak impact for mobile)
- API error rate: 100% from 03:00–04:09 UTC (1h 9m complete outage)
- Mobile clients: degraded recovery 04:09–05:49 UTC (additional 1h 40m for TLS session cycle-out)
- 14 third-party integration partners (webhook consumers) received TLS errors and queued or dropped events; 3 partners reported data sync gaps up to 2h 49m
- 4 enterprise customers filed SLA violation claims
### Internal Impact
- SRE: 2 engineers × 3h = 6 engineer-hours
- Infrastructure Engineering: 2 engineers × 3h = 6 engineer-hours
- Platform Security: 1 engineer × 2.5h = 2.5 engineer-hours
- Mobile Engineering: 1 engineer × 1.5h = 1.5 engineer-hours
- Customer Success: 5 agents × 2h = 10 agent-hours
- Planned infrastructure sprint items blocked for 1 day during incident review
### Data Impact
None. No data was lost or corrupted. All requests that failed during the outage window returned TLS errors before any application data was touched. Third-party integration partners that queued events retained those events in their own systems; re-sync was possible for all 14 partners.
## Root Cause
### What Happened (Technical)
The wildcard TLS certificate *.api.meridiancloud.io was issued on 2023-07-09 with a 2-year validity period, expiring 2025-07-09 at 00:00 UTC. It was provisioned manually by Elijah Nwosu, a Staff Infrastructure Engineer who left the company in 2024-02. The certificate was stored as a Kubernetes TLS secret and referenced by the Nginx ingress controller. No certificate lifecycle management tooling (such as cert-manager) was in place — cert-manager had been on the infrastructure roadmap since Q3 2024 but had not been implemented.
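For reference, a manually provisioned secret of this kind has the standard `kubernetes.io/tls` shape, roughly as follows (the base64 payloads are placeholders, not the real material):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: api-tls-cert
  namespace: api-gateway
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTi...   # base64-encoded PEM: leaf cert + chain (placeholder)
  tls.key: LS0tLS1CRUdJTi...   # base64-encoded PEM: private key (placeholder)
```

Nothing in this object records or enforces the expiry date; the ingress controller serves whatever is present, which is why expiry was invisible to the platform itself.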
The issuing engineer had set a personal calendar reminder for certificate rotation, but that calendar was associated with their company Google account, which was deprovisioned when they offboarded. No team-level or system-level reminder existed. The certificate expiry date was not present in any monitored configuration database, CMDB, or ticketing system.
At 03:00 UTC on 2025-07-09, the certificate expired. The Nginx ingress controller continued to serve the expired certificate without error or restart — this is expected behavior: Nginx does not self-restart or alert on cert expiry; it simply serves whatever is configured. All TLS clients performing standard certificate validation began rejecting connections immediately. The API gateway returned no HTTP-level error; the TLS handshake failed before any HTTP exchange occurred.
Detection was delayed 6 minutes because the Pingdom synthetic monitor checks on a 5-minute interval and the first check after expiry fired at 03:06 UTC. There were no proactive expiry monitors — no tool was checking certificate expiry dates in advance.
Mobile client recovery was slower than anticipated because both iOS (via the OS keychain) and Android (via the Conscrypt TLS provider) cache TLS session tickets. When the server presents a renewed certificate, existing TLS sessions are not automatically invalidated. Clients must either close the session (app close/reopen) or wait for the session ticket lifetime to expire (default on iOS: up to 24h). The push notification campaign drove app restarts, but push delivery is not synchronous and took approximately 1h 40m to reach the tail of the active user population.
### Contributing Factors
- No certificate expiry monitoring: No alerting existed for certificates approaching expiry. The team had discussed integrating a certificate monitoring tool (e.g., a Prometheus `ssl_expiry_seconds` exporter or the Datadog TLS check) but had not implemented it. The gap between discussion and implementation was over 6 months.
- Manual certificate management with no institutional tracking: The certificate was managed entirely by one individual whose offboarding did not include an infrastructure handoff checklist. No CMDB entry, no calendar event owned by a team account, no ticket with a due date. The institutional knowledge of the cert's expiry date left with the engineer.
- cert-manager deferred indefinitely: cert-manager was identified as the standard solution for automated certificate lifecycle management and was on the infrastructure roadmap. It was deprioritized three quarters in a row due to competing sprint commitments. Had it been implemented, the Let's Encrypt integration would have auto-renewed the certificate 30 days before expiry.
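Had cert-manager been in place, renewal would have been driven by a `Certificate` resource along these lines. This is a sketch: the ClusterIssuer name `letsencrypt-prod` is an assumed example, and `renewBefore: 720h` encodes the 30-day renewal window mentioned above:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls-cert
  namespace: api-gateway
spec:
  secretName: api-tls-cert          # replaces the manually created secret
  dnsNames:
    - "*.api.meridiancloud.io"
  issuerRef:
    name: letsencrypt-prod          # hypothetical ClusterIssuer name
    kind: ClusterIssuer
  renewBefore: 720h                 # renew 30 days before expiry
```

Note that a wildcard name requires the issuer to use a DNS-01 solver (e.g., against Route 53, which the team already uses for DNS).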
### What We Got Lucky About
- The outage began at 03:00 UTC — the lowest-traffic window of the week. At peak hours (13:00–18:00 UTC), concurrent active users are approximately 6× higher. The customer impact count would have been approximately 498,000 users rather than 83,000.
- No data was lost. TLS handshake failures are clean rejections at the transport layer — no partial HTTP requests reached the application, no database writes were partially committed, and no customer data was exposed.
## Detection
### How We Detected
The Pingdom synthetic monitoring canary — which performs a full HTTPS GET to api.meridiancloud.io/health every 5 minutes — fired its first failure alert at 03:06 UTC, 6 minutes after expiry. The alert text included SSL: CERTIFICATE_VERIFY_FAILED, which made root cause diagnosis straightforward. PagerDuty routed the alert to the SRE on-call within seconds.
### Why We Didn't Detect Sooner
There was no proactive monitoring for certificate expiry. A synthetic check is a reactive "is it broken now?" signal, not a predictive "this will break in N days" signal. A certificate expiry exporter checking expiry dates daily would have fired 30-, 14-, and 7-day warnings, giving ample time for scheduled rotation. The team discussed this on at least two occasions in team syncs but the monitoring gap was never formalized into a backlog item.
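A minimal sketch of the missing predictive signal, using `openssl -checkend` against a local certificate file (throwaway paths; a real deployment would run an exporter feeding the alerting pipeline rather than a one-off script):

```shell
# Generate a cert valid for only 7 days to stand in for a cert nearing expiry.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/short.key -out /tmp/short.crt \
  -days 7 -subj "/CN=*.api.example.com" 2>/dev/null

# -checkend exits non-zero if the cert expires within the given window.
warn_seconds=$((30 * 86400))   # 30-day warning window
if openssl x509 -in /tmp/short.crt -noout -checkend "$warn_seconds" >/dev/null; then
  echo "OK: more than 30 days of validity remain"
else
  echo "WARN: certificate expires within 30 days"
fi
```

Run daily against every cert in inventory, this is the "will break in N days" signal class that the synthetic HTTP check cannot provide.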
## Response
### What Went Well
- Root cause diagnosis was extremely fast. The `curl` and `openssl s_client` commands produced unambiguous output pointing directly to cert expiry. The on-call SRE had the root cause confirmed within 8 minutes of acknowledging the alert.
- The DigiCert emergency issuance process, while slow, was functional. The expedited DV validation took 27 minutes from CSR submission to cert delivery, which is within the documented SLA for emergency issuance.
- Generating the new secret manifest with `--dry-run=client -o yaml` before applying it was good practice and allowed Yuki to verify the manifest before committing the change.
- The status page was updated within 25 minutes of alert acknowledgment, which is within the 30-minute SLA for customer communication on SEV-1 incidents.
### What Went Poorly
- The team did not have saved DigiCert portal credentials accessible to on-call engineers. Credential retrieval required a phone call to DigiCert support and consumed 5 minutes during a time-critical incident. Emergency tooling credentials must be accessible in the team vault, not in an individual's personal vault.
- Mobile TLS session caching behavior was not understood or documented. The team expected that replacing the server certificate would immediately resolve errors for all clients. The 1h 40m tail of mobile errors could have been partially mitigated by proactively sending the push notification as soon as the certificate was renewed, rather than waiting for mobile error reports to surface.
- The departed engineer's offboarding did not include an infrastructure handoff. A formal offboarding checklist requiring infrastructure owners to document and transfer ownership of all certificates, secrets, and scheduled tasks would have caught this gap in early 2024.
## Action Items
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-001 | Install cert-manager in all production Kubernetes clusters; migrate api-tls-cert and all other manually managed certs to cert-manager-issued Let's Encrypt certs | P0 | Yuki Tanabe (Infrastructure Engineering) | In Progress | 2025-07-25 |
| AI-002 | Deploy Prometheus ssl_certificate_expiry_seconds exporter; alert at 30d, 14d, and 7d before any monitored cert expires; pipe to PagerDuty with 7d as SEV-2, <7d as SEV-1 | P0 | Desiree Kamara (SRE) | Not Started | 2025-07-21 |
| AI-003 | Store DigiCert portal credentials and emergency issuance runbook in team 1Password vault; require two-person access; verify credentials are current quarterly | P1 | Fatima Al-Rashid (Platform Security) | Not Started | 2025-07-18 |
| AI-004 | Add certificate inventory step to engineer offboarding checklist: enumerate all certs managed by departing engineer, assign new owner, create calendar event on team calendar | P1 | Platform Security + Engineering Manager (Hana Johansson) | Not Started | 2025-07-28 |
| AI-005 | Document mobile TLS session caching behavior in incident response runbook; add explicit step: "immediately send push notification to cycle mobile sessions" upon cert renewal | P2 | Rodrigo Espinoza (Mobile Engineering) | Not Started | 2025-07-28 |
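AI-002's thresholds could be expressed as Prometheus alerting rules along these lines. This is a sketch: the metric name is taken from the action item itself, and the `severity` label values are assumptions that would need to match the PagerDuty routing configuration:

```yaml
groups:
  - name: tls-certificate-expiry
    rules:
      - alert: TLSCertExpiring30d
        expr: ssl_certificate_expiry_seconds < 30 * 86400
        for: 1h
        labels:
          severity: sev2            # assumed routing label
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 30 days"
      - alert: TLSCertExpiring7d
        expr: ssl_certificate_expiry_seconds < 7 * 86400
        for: 15m
        labels:
          severity: sev1            # assumed routing label
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 7 days"
```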
## Lessons Learned
- Certificates managed by individuals are not managed at all. Any infrastructure resource — certificate, secret, scheduled task, or DNS record — that is owned by a named individual and not tracked in a team system is an orphan waiting to fail. The offboarding of the issuing engineer should have triggered a handoff process. It did not, because no such process existed. Ownership must live in team systems, not in people.
- Reactive monitoring cannot substitute for proactive expiry monitoring. A synthetic HTTP check tells you that something is broken. It cannot tell you that something will break in 14 days. These are different signal classes serving different purposes. Certificate expiry is a known, deterministic event — the exact failure time is printed in the cert. Failing to alert on it proactively is a process failure, not a monitoring gap that is hard to close.
- Mobile TLS session behavior must be part of the incident response mental model. Rotating a server certificate is not instantaneous for mobile clients. OS-level TLS session caching means a subset of users will continue to experience errors for minutes to hours after the server-side fix is complete. Any runbook for certificate rotation must include: (a) expected mobile recovery lag, (b) push notification as an active mitigation, and (c) a separate "mobile recovery" milestone distinct from "server-side fix complete."
## Cross-References
- Failure Pattern: Configuration — Certificate Lifecycle Not Managed; Knowledge Silo — Single-Person Ownership of Critical Infrastructure
- Topic Packs: tls-certificate-management, cert-manager-kubernetes, mobile-tls-behavior, incident-response-runbooks
- Runbook: `runbooks/security/tls-certificate-emergency-renewal.md` (to be created)
- Decision Tree: Triage path — "All HTTPS requests failing with TLS error" → immediately run `openssl s_client -connect <host>:443` → check the `notAfter` date → if expired, page Infrastructure Engineering and Platform Security simultaneously; do not wait for root cause confirmation to begin cert issuance