
Incident Replay: SSL Certificate Chain Incomplete

Setup

  • System context: Production web server with a new TLS certificate. Some clients can connect fine, others get 'certificate verify failed' errors. Mobile clients are particularly affected.
  • Time: Monday 10:00 UTC
  • Your role: On-call SRE

Round 1: Alert Fires

[Pressure cue: "Customer reports: 'Your website works on my laptop but not my phone.' SSL Labs gives the site a B rating with 'Chain issues: Incomplete.'"]

What you see: The server sends the leaf certificate but not the intermediate CA certificate. Desktop browsers tend to succeed anyway, because they have the intermediate cached from visits to other sites (or can fetch it on the fly via the certificate's AIA extension). Mobile browsers and API clients that do neither fail verification.

Choose your action:

  • A) Apply a quick workaround to restore service
  • B) Investigate the root cause systematically
  • C) Escalate to the vendor or upstream provider
  • D) Check if a recent change caused the issue

If you chose A:

[Result: Workaround provides temporary relief but masks the underlying issue. You will need to circle back.]

If you chose B:

[Result: Systematic investigation reveals the root cause. When the new certificate was installed, only the leaf cert was added to nginx. The intermediate certificate was provided by the CA but not concatenated into the cert file. The nginx config points to a file with just the leaf cert. Proceed to Round 2.]

If you chose C:

[Result: Vendor/upstream confirms the issue is on your side. Time wasted on external coordination.]

If you chose D:

[Result: Change log review helps narrow the timeline but does not directly identify the technical cause. Partial progress.]
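The symptom in this round can be confirmed directly. A minimal sketch, assuming `openssl` is available on the investigating host; the hostname is a placeholder, and `count_certs` is a hypothetical helper for inspecting bundle files on the server:

```shell
# Inspect the chain the server actually sends (hostname is a placeholder):
#   openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null
# An incomplete chain shows only one "-----BEGIN CERTIFICATE-----" block.

# count_certs: hypothetical helper that counts PEM certificates in a file.
# A leaf-only bundle prints 1; a complete fullchain prints 2 or more.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}
```

Running the helper against the file named by nginx's ssl_certificate directive and seeing a count of 1 matches the leaf-only symptom described above.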

Round 2: First Triage Data

[Pressure cue: "Root cause identified. Apply the fix."]

What you see: When the new certificate was installed, only the leaf cert was added to nginx. The intermediate certificate was provided by the CA but not concatenated into the cert file. The nginx config points to a file with just the leaf cert.
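The broken state described above might look like this in the nginx server block; the paths and server name are assumptions for illustration, not taken from the incident:

```nginx
server {
    listen 443 ssl;
    server_name example.com;

    # Broken: the referenced file contains only the leaf certificate.
    # ssl_certificate     /etc/nginx/ssl/leaf.crt;

    # Fixed: one file holding the leaf followed by the intermediate(s).
    ssl_certificate     /etc/nginx/ssl/fullchain.crt;
    ssl_certificate_key /etc/nginx/ssl/privkey.key;
}
```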

Choose your action:

  • A) Apply the targeted fix
  • B) Apply the fix and verify with testing
  • C) Apply a broader fix that addresses the class of problem
  • D) Document and schedule the fix for the next maintenance window

If you chose B:

[Result: Concatenate the leaf cert + intermediate cert: cat leaf.crt intermediate.crt > fullchain.crt. Update nginx to use the fullchain. Reload nginx. Verify with openssl s_client -connect host:443 -showcerts. Service restored and verified. Proceed to Round 3.]

If you chose A:

[Result: Fix applied but not verified. May not be complete.]

If you chose C:

[Result: Broader fix is correct long-term but takes longer to implement during an incident.]

If you chose D:

[Result: Delaying the fix extends the outage or degradation. Apply now if possible.]
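The concatenate-and-verify fix from the B result above can be sketched as a script. `make_fullchain` is a hypothetical helper; the file paths, host, and reload command are assumptions to adapt to the actual environment:

```shell
# make_fullchain LEAF INTERMEDIATE OUT
# Order matters: the leaf must come first, then the intermediate(s).
make_fullchain() {
  cat "$1" "$2" > "$3"
}

# Intended use on the server (commented out; paths are placeholders):
#   make_fullchain leaf.crt intermediate.crt fullchain.crt
#   nginx -t && nginx -s reload
#   openssl s_client -connect host:443 -servername host -showcerts </dev/null
```

The final s_client call should now print both the leaf and the intermediate, confirming the chain is complete from a client's point of view.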

Round 3: Root Cause Identification

[Pressure cue: "Service restored. Document and prevent recurrence."]

What you see: Root cause confirmed: When the new certificate was installed, only the leaf cert was added to nginx. The intermediate certificate was provided by the CA but not concatenated into the cert file. The nginx config points to a file with just the leaf cert.

Choose your action:

  • A) Document the fix in the runbook
  • B) Add monitoring to detect this condition
  • C) Add the fix to automation/configuration management
  • D) All of the above

If you chose D:

[Result: Documentation, monitoring, and automation all updated. Defense in depth prevents recurrence. Proceed to Round 4.]

If you chose A:

[Result: Documentation helps but relies on humans remembering to check it.]

If you chose B:

[Result: Monitoring detects faster but does not prevent.]

If you chose C:

[Result: Automation prevents recurrence but needs monitoring for edge cases.]
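The monitoring option above could be sketched as a periodic on-host check. `check_bundle` is a hypothetical helper and the alerting hook is an assumption; a production probe would also exercise the live endpoint (e.g. with openssl s_client or an external scanner), since the on-disk file and the served chain can drift:

```shell
# check_bundle FILE: return 0 if FILE holds at least two PEM certificates
# (leaf + intermediate), return 1 if it looks leaf-only.
check_bundle() {
  n=$(grep -c -- '-----BEGIN CERTIFICATE-----' "$1")
  if [ "$n" -ge 2 ]; then
    return 0
  else
    echo "ALERT: $1 contains $n certificate(s); chain may be incomplete" >&2
    return 1
  fi
}
```

Wired into cron or the monitoring agent, a nonzero return would page before the next certificate rotation repeats the mistake.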

Round 4: Remediation

[Pressure cue: "Verify everything and close the incident."]

Actions:

  1. Verify service is functioning correctly end-to-end
  2. Verify monitoring detects the condition
  3. Update runbooks and configuration management
  4. Schedule post-mortem review
  5. Check for similar issues across the infrastructure
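Step 5, checking for similar issues, can be sketched as a scan over certificate bundles. `find_leaf_only` is a hypothetical helper; the directory argument and the *.crt glob are assumptions about where bundles live on each host:

```shell
# find_leaf_only DIR: print every *.crt under DIR that contains fewer
# than two PEM certificates, i.e. candidates for an incomplete chain.
find_leaf_only() {
  find "$1" -name '*.crt' | while IFS= read -r f; do
    if [ "$(grep -c -- '-----BEGIN CERTIFICATE-----' "$f")" -lt 2 ]; then
      echo "$f"
    fi
  done
}
```

Any path it prints deserves the same s_client verification used during the fix, since some single-cert files (e.g. self-signed roots) are legitimately leaf-only.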

Damage Report

  • Total downtime: Varies based on path chosen
  • Blast radius: Affected services and dependent systems
  • Optimal resolution time: 8 minutes
  • If every wrong choice was made: 60 minutes + additional damage

Cross-References