Scenario: TLS Certificate Expired
The Prompt
"Our production API started returning TLS errors 30 minutes ago. All browsers show 'Your connection is not private' and API clients are getting 'certificate has expired.' What do you do?"
Initial Report
On-call page: "grokdevops.example.com TLS cert expired. All HTTPS traffic is failing. cert-manager is installed but renewal apparently failed."
Constraints
- Time pressure: Every minute costs revenue - all API clients are failing.
- Limited access: You have kubectl access but cannot directly modify DNS or cloud load balancer settings.
Observable Evidence
- Browser: NET::ERR_CERT_DATE_INVALID
- curl: "curl: (60) SSL certificate problem: certificate has expired"
- cert-manager: kubectl get certificate -n grokdevops shows READY: False
- Ingress: TLS secret exists but contains the expired cert
Expected Investigation Path
# 1. Confirm the cert is expired
kubectl get secret grokdevops-tls -n grokdevops -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -noout -dates
# 2. Check cert-manager Certificate status
kubectl describe certificate grokdevops-tls -n grokdevops
# 3. Check CertificateRequest
kubectl get certificaterequest -n grokdevops
kubectl describe certificaterequest <name> -n grokdevops
# 4. Check ACME challenges (if Let's Encrypt)
kubectl get challenges -A
# 5. Check cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager --tail=100 | grep grokdevops
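# 6. Force renewal (newer cert-manager releases ship this as the standalone
#    cmctl CLI: cmctl renew grokdevops-tls -n grokdevops)
kubectl cert-manager renew grokdevops-tls -n grokdevops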
# 7. If renewal still fails, check the Issuer
kubectl describe clusterissuer letsencrypt-prod
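Step 1 above prints the validity dates and leaves the comparison to the reader. As a minimal sketch, the check can be made scriptable with openssl's -checkend option. The function name cert_expires_within is hypothetical, and the sketch assumes openssl is available on the machine running kubectl.

```shell
# Return 0 if the PEM certificate on stdin expires within $1 seconds,
# 1 if it stays valid past that window, 2 if stdin is not a parseable cert.
# (cert_expires_within is an illustrative name, not part of any tool.)
cert_expires_within() {
  local pem
  pem=$(cat)
  # Reject garbage first so a parse error is not mistaken for "expired".
  printf '%s\n' "$pem" | openssl x509 -noout >/dev/null 2>&1 || return 2
  # -checkend exits 0 when the cert is still valid after N seconds, so negate it.
  ! printf '%s\n' "$pem" | openssl x509 -noout -checkend "$1" >/dev/null
}

# Example against the secret from step 1 (a window of 0 means "already expired"):
# kubectl get secret grokdevops-tls -n grokdevops -o jsonpath='{.data.tls\.crt}' \
#   | base64 -d | cert_expires_within 0 && echo "certificate has expired"
```

The same function can confirm the fix after step 6: once renewal succeeds, a window of 0 should return 1 for the refreshed secret.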
Root Cause Possibilities
- ACME challenge failure: the HTTP-01 validation request couldn't reach the solver on port 80 (firewall change, ingress misconfigured)
- Rate limited: a Let's Encrypt rate limit was hit (for example, the duplicate-certificate limit of 5 per exact set of hostnames per week)
- cert-manager down: the controller or webhook pods crashed and nobody noticed
- DNS change: the domain no longer points to the cluster
- Issuer misconfigured: the Issuer's secret (ACME account key) was deleted
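The first and fourth possibilities can be narrowed down from outside the cluster by requesting the HTTP-01 path directly. The sketch below is an assumption-laden illustration: acme_probe_url and http01_probe are hypothetical helper names, the default token is a made-up placeholder, and curl is assumed to be installed. The /.well-known/acme-challenge/ path comes from the ACME specification (RFC 8555).

```shell
# Build the URL an ACME server would fetch during HTTP-01 validation.
# $1 = hostname, $2 = token (defaults to a placeholder, not a real token).
acme_probe_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "${2:-reachability-probe}"
}

# Print the HTTP status code for the probe URL.
# curl reports "000" when it could not connect at all (timeout, refused, DNS).
http01_probe() {
  curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$(acme_probe_url "$@")"
}

# Example, run from a host outside the cluster:
# http01_probe grokdevops.example.com
```

With a placeholder token, a 404 is the likely result on a healthy path and still suggests that port 80 reaches an HTTP server. A 000 points toward a firewall or DNS problem instead.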
What a Strong Answer Includes
- Immediate check of cert expiry with openssl
- Systematic debugging through cert-manager's CRD chain (Certificate -> CertificateRequest -> Order -> Challenge)
- Understanding of ACME challenge types
- Mention of monitoring: "We should have an alert on cert expiry"
- Post-incident: add an alert on certmanager_certificate_expiration_timestamp_seconds at a 14-day threshold
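One way the 14-day alert might look is sketched below, assuming the cluster runs Prometheus Operator (which provides the PrometheusRule CRD) and already scrapes cert-manager's metrics. The file name, rule name, namespace, "for" duration, and severity label are all illustrative choices, not values from this scenario.

```shell
# Write a PrometheusRule manifest that fires when any cert-manager Certificate
# has fewer than 14 days left. RULE_FILE is a hypothetical output path.
RULE_FILE="${RULE_FILE:-cert-expiry-alert.yaml}"

# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$RULE_FILE" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
  namespace: cert-manager
spec:
  groups:
    - name: cert-manager.certificates
      rules:
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
EOF

# Apply after review (requires the PrometheusRule CRD):
# kubectl apply -f "$RULE_FILE"
```

For 90-day Let's Encrypt certificates, cert-manager by default attempts renewal roughly 30 days before expiry, so a 14-day threshold should only fire after renewal has been failing for a while.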
Wiki Navigation
Related Content
- Case Study: BMC Clock Skew Cert Failure (Case Study, L2) — TLS & PKI
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — TLS & PKI
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — TLS & PKI
- Case Study: SSL Cert Chain Incomplete (Case Study, L1) — TLS & PKI
- Case Study: User Auth Failing — OIDC Cert Expired, Cloud KMS Rotation (Case Study, L2) — TLS & PKI
- Deep Dive: TLS Handshake (deep_dive, L2) — TLS & PKI
- HTTP Protocol (Topic Pack, L0) — TLS & PKI
- Networking Deep Dive (Topic Pack, L1) — TLS & PKI
- Nginx & Web Servers (Topic Pack, L1) — TLS & PKI
- Ops Archaeology: The Certificate That Works Sometimes (Case Study, L2) — TLS & PKI