Portal | Level: L2: Operations | Topics: TLS & PKI | Domain: Security

Scenario: TLS Certificate Expired

The Prompt

"Our production API started returning TLS errors 30 minutes ago. All browsers show 'Your connection is not private' and API clients are getting 'certificate has expired.' What do you do?"

Initial Report

On-call page: "grokdevops.example.com TLS cert expired. All HTTPS traffic is failing. cert-manager is installed but renewal apparently failed."

Constraints

  • Time pressure: Every minute costs revenue - all API clients are failing.
  • Limited access: You have kubectl access but cannot directly modify DNS or cloud load balancer settings.

Observable Evidence

  • Browser: NET::ERR_CERT_DATE_INVALID
  • curl: curl: (60) SSL certificate problem: certificate has expired
  • cert-manager: kubectl get certificate -n grokdevops shows READY: False
  • Ingress: TLS secret exists but contains the expired cert

Expected Investigation Path

# 1. Confirm the cert is expired
kubectl get secret grokdevops-tls -n grokdevops -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -dates

# 2. Check cert-manager Certificate status
kubectl describe certificate grokdevops-tls -n grokdevops

# 3. Check CertificateRequest
kubectl get certificaterequest -n grokdevops
kubectl describe certificaterequest <name> -n grokdevops

# 4. Check ACME orders and challenges (if Let's Encrypt)
kubectl get orders -n grokdevops
kubectl get challenges -A

# 5. Check cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager --tail=100 | grep grokdevops

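# 6. Force renewal (on newer cert-manager releases the kubectl plugin is replaced by
#    the standalone CLI: cmctl renew grokdevops-tls -n grokdevops)
kubectl cert-manager renew grokdevops-tls -n grokdevops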

# 7. If renewal still fails, check the Issuer
kubectl describe clusterissuer letsencrypt-prod
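
Step 1 prints the validity dates for a human to read. For scripting, or for confirming after step 6 that the secret now holds a fresh certificate, `openssl x509 -checkend` gives a pass/fail answer instead. The following is a minimal sketch: the function name `cert_expires_within` is hypothetical, and it assumes `openssl` is on the PATH. If the input contains a chain, only the first certificate (normally the leaf) is checked.

```shell
# cert_expires_within SECONDS
# Reads one PEM certificate on stdin (hypothetical helper, not part of cert-manager).
# Exit status: 0 = expired or expiring within SECONDS,
#              1 = still valid SECONDS from now,
#              2 = stdin was not a parseable certificate.
cert_expires_within() {
  local seconds="$1" pem
  pem="$(cat)"
  # Reject garbage first, so a parse error is not mistaken for "expiring".
  printf '%s\n' "$pem" | openssl x509 -noout >/dev/null 2>&1 || return 2
  # -checkend exits 0 when the certificate is still valid SECONDS from now.
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend "$seconds" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage (same secret as step 1): flag a cert that expires within 14 days.
#   kubectl get secret grokdevops-tls -n grokdevops -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | cert_expires_within $((14 * 24 * 3600)) \
#     && echo "expired or expiring soon"
```

Passing `0` as the window answers the narrower question "is it expired right now?", which is the check that matters during the incident itself.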

Root Cause Possibilities

  1. ACME challenge failure: the HTTP-01 challenge couldn't reach port 80 (firewall change, misconfigured ingress)
  2. Rate limited: a Let's Encrypt rate limit was hit (for example, 5 certificates per week for the same exact set of hostnames, or 5 failed validations per hostname per hour)
  3. cert-manager down: the controller or webhook pods crashed and nobody noticed
  4. DNS change: the domain no longer points to the cluster
  5. Issuer misconfigured: the Issuer's secret (ACME account key) was deleted
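
Root cause 3 is usually the quickest to rule in or out, since it only needs a pod listing. The sketch below filters `kubectl get pods --no-headers` output for pods that are not fully ready. The function name `unready_pods` is hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...) and that cert-manager runs in the `cert-manager` namespace, as in step 5.

```shell
# unready_pods: hypothetical filter over `kubectl get pods --no-headers` output.
# Prints the name of every pod that is not Running with all containers ready.
# Pods in Completed state (finished Jobs) are ignored.
unready_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")   # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Usage: empty output means every cert-manager pod is up and ready.
#   kubectl get pods -n cert-manager --no-headers | unready_pods
```

Keeping the parsing separate from the `kubectl` call makes the filter easy to reuse against any namespace.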

What a Strong Answer Includes

  • Immediate check of cert expiry with openssl
  • Systematic debugging through cert-manager's CRD chain (Certificate -> CertificateRequest -> Order -> Challenge)
  • Understanding of ACME challenge types (HTTP-01 vs. DNS-01) and what each needs in order to succeed
  • Mention of monitoring: "We should have an alert on cert expiry"
  • Post-incident: add certmanager_certificate_expiration_timestamp_seconds alert at 14-day threshold
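
The post-incident alert can be expressed as a Prometheus alerting rule on the metric named above. This is a sketch under assumptions: the function name, group name, alert name, `for` duration, and severity label are all illustrative. It also assumes Prometheus already scrapes cert-manager's metrics endpoint, and it writes a plain Prometheus rule file rather than a Prometheus Operator `PrometheusRule` object.

```shell
# write_cert_expiry_rule FILE
# Writes a Prometheus rule file that fires when any cert-manager Certificate
# is within 14 days of expiry. Names and labels below are illustrative.
write_cert_expiry_rule() {
  # Quoted 'EOF' keeps the shell from expanding $labels inside the template.
  cat > "$1" <<'EOF'
groups:
  - name: cert-expiry            # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # The metric is a Unix timestamp; subtracting time() gives seconds left.
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
EOF
}

# Usage:
#   write_cert_expiry_rule cert-expiry.rules.yml
#   promtool check rules cert-expiry.rules.yml   # if promtool is available
```

Assuming 90-day certificates and cert-manager's default renewal timing (roughly 30 days before expiry), a 14-day threshold fires only after renewal has already been failing for about two weeks, which still leaves time to investigate before clients see errors.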
