Runbook: TLS Certificate Expiry

| Field | Value |
|---|---|
| Domain | Networking |
| Alert | ssl_certificate_expiry_seconds < 604800 (7 days) or TLS handshake failures in service logs |
| Severity | P1 (if certificate already expired), P2 (if expiring within 7 days) |
| Est. Resolution Time | 30-60 minutes |
| Escalation Timeout | 30 minutes (page if not resolved) |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cert-manager installed (or ability to manage TLS secrets manually), ingress controller access |
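
The alert threshold and severity mapping in the table above can be written as a small shell helper. This is a minimal sketch: the function name classify_severity is hypothetical, and it assumes you already have the remaining lifetime in seconds (for example from the ssl_certificate_expiry_seconds metric).

```shell
# classify_severity SECONDS_REMAINING
# Maps remaining certificate lifetime to the severity levels in this runbook.
# Hypothetical helper; the 604800 threshold comes from the alert definition.
classify_severity() {
  remaining=$1
  if [ "$remaining" -le 0 ]; then
    echo "P1"   # already expired
  elif [ "$remaining" -lt 604800 ]; then
    echo "P2"   # expires within 7 days
  else
    echo "OK"   # outside the alert window
  fi
}

# Example: classify_severity 86400   (one day of validity left)
```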

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get certificate -A
If output shows: READY=False on any certificate → cert-manager renewal has failed; continue from Step 2.
If output shows: all READY=True but alerts are still firing → the affected certificate is probably not managed by cert-manager; check the raw TLS secrets in Step 1.
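
On a cluster with many certificates, the failing rows can be filtered out of the listing. The helper below is a sketch, not part of cert-manager: the name not_ready_certificates is hypothetical, and it assumes the default column order of kubectl get certificate -A (NAMESPACE, NAME, READY, SECRET, AGE).

```shell
# not_ready_certificates
# Prints NAMESPACE/NAME for every Certificate whose READY column is not "True".
# Hypothetical helper; relies on the default kubectl column order.
not_ready_certificates() {
  kubectl get certificate -A --no-headers \
    | awk '$3 != "True" { print $1 "/" $2 }'
}

# Example: not_ready_certificates
```

No output means every Certificate reports READY=True.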

Step 1: Check Certificate Expiry Date

Why: Confirms whether the certificate is already expired (P1) or just expiring soon (P2) and which exact resources are affected.

# Print the expiry date (notAfter) of every TLS secret
kubectl get secrets -A --field-selector type=kubernetes.io/tls -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.data["tls.crt"])"' \
  | while read -r name crt; do
      echo "${name}: $(echo "$crt" | base64 -d | openssl x509 -noout -enddate)"
    done

# For a specific secret, decode and inspect the cert
kubectl get secret <SECRET_NAME> -n <NAMESPACE> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -dates -subject

# Or test the live endpoint directly
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
  | openssl x509 -noout -dates
Expected output:
notBefore=Jan  1 00:00:00 2026 GMT
notAfter=Apr  1 00:00:00 2026 GMT
subject=CN=myapp.example.com
If this fails: If notAfter is in the past, the cert is expired — this is now P1. Move quickly through Steps 2-4.
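
To turn the notAfter value into a number you can compare against the 7-day threshold, the date can be converted to days remaining. This is a sketch under two assumptions: GNU date is available (as on most Linux hosts), and the helper name days_until_expiry is hypothetical.

```shell
# days_until_expiry "NOT_AFTER" [NOW_EPOCH]
# Converts an openssl notAfter string into days remaining, truncated to a
# whole number. A result of 0 or less means the certificate expires within
# a day or has already expired. Hypothetical helper; requires GNU date.
days_until_expiry() {
  not_after=$1
  now=${2:-$(date +%s)}
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example, using the live endpoint check from this step:
#   end=$(echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
#           | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until_expiry "$end"
```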

Step 2: Check cert-manager Certificate Resource Status

Why: cert-manager's Certificate object shows why renewal failed — wrong issuer, DNS challenge errors, or rate limits.

# List all Certificate resources and their status
kubectl get certificate -A -o wide

# Describe the failing certificate for detailed status
kubectl describe certificate <CERTIFICATE_NAME> -n <NAMESPACE>
Expected output (healthy):
Status:
  Conditions:
    Last Transition Time:  2026-03-15T10:00:00Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2026-06-15T10:00:00Z
  Renewal Time:            2026-05-16T10:00:00Z
If this fails: Look for Reason: Failed or error messages in the Message field. Common causes: ACME challenge failed, issuer misconfigured, rate limited. Proceed to Step 3.
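
When the describe output is long, the Ready condition's message can be pulled out directly with kubectl's JSONPath output. A minimal sketch; the function name certificate_ready_message is hypothetical.

```shell
# certificate_ready_message CERTIFICATE_NAME NAMESPACE
# Prints only the Message field of the Certificate's Ready condition.
# Hypothetical helper around kubectl's JSONPath output format.
certificate_ready_message() {
  kubectl get certificate "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}{"\n"}'
}

# Example: certificate_ready_message <CERTIFICATE_NAME> <NAMESPACE>
```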

Step 3: Check ACME Challenge or Certificate Issuer Logs

Why: cert-manager logs and challenge resources reveal exactly where the renewal pipeline is broken.

# Check cert-manager controller logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100 | grep -i "error\|failed\|<CERTIFICATE_NAME>"

# Check for pending ACME challenges
kubectl get challenges -A
kubectl describe challenge <CHALLENGE_NAME> -n <NAMESPACE>

# Check the Issuer or ClusterIssuer being used
kubectl get clusterissuer <ISSUER_NAME> -o yaml
kubectl describe clusterissuer <ISSUER_NAME>
Expected output:
# No challenges should be in Pending/Error state for more than a few minutes
No resources found.
If this fails: If challenges are stuck, the ACME DNS or HTTP solver cannot complete. Check that the ingress for the challenge path is reachable, or that DNS records are correct. If rate limited (429), wait or use a staging issuer.
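
For HTTP-01 challenges, reachability of the challenge path can be checked from outside the cluster. The sketch below assumes the standard ACME path /.well-known/acme-challenge/<TOKEN>, with the token taken from the Challenge resource; the function name http01_status is hypothetical.

```shell
# http01_status HOSTNAME TOKEN
# Prints the HTTP status code returned for an HTTP-01 challenge URL.
# Hypothetical helper; TOKEN is the token listed in the Challenge spec.
# A 200 suggests the solver path is reachable; anything else points at
# ingress routing, DNS, or a firewall in front of port 80.
http01_status() {
  curl -sS -o /dev/null --max-time 10 -w '%{http_code}\n' \
    "http://$1/.well-known/acme-challenge/$2"
}

# Example: http01_status <HOSTNAME> <TOKEN>
```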

Step 4: Manually Trigger Renewal

Why: cert-manager will auto-renew but sometimes needs a nudge — deleting the CertificateRequest forces a fresh attempt.

# Delete the current CertificateRequest to force a new one
kubectl get certificaterequest -n <NAMESPACE>
kubectl delete certificaterequest <CERTIFICATEREQUEST_NAME> -n <NAMESPACE>

# Alternatively, trigger a renewal with cmctl (the cert-manager CLI, if installed)
cmctl renew <CERTIFICATE_NAME> -n <NAMESPACE>

# Watch the Certificate status for renewal
kubectl get certificate <CERTIFICATE_NAME> -n <NAMESPACE> -w
Expected output:
NAME          READY   SECRET          AGE
my-cert       False   my-tls-secret   5m
my-cert       True    my-tls-secret   7m
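If this fails: If the certificate stays READY=False for more than 5 minutes, check the cert-manager logs and issuer status again (Step 3). For an already-expired cert with a broken issuer, replace the TLS secret manually with a certificate issued out of band, then continue with Step 5 to verify it.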
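
The manual secret replacement mentioned above can be sketched as follows, assuming you already hold a valid certificate chain and private key as PEM files obtained outside cert-manager. The function name replace_tls_secret and the file paths are hypothetical.

```shell
# replace_tls_secret SECRET_NAME NAMESPACE CERT_FILE KEY_FILE
# Creates or overwrites a kubernetes.io/tls secret from PEM files on disk.
# Hypothetical helper; CERT_FILE should hold the full chain, leaf first.
replace_tls_secret() {
  kubectl create secret tls "$1" -n "$2" \
    --cert="$3" --key="$4" \
    --dry-run=client -o yaml \
    | kubectl apply -n "$2" -f -
}

# Example: replace_tls_secret <TLS_SECRET_NAME> <NAMESPACE> ./fullchain.pem ./privkey.pem
```

Treat this as a stopgap: if a Certificate resource still targets the same secret, cert-manager may replace its contents again once issuance recovers.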

Step 5: Verify New Certificate Is Installed

Why: Even after renewal, the secret must contain the new cert and the ingress controller must have loaded it.

# Check the TLS secret was updated recently
kubectl get secret <TLS_SECRET_NAME> -n <NAMESPACE>

# Verify the new expiry date
kubectl get secret <TLS_SECRET_NAME> -n <NAMESPACE> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Check ingress is using the right secret
kubectl get ingress <INGRESS_NAME> -n <NAMESPACE> -o yaml | grep -A5 tls:
Expected output:
notAfter=Jun 18 12:00:00 2026 GMT
If this fails: If the secret still shows the old expiry, the renewal did not write back correctly. Check cert-manager permissions on the secret (RBAC). If the ingress controller has cached the old cert, restart it: kubectl rollout restart deployment/<INGRESS_CONTROLLER_DEPLOYMENT> -n <INGRESS_NAMESPACE>.
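
One way to tell whether the ingress controller is still serving a cached certificate is to compare the fingerprint of the certificate in the secret with the one presented on the wire. This is a sketch built from the commands already used in this runbook; both function names are hypothetical.

```shell
# cert_fingerprint
# Reads a PEM certificate on stdin and prints its SHA-256 fingerprint.
cert_fingerprint() {
  openssl x509 -noout -fingerprint -sha256
}

# served_cert_matches_secret HOSTNAME SECRET_NAME NAMESPACE
# Returns 0 when the endpoint serves the same leaf certificate that is
# stored in the secret, non-zero otherwise. Hypothetical helper.
served_cert_matches_secret() {
  in_secret=$(kubectl get secret "$2" -n "$3" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | cert_fingerprint)
  served=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | cert_fingerprint)
  [ -n "$in_secret" ] && [ "$in_secret" = "$served" ]
}

# Example:
#   served_cert_matches_secret <HOSTNAME> <TLS_SECRET_NAME> <NAMESPACE> \
#     || echo "endpoint is not serving the certificate from the secret"
```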

Step 6: Test TLS Handshake

Why: Confirms that the new certificate is being served to clients and the chain is valid.

# Test from outside the cluster (replace with actual hostname)
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
  | openssl x509 -noout -dates -issuer -subject

# Check certificate chain validity
curl -v https://<HOSTNAME>/ 2>&1 | grep -E "SSL|TLS|certificate|expire|verify"
Expected output:
subject=CN=myapp.example.com
issuer=CN=Let's Encrypt
notAfter=Jun 18 12:00:00 2026 GMT
* SSL certificate verify ok.
If this fails: If the browser or curl still shows expired cert, the CDN or load balancer may be caching the old cert. Purge the CDN cache or check LB SSL policy.

Verification

# Confirm the issue is resolved — check live TLS and cert-manager status
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
  | openssl x509 -noout -checkend 604800
Success looks like: the command prints "Certificate will not expire" and exits with code 0, and every row of kubectl get certificate -A shows READY=True.
If still broken: escalate (see below).
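
The verification pipeline above exits non-zero both when the certificate is about to expire and when no certificate could be fetched at all. The sketch below separates the two cases; the function name verify_endpoint_cert and its exit codes are hypothetical conventions, not openssl behavior.

```shell
# verify_endpoint_cert HOSTNAME [MIN_SECONDS]
# Exit 0: certificate is valid for at least MIN_SECONDS (default 604800).
# Exit 1: certificate expires sooner than that (or is already expired).
# Exit 2: no certificate was received from the endpoint.
verify_endpoint_cert() {
  pem=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null)
  case "$pem" in
    *"BEGIN CERTIFICATE"*) ;;
    *) echo "no certificate received from $1" >&2; return 2 ;;
  esac
  printf '%s\n' "$pem" | openssl x509 -noout -checkend "${2:-604800}" >/dev/null
}

# Example: verify_endpoint_cert <HOSTNAME> && echo "certificate OK"
```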

Escalation

| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Security on-call | "TLS certificate expired on <HOSTNAME>, traffic is being rejected, cert-manager renewal failing" |
| Data loss suspected | Security team lead | "Possible MITM window during cert expiry on <HOSTNAME>, need security review" |
| Scope expanding to multiple services | SRE lead | "Multiple TLS certs expired or failing renewal, possible cert-manager or issuer outage" |

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete

Common Mistakes

  1. Restarting pods instead of renewing the cert: Restarting application pods does nothing for certificate expiry. The TLS secret is what needs updating, not the app.
  2. Not checking issuer configuration: cert-manager renewal can fail silently if the ClusterIssuer has a stale token or points at the wrong ACME server. Always check issuer status, not just certificate status.
  3. Forgetting to check all ingresses: A hostname may be used by multiple ingresses in different namespaces, each with its own TLS secret. Run kubectl get ingress -A to find all affected ingresses.
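
For the third mistake, the search across namespaces can be narrowed to one hostname. A minimal sketch using kubectl's JSONPath output; the function name ingresses_for_host is hypothetical, and it matches hostnames exactly, so wildcard entries such as *.example.com are not expanded.

```shell
# ingresses_for_host HOSTNAME
# Prints NAMESPACE/NAME and the TLS secret name(s) of every ingress whose
# TLS section lists HOSTNAME exactly. Hypothetical helper.
ingresses_for_host() {
  kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.tls[*].secretName}{"\t"}{.spec.tls[*].hosts[*]}{"\n"}{end}' \
    | awk -F '\t' -v host="$1" '{
        n = split($3, hosts, " ")
        for (i = 1; i <= n; i++) if (hosts[i] == host) { print $1 "\t" $2; break }
      }'
}

# Example: ingresses_for_host <HOSTNAME>
```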
