Level: L2: Operations | Topics: TLS & PKI | Domain: Networking
Runbook: TLS Certificate Expiry
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | ssl_certificate_expiry_seconds < 604800 (7 days) or TLS handshake failures in service logs |
| Severity | P1 (if certificate already expired), P2 (if expiring within 7 days) |
| Est. Resolution Time | 30-60 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cert-manager installed (or ability to manage TLS secrets manually), ingress controller access |
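The alert threshold and severity rows above reduce to a small comparison: 604800 seconds is 7 × 24 × 60 × 60. A minimal sketch of that mapping follows; the function name `classify_severity` and the `OK` label are illustrative assumptions, not part of any alerting tool.

```shell
# Hypothetical helper: map seconds-until-expiry to the severity used in this runbook.
EXPIRY_THRESHOLD_SECONDS=604800   # 7 days * 24 h * 60 min * 60 s

classify_severity() {
  local remaining="$1"            # seconds until notAfter; zero or negative means expired
  if [ "$remaining" -le 0 ]; then
    echo "P1"                     # certificate already expired
  elif [ "$remaining" -lt "$EXPIRY_THRESHOLD_SECONDS" ]; then
    echo "P2"                     # expiring within 7 days
  else
    echo "OK"                     # outside the alert window (label is an assumption)
  fi
}
```

For example, `classify_severity 3600` prints `P2`, because one hour remaining is inside the 7-day window.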
Quick Assessment (30 seconds)
# List every cert-manager Certificate and its READY state
kubectl get certificate -A
If output shows: READY=False on any certificate → cert-manager renewal has failed; continue from Step 2.
If output shows: all READY=True but alerts are firing → the certificate is not managed by cert-manager; check the raw secrets in Step 1.
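On a cluster with many certificates, scanning the READY column by eye is slow. The sketch below filters the listing down to the failing rows. It assumes the default column order of `kubectl get certificate -A` (NAMESPACE, NAME, READY, SECRET, AGE); the function name `not_ready_certs` is hypothetical.

```shell
# Hypothetical helper: read `kubectl get certificate -A` output on stdin and
# print namespace/name for every row whose READY column is not "True".
# Assumes the default column order: NAMESPACE NAME READY SECRET AGE.
not_ready_certs() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}

# Usage (not executed here):
#   kubectl get certificate -A | not_ready_certs
```

An empty result corresponds to the second branch above: every managed certificate is ready, so the alert most likely comes from a secret that cert-manager does not manage.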
Step 1: Check Certificate Expiry Date
Why: Confirms whether the certificate is already expired (P1) or just expiring soon (P2) and which exact resources are affected.
# Print the expiry date (notAfter) of every TLS secret
kubectl get secrets -A --field-selector type=kubernetes.io/tls -o json \
| jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name) \(.data["tls.crt"])"' \
| while read -r ns name crt; do
    printf '%s/%s: ' "$ns" "$name"
    echo "$crt" | base64 -d 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null \
      || echo "no readable certificate"
  done
# For a specific secret, decode and inspect the cert
kubectl get secret <SECRET_NAME> -n <NAMESPACE> -o jsonpath='{.data.tls\.crt}' \
| base64 -d | openssl x509 -noout -dates -subject
# Or test the live endpoint directly
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
| openssl x509 -noout -dates
If notAfter is in the past, the cert is expired and this is now a P1. Move quickly through Steps 2-4.
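To turn a notAfter value into a number of days, the date can be converted to epoch seconds and subtracted from the current time. This is a minimal sketch assuming GNU `date` (the `-d` flag; BSD and macOS `date` take different options). The function name `days_until` and its optional second argument are illustrative.

```shell
# Hypothetical helper: whole days between a reference time and a notAfter value.
# Assumes GNU date for the -d flag.
days_until() {
  local not_after now_epoch end_epoch
  not_after="$1"                        # e.g. "Jun 18 12:00:00 2026 GMT"
  now_epoch="${2:-$(date -u +%s)}"      # optional override of "now" in epoch seconds
  end_epoch="$(date -u -d "$not_after" +%s)" || return 1
  # Integer division truncates toward zero; a negative result means already expired.
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage (not executed here):
#   days_until "$(openssl x509 -noout -enddate -in tls.crt | cut -d= -f2)"
```

A result below 7 matches the P2 window from the alert definition; a negative result means the certificate has been expired for at least a day.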
Step 2: Check cert-manager Certificate Resource Status
Why: cert-manager's Certificate object shows why renewal failed — wrong issuer, DNS challenge errors, or rate limits.
# List all Certificate resources and their status
kubectl get certificate -A -o wide
# Describe the failing certificate for detailed status
kubectl describe certificate <CERTIFICATE_NAME> -n <NAMESPACE>
Expected output (healthy certificate):
Status:
  Conditions:
    Last Transition Time:  2026-03-15T10:00:00Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2026-06-15T10:00:00Z
  Renewal Time:            2026-05-16T10:00:00Z
If the output shows Reason: Failed or an error in the Message field, renewal has failed. Common causes: ACME challenge failed, issuer misconfigured, rate limited. Proceed to Step 3.
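When only the Ready condition matters, a JSONPath filter avoids reading the full describe output. The sketch below prints the status, reason, and message of that condition on one line. The function name `cert_ready_condition` is hypothetical, and the field layout assumes the standard `status.conditions` list on the Certificate resource.

```shell
# Hypothetical helper: print "<status> <reason>: <message>" for the Ready
# condition of one Certificate.
cert_ready_condition() {
  local name="$1" namespace="$2"
  kubectl get certificate "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{" "}{.reason}{": "}{.message}{"\n"}{end}'
}

# Usage (not executed here):
#   cert_ready_condition <CERTIFICATE_NAME> <NAMESPACE>
```

The output is convenient to paste into an incident channel or an escalation message.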
Step 3: Check ACME Challenge or Certificate Issuer Logs
Why: cert-manager logs and challenge resources reveal exactly where the renewal pipeline is broken.
# Check cert-manager controller logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100 | grep -i "error\|failed\|<CERTIFICATE_NAME>"
# Check for pending ACME challenges
kubectl get challenges -A
kubectl describe challenge <CHALLENGE_NAME> -n <NAMESPACE>
# Check the Issuer or ClusterIssuer being used
kubectl get clusterissuer <ISSUER_NAME> -o yaml
kubectl describe clusterissuer <ISSUER_NAME>
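The issuer check can be reduced to a yes/no answer by reading the issuer's Ready condition directly. This is a sketch under the assumption that the certificate uses a ClusterIssuer (swap the resource kind and add a namespace for a namespaced Issuer). The function name `issuer_is_ready` is hypothetical.

```shell
# Hypothetical helper: succeed only if the ClusterIssuer reports Ready=True.
# Returns 2 if kubectl itself fails (for example, the issuer does not exist).
issuer_is_ready() {
  local issuer="$1" status
  status="$(kubectl get clusterissuer "$issuer" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" || return 2
  [ "$status" = "True" ]
}

# Usage (not executed here):
#   issuer_is_ready <ISSUER_NAME> || echo "issuer not ready, inspect it before retrying renewal"
```

If the issuer is not ready, forcing a renewal in the next step is unlikely to help until the issuer itself is fixed.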
Step 4: Manually Trigger Renewal
Why: cert-manager will auto-renew but sometimes needs a nudge — deleting the CertificateRequest forces a fresh attempt.
# Delete the current CertificateRequest to force a new one
kubectl get certificaterequest -n <NAMESPACE>
kubectl delete certificaterequest <CERTIFICATEREQUEST_NAME> -n <NAMESPACE>
# Alternatively, trigger an immediate renewal with cmctl (the cert-manager CLI, if installed)
cmctl renew <CERTIFICATE_NAME> -n <NAMESPACE>
# Watch the Certificate status for renewal
kubectl get certificate <CERTIFICATE_NAME> -n <NAMESPACE> -w
If READY stays False for more than 5 minutes, check the issuer logs again. For an already-expired cert with a broken issuer, replace the TLS secret manually with a certificate obtained from another source, then continue with Step 5.
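Manual replacement of a TLS secret can be sketched as follows, assuming a valid certificate and key are already available as local PEM files. The function name `replace_tls_secret` and its argument order are hypothetical. Note that cert-manager may later overwrite the secret if a Certificate resource still targets it.

```shell
# Hypothetical helper: overwrite a TLS secret from local PEM files.
# Assumes the certificate and key were obtained from another source.
replace_tls_secret() {
  local namespace="$1" secret="$2" cert_file="$3" key_file="$4"
  if [ ! -r "$cert_file" ] || [ ! -r "$key_file" ]; then
    echo "certificate or key file not readable" >&2
    return 1
  fi
  # Render the secret client-side, then apply it so an existing secret is updated in place.
  kubectl create secret tls "$secret" -n "$namespace" \
    --cert="$cert_file" --key="$key_file" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Usage (not executed here):
#   replace_tls_secret <NAMESPACE> <TLS_SECRET_NAME> ./tls.crt ./tls.key
```

Rendering with `--dry-run=client` and piping to `kubectl apply` is used here because `kubectl create` alone fails when the secret already exists.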
Step 5: Verify New Certificate Is Installed
Why: Even after renewal, the secret must contain the new cert and the ingress controller must have loaded it.
# Confirm the TLS secret exists (AGE shows creation time, not the last update, so rely on the expiry check below)
kubectl get secret <TLS_SECRET_NAME> -n <NAMESPACE>
# Verify the new expiry date
kubectl get secret <TLS_SECRET_NAME> -n <NAMESPACE> \
-o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Check ingress is using the right secret
kubectl get ingress <INGRESS_NAME> -n <NAMESPACE> -o yaml | grep -A5 tls:
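If the secret holds the new certificate but the ingress still serves the old one, restart the ingress controller: kubectl rollout restart deployment/<INGRESS_CONTROLLER_DEPLOYMENT> -n <INGRESS_NAMESPACE>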
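One way to decide whether the ingress controller has loaded the new certificate is to compare the SHA-256 fingerprint of the certificate in the secret with the one served on the wire. The sketch below does that; all three function names are hypothetical, and it assumes the leaf certificate is the first entry in `tls.crt` and that the service listens on port 443.

```shell
# Hypothetical helpers: compare the certificate stored in a secret with the
# certificate actually served for a hostname.
secret_fingerprint() {    # args: namespace secret
  kubectl get secret "$2" -n "$1" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -fingerprint -sha256
}

served_fingerprint() {    # args: hostname
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256
}

ingress_serves_secret() { # args: namespace secret hostname
  local in_secret served
  in_secret="$(secret_fingerprint "$1" "$2")"
  served="$(served_fingerprint "$3")"
  # Require a non-empty fingerprint so two failed lookups do not compare as equal.
  [ -n "$in_secret" ] && [ "$in_secret" = "$served" ]
}

# Usage (not executed here):
#   ingress_serves_secret <NAMESPACE> <TLS_SECRET_NAME> <HOSTNAME> || echo "ingress still serves a different cert"
```

A mismatch after a successful renewal suggests the controller has not reloaded the secret, or that a load balancer in front of it terminates TLS with its own certificate.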
Step 6: Test TLS Handshake
Why: Confirms that the new certificate is being served to clients and the chain is valid.
# Test from outside the cluster (replace with actual hostname)
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
| openssl x509 -noout -dates -issuer -subject
# Check certificate chain validity
curl -v https://<HOSTNAME>/ 2>&1 | grep -E "SSL|TLS|certificate|expire|verify"
Expected output (abridged):
subject=CN=myapp.example.com
issuer=CN=Let's Encrypt
notAfter=Jun 18 12:00:00 2026 GMT
* SSL certificate verify ok.
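Chain validity can also be read from `openssl s_client`, which prints a "Verify return code" line. The following sketch extracts that line and treats anything other than `0 (ok)` as a failure. The function names are hypothetical, and the check uses the trust store of the machine it runs on, which may differ from what clients use.

```shell
# Hypothetical helper: pull the first "Verify return code" value out of
# `openssl s_client` output supplied on stdin.
verify_return_code() {
  sed -n 's/^ *Verify return code: *//p' | head -n 1
}

# Hypothetical helper: succeed only if the served chain verifies cleanly.
chain_is_valid() {        # args: hostname
  local result
  result="$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | verify_return_code)"
  case "$result" in
    "0 (ok)"*) return 0 ;;
    *) echo "chain verification: ${result:-no result}" >&2; return 1 ;;
  esac
}

# Usage (not executed here):
#   chain_is_valid <HOSTNAME>
```

A non-zero code with a renewed leaf certificate often points at a missing intermediate rather than at the expiry itself.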
Verification
# Confirm the issue is resolved — check live TLS and cert-manager status
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null \
| openssl x509 -noout -checkend 604800
Success looks like: the command prints "Certificate will not expire" and exits with code 0, and all rows of kubectl get certificate -A show READY=True.
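When several hostnames were affected, the same `-checkend` test can be looped. This sketch prints every hostname whose certificate expires within the window (or cannot be fetched) and returns non-zero if any did. The function name `hosts_expiring_soon` and the variable `EXPIRY_WINDOW` are hypothetical; port 443 is assumed.

```shell
# Hypothetical helper: run the -checkend test against several hostnames.
EXPIRY_WINDOW=604800      # 7 days, matching the alert threshold

hosts_expiring_soon() {   # args: hostname...
  local host failed=0
  for host in "$@"; do
    # openssl x509 -checkend exits non-zero if the cert expires within the window;
    # a failed connection also lands here because x509 receives no certificate.
    if ! echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
        | openssl x509 -noout -checkend "$EXPIRY_WINDOW" >/dev/null 2>&1; then
      echo "$host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (not executed here):
#   hosts_expiring_soon <HOSTNAME_1> <HOSTNAME_2>
```

No output and exit code 0 means every listed hostname passed the check.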
If still broken: Escalate — see below.
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | Platform/Security on-call | "TLS certificate expired on |
| Data loss suspected | Security team lead | "Possible MITM window during cert expiry on |
| Scope expanding to multiple services | SRE lead | "Multiple TLS certs expired or failing renewal, possible cert-manager or issuer outage" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes
- Restarting pods instead of renewing the cert: Restarting application pods does nothing for certificate expiry. The TLS secret is what needs updating, not the app.
- Not checking issuer configuration: Cert-manager renewal can fail silently if the ClusterIssuer has a stale token or wrong ACME server. Always check issuer status, not just certificate status.
- Forgetting to check all ingresses: A hostname may be used by multiple ingresses in different namespaces, each with its own TLS secret. Run kubectl get ingress -A to find all affected ingresses.
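The last point can be narrowed to a single hostname. The sketch below lists each ingress whose TLS hosts include the hostname, together with the secret names it references. It assumes kubectl's JSONPath output renders host lists as JSON arrays (for example `["myapp.example.com"]`); the function name `ingresses_for_host` is hypothetical. For an ingress with several TLS entries, all of its secret names are printed, so the host-to-secret pairing still needs a manual look.

```shell
# Hypothetical helper: print "namespace/name<TAB>secret(s)" for every ingress
# whose spec.tls hosts include the given hostname.
ingresses_for_host() {
  local host="$1"
  kubectl get ingress -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.tls[*].hosts}{"\t"}{.spec.tls[*].secretName}{"\n"}{end}' \
    | awk -F '\t' -v needle="\"$host\"" 'index($3, needle) { print $1 "/" $2 "\t" $4 }'
}

# Usage (not executed here):
#   ingresses_for_host <HOSTNAME>
```

Matching on the quoted hostname avoids false hits on longer names that merely end with the same string.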
Cross-References
- Topic Pack: cert-manager and TLS (deep background)
- Related Runbook: Load Balancer Health Check Failure
Wiki Navigation
Related Content
- Case Study: BMC Clock Skew Cert Failure (Case Study, L2) — TLS & PKI
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — TLS & PKI
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — TLS & PKI
- Case Study: SSL Cert Chain Incomplete (Case Study, L1) — TLS & PKI
- Case Study: User Auth Failing — OIDC Cert Expired, Cloud KMS Rotation (Case Study, L2) — TLS & PKI
- Deep Dive: TLS Handshake (deep_dive, L2) — TLS & PKI
- HTTP Protocol (Topic Pack, L0) — TLS & PKI
- Interview: Certificate Expired (Scenario, L2) — TLS & PKI
- Networking Deep Dive (Topic Pack, L1) — TLS & PKI
- Nginx & Web Servers (Topic Pack, L1) — TLS & PKI