Decision Tree: Certificate Is Expiring — What Do I Do?¶
Category: Operational Decisions
Starting Question: "A TLS certificate is expiring — what's the renewal path?"
Estimated traversal: 3-5 minutes
Domains: certificates, TLS, PKI, cert-manager, security, SRE
The Tree¶
A TLS certificate is expiring — what's the renewal path?
│
├── [Check 1] How much time until expiry?
│ │
│ ├── EXPIRED already (or < 24 hours)
│ │ └── ⚠️ THIS IS AN INCIDENT — treat as P1 if user-facing
│ │ ├── Is the service currently returning TLS errors to users?
│ │ │ ├── YES → ✅ EMERGENCY RENEWAL (see terminal action)
│ │ │ │ Page on-call immediately if not already on it
│ │ │ └── NO (cert expired but service still serving — cached/stapled)
│ │ │ └── Renewal is still URGENT — proceed as < 7 days
│ │
│ ├── < 7 days (critical — act today)
│ │ ├── [Check 2] What type of certificate?
│ │ │ ├── Let's Encrypt (via cert-manager)
│ │ │ │ ├── [Check 3] Is cert-manager Certificate object healthy?
│ │ │ │ │ ├── YES (Ready=True, no ACME errors) → auto-renewal in progress
│ │ │ │ │ │ └── → ✅ MONITOR — check again in 1 hour, act if not renewed
│ │ │ │ │ └── NO (Ready=False, errors present)
│ │ │ │ │ ├── [Check 4] What is the ACME challenge failing on?
│ │ │ │ │ │ ├── HTTP-01 challenge failing
│ │ │ │ │ │ │ ├── Is the ingress correctly routing /.well-known/acme-challenge/?
│ │ │ │ │ │ │ │ └── → ✅ FIX INGRESS ANNOTATION or path routing
│ │ │ │ │ │ ├── DNS-01 challenge failing
│ │ │ │ │ │ │ ├── Is the DNS provider secret correct?
│ │ │ │ │ │ │ │ ├── YES → check DNS propagation (TTL may be slow)
│ │ │ │ │ │ │ │ └── NO → ✅ UPDATE DNS PROVIDER SECRET + retry
│ │ │ │ │ │ ├── Rate limit hit (429 from ACME)
│ │ │ │ │ │ │ └── → ✅ USE STAGING ACME TO TEST, wait for the limit to reset (about 1 hour for failed validations; issuance limits can take longer)
│ │ │ │ │ │ └── Unknown error
│ │ │ │ │ │ └── → ✅ CHECK CERT-MANAGER LOGS + escalate to platform team
│ │ │ │
│ │ │ ├── ACM (AWS Certificate Manager)
│ │ │ │ ├── Is DNS validation CNAME present in Route53?
│ │ │ │ │ ├── YES → ACM auto-renews; check ACM console for status
│ │ │ │ │ └── NO → ✅ ADD VALIDATION CNAME to DNS (see terminal action)
│ │ │ │ └── Is the cert attached to an ALB / CloudFront?
│ │ │ │ ├── YES + DNS valid → auto-renewal should work; check ACM console
│ │ │ │ └── NO → cert won't auto-renew; attach to a service or renew manually
│ │ │ │
│ │ │ ├── Self-signed (internal, dev, or legacy service)
│ │ │ │ └── → ✅ MANUAL RENEWAL — generate new cert + update secret (see action)
│ │ │ │
│ │ │ └── CA-signed (OV/EV, internal PKI, vendor-issued)
│ │ │ ├── Is there an internal CA with automated issuance?
│ │ │ │ ├── YES → ✅ REQUEST NEW CERT from internal CA automation
│ │ │ │ └── NO → ✅ ESCALATE TO CA/VENDOR — OV/EV takes 1–7 business days
│ │ │
│ ├── 7–30 days (urgent — schedule renewal this week)
│ │ ├── [Check 2] What type of certificate?
│ │ │ ├── Let's Encrypt → [Check 3] (same as above, should auto-renew at 2/3 of lifetime)
│ │ │ ├── ACM → verify DNS validation, then auto-renews
│ │ │ ├── Self-signed → ✅ SCHEDULE MANUAL RENEWAL in next sprint
│ │ │ └── CA-signed (OV/EV) → ✅ INITIATE REQUEST NOW — lead time may be > 7 days
│ │ │
│ │ ├── [Check 5] Does the cert cover all required SANs?
│ │ │ ├── YES → proceed with renewal
│ │ │ └── NO (new domains added, old domains removed)
│ │ │ └── ✅ RENEWAL IS A REISSUANCE — coordinate SAN update with domain owners
│ │ │
│ │ └── [Check 6] Does the cert cover all required domains / wildcard?
│ │ ├── Single domain cert → confirm domain still in use
│ │ └── Wildcard cert → confirm all subdomains are still needed
│ │
│ └── > 30 days (planned — no urgency)
│ ├── Is auto-renewal configured and working?
│ │ ├── YES → log it and check again at 30-day mark
│ │ └── NO → ✅ FIX AUTO-RENEWAL NOW while you have time
│ │
│ └── [Check 7] Is this cert in your certificate inventory?
│ ├── YES → schedule renewal reminder for 30-day mark
│ └── NO → ✅ ADD TO CERT INVENTORY (prevent future surprise expiries)
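The tree notes that Let's Encrypt certificates managed by cert-manager should auto-renew at 2/3 of their lifetime. A minimal sketch of that arithmetic follows, assuming no explicit renewBefore is set on the Certificate; the function names are illustrative, not part of cert-manager.

```python
from datetime import datetime, timedelta, timezone


def expected_renewal_time(not_before: datetime, not_after: datetime) -> datetime:
    """Point at which renewal should start: 2/3 of the way through the lifetime."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


def renewal_overdue(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """True when the cert is past its expected renewal point but has not been replaced."""
    return now >= expected_renewal_time(not_before, not_after)


# Example: a 90-day certificate issued on 2025-01-01 (dates are made up)
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
print(expected_renewal_time(issued, expires))
```

If a certificate is past this point and still shows the old expiry date, that is a signal to go to Check 3 rather than keep waiting.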
Node Details¶
Check 1: Time until expiry¶
Command/method:
# Check cert expiry from command line
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
| openssl x509 -noout -dates
# Check all certs in a namespace
kubectl get certificate -n production -o wide
# Check cert expiry across all namespaces
kubectl get certificate --all-namespaces \
-o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}' \
| sort -k3
# Days until expiry for a specific domain
python3 -c "
import ssl, socket, datetime
ctx = ssl.create_default_context()
conn = ctx.wrap_socket(socket.socket(), server_hostname='example.com')
conn.connect(('example.com', 443))
cert = conn.getpeercert()
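# notAfter is reported in GMT; attach UTC so both datetimes are timezone-aware
expiry = datetime.datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z').replace(tzinfo=datetime.timezone.utc)
now = datetime.datetime.now(datetime.timezone.utc)
print(f'Expires: {expiry} ({(expiry - now).days} days)')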
"
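The time thresholds from Check 1 can be sketched as a small classifier. This is an illustration only: the bucket names are hypothetical labels for the tree's four branches, and the treatment of the exact 7-day and 30-day boundaries is an assumption.

```python
from datetime import datetime, timedelta, timezone


def classify_urgency(not_after: datetime, now: datetime) -> str:
    """Map time-until-expiry onto the four branches of Check 1."""
    remaining = not_after - now
    if remaining < timedelta(hours=24):
        return "incident"   # expired already, or < 24 hours left
    if remaining < timedelta(days=7):
        return "critical"   # act today
    if remaining <= timedelta(days=30):
        return "urgent"     # schedule renewal this week
    return "planned"        # > 30 days, no urgency


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days in (-1, 3, 20, 90):
    print(days, classify_urgency(now + timedelta(days=days), now))
```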
Check 2: Certificate type identification¶
Command/method:
# Check cert-manager Certificate object
kubectl get certificate -n production -o yaml | grep -A10 "issuerRef:"
# Look for: issuerRef.kind = ClusterIssuer and issuerRef.name = letsencrypt-prod
# Check if ACM cert (look for ARN)
aws acm list-certificates --query 'CertificateSummaryList[?DomainName==`example.com`]'
# Check certificate issuer from the live cert
echo | openssl s_client -connect example.com:443 2>/dev/null \
| openssl x509 -noout -issuer
# Let's Encrypt: "issuer= /C=US/O=Let's Encrypt/CN=R3" (newer intermediates use other names, e.g. R10 or E5)
# ACM: "issuer= /C=US/O=Amazon/OU=Server CA 1B/CN=Amazon"
# Self-signed: issuer == subject
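A cert-manager Certificate with an issuerRef means renewal is automated. An ACM ARN means renewal is automated only if the DNS validation CNAME is present. Self-signed and private-CA certificates are renewed manually unless the internal CA has automated issuance.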
Common pitfall: A certificate deployed to Kubernetes as a Secret (TLS type) may have been originally issued by cert-manager but is now manually managed (cert-manager Certificate object was deleted). Check for both the Secret and the Certificate object.
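The issuer comparison above can be sketched as a rough heuristic. The category names and the `classify_issuer` helper are hypothetical, and an Amazon issuer only suggests ACM; it does not prove the certificate is ACM-managed.

```python
def normalize_dn(dn: str) -> str:
    """Collapse 'O = Amazon' style spacing so both OpenSSL output styles compare equal."""
    return dn.replace(" = ", "=").strip()


def classify_issuer(issuer: str, subject: str) -> str:
    """Guess the certificate type from the issuer and subject distinguished names."""
    issuer, subject = normalize_dn(issuer), normalize_dn(subject)
    if issuer == subject:
        return "self-signed"      # issuer == subject
    if "O=Let's Encrypt" in issuer:
        return "lets-encrypt"
    if "O=Amazon" in issuer:
        return "amazon"           # likely ACM, confirm with the ARN lookup
    return "ca-signed"            # vendor or internal PKI


print(classify_issuer("/C=US/O=Let's Encrypt/CN=R3", "/CN=example.com"))
```

Pass in the distinguished names without the leading `issuer=` or `subject=` label.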
Check 3: cert-manager Certificate object health¶
Command/method:
# Check Certificate status
kubectl describe certificate myapp-tls -n production
# Look for: Status: True, Type: Ready
# And: Message: Certificate is up to date and has not expired
# Check CertificateRequest objects
kubectl get certificaterequest -n production -l \
"cert-manager.io/certificate-name=myapp-tls" --sort-by=.metadata.creationTimestamp
# Check Order objects (for ACME)
kubectl get order -n production
# Check Challenge objects
kubectl get challenge -n production
kubectl describe challenge -n production | grep -A10 "Status:"
Ready=True with a Not After date in the future means the Certificate is healthy. Any condition with status False, or Challenge objects stuck in pending, means auto-renewal is failing.
Common pitfall: A Certificate shows Ready=True but the Secret was manually updated with an older cert. The Certificate object status reflects the cert-manager state, not necessarily the actual Secret contents. Verify the Secret's cert expiry independently.
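The health rule for Check 3 can be sketched against the JSON form of a Certificate object (for example the output of `kubectl get certificate myapp-tls -n production -o json`). The helper name is illustrative and the sample object is made up. As the pitfall above notes, this reflects cert-manager's view only; the Secret contents still need a separate check.

```python
from datetime import datetime, timezone


def certificate_is_healthy(cert: dict, now: datetime) -> bool:
    """Ready=True and status.notAfter in the future, per Check 3."""
    status = cert.get("status", {})
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    not_after = status.get("notAfter")
    if not ready or not not_after:
        return False
    # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2025-06-01T12:00:00Z
    expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return expiry > now


sample = {
    "status": {
        "conditions": [{"type": "Ready", "status": "True"}],
        "notAfter": "2025-06-01T12:00:00Z",
    }
}
print(certificate_is_healthy(sample, datetime(2025, 5, 1, tzinfo=timezone.utc)))
```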
Check 4: ACME challenge diagnosis¶
Command/method:
# HTTP-01 challenge: test that the challenge path is reachable
# cert-manager places a token at /.well-known/acme-challenge/<token>
curl -v http://example.com/.well-known/acme-challenge/test-token
# Check Ingress annotations for HTTP-01 solver
kubectl get ingress -n production -o yaml | grep -A5 "acme-challenge"
# DNS-01 challenge: verify TXT record was created
dig _acme-challenge.example.com TXT +short
# Check DNS provider secret is valid
kubectl get secret -n cert-manager | grep "route53\|clouddns\|cloudflare"
kubectl get secret route53-credentials -n cert-manager -o jsonpath='{.data.secret-access-key}' | base64 -d | wc -c
# cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --since=1h | grep -i "error\|fail\|challenge"
# ACME rate limit check (429 responses)
kubectl logs -n cert-manager -l app=cert-manager --since=1h | grep "429\|rate limit"
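When checking a DNS-01 challenge with dig, the TXT record to query is derived from the domain being validated. A small sketch of that derivation follows; the helper name is hypothetical. A wildcard name is validated at its base domain, so `*.example.com` and `example.com` share the same record name.

```python
def dns01_record_name(domain: str) -> str:
    """Return the TXT record name an ACME DNS-01 challenge is published under."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base.rstrip('.')}"


for name in ("example.com", "*.example.com", "api.example.com"):
    print(name, "->", dns01_record_name(name))
```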
Check 5: SAN coverage verification¶
Command/method:
# List all SANs in the current certificate
echo | openssl s_client -connect example.com:443 2>/dev/null \
| openssl x509 -noout -text | grep -A5 "Subject Alternative Name"
# Compare with what's in the cert-manager Certificate spec
kubectl get certificate myapp-tls -n production \
-o jsonpath='{.spec.dnsNames[*]}'
# Check for recently added virtual hosts / subdomains
kubectl get ingress -n production \
-o jsonpath='{range .items[*]}{.spec.rules[*].host}{"\n"}{end}' | sort -u
If api.example.com was added to an ingress but not to the certificate's dnsNames, that subdomain will fail TLS.
Common pitfall: Wildcard certs (*.example.com) do NOT cover the apex domain (example.com). If both are needed, both must be listed explicitly.
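The SAN comparison, including the wildcard rule above, can be sketched as follows. The function names are illustrative. The inputs are assumed to be the SAN list from the certificate and the host list from the ingress query.

```python
def san_covers(san: str, host: str) -> bool:
    """True if a single SAN entry covers the given host name."""
    san, host = san.lower(), host.lower()
    if san.startswith("*."):
        suffix = san[1:]  # "*.example.com" -> ".example.com"
        if not host.endswith(suffix):
            return False
        label = host[: -len(suffix)]
        # A wildcard matches exactly one left-most label: not the apex, not nested names
        return label != "" and "." not in label
    return san == host


def uncovered_hosts(sans, hosts):
    """Hosts served by ingresses that no SAN on the certificate covers."""
    return sorted(h for h in set(hosts) if not any(san_covers(s, h) for s in sans))


print(uncovered_hosts(["*.example.com"], ["example.com", "api.example.com", "a.b.example.com"]))
```

Any host returned here needs to be added to dnsNames before renewal, which makes the renewal a reissuance.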
Terminal Actions¶
✅ Action: Emergency Renewal (expired or < 24 hours)¶
Do:
# 1. Generate a new certificate immediately (manual path)
# For Let's Encrypt via certbot (standalone mode — temporary HTTP server)
certbot certonly --standalone \
-d example.com \
--email admin@example.com \
--agree-tos \
--non-interactive
# 2. Update Kubernetes Secret with new cert
kubectl create secret tls myapp-tls \
--cert=/etc/letsencrypt/live/example.com/fullchain.pem \
--key=/etc/letsencrypt/live/example.com/privkey.pem \
-n production \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Roll pods to pick up new cert (if mounted as Secret volume)
kubectl rollout restart deployment/myapp -n production
# 4. Verify new cert is live
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates
# 5. Page on-call if not already handling — this was an incident
Verify: the notAfter date is in the future, the TLS handshake completes without errors, and the browser shows a valid cert.
Runbook: cert-emergency-renewal.md
✅ Action: Verify cert-manager Certificate Object and Fix Challenge¶
Do:
# 1. Delete stuck Challenge/Order to force re-attempt
kubectl delete challenge -n production --all
kubectl delete order -n production --all
# 2. Force cert-manager to re-issue the Certificate (requires the cmctl CLI)
cmctl renew myapp-tls -n production
# OR: delete and recreate the Certificate object from your manifest
kubectl delete certificate myapp-tls -n production
kubectl apply -f kubernetes/production/certificate-myapp.yaml
# 3. Watch the new Order and Challenge
kubectl get challenge -n production -w
# 4. Check cert-manager controller logs in real time
kubectl logs -n cert-manager -l app=cert-manager -f | grep -i "example.com"
# 5. Verify certificate issued
kubectl get certificate myapp-tls -n production -o wide
Verify: Ready=True and Not After is at least 60 days in the future.
Runbook: cert-manager-troubleshooting.md
✅ Action: Add ACM DNS Validation CNAME¶
Do:
# 1. Get the CNAME name and value from ACM
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abc123 \
--query 'Certificate.DomainValidationOptions[*].ResourceRecord'
# 2. Add the CNAME to Route53 (or your DNS provider)
# Use the exact Name and Value returned in step 1. ACM validation records are
# per-certificate tokens; the _VALIDATION_NAME / _VALIDATION_VALUE strings below are placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "_VALIDATION_NAME.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "_VALIDATION_VALUE.acm-validations.aws"}]
      }
    }]
  }'
# 3. Wait for DNS propagation (up to 5 minutes for low TTL)
dig _VALIDATION_NAME.example.com CNAME +short
# 4. Check ACM console — status should change to "Issued" within 30 minutes
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abc123 \
--query 'Certificate.Status'
Verify: Status is ISSUED. This CNAME persists and enables future auto-renewals, so do not remove it.
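To avoid copying the validation record by hand, the change batch can be generated from the ResourceRecord that `aws acm describe-certificate` returns. This is a sketch under stated assumptions: the helper name is hypothetical, the sample record values are made up, and it uses UPSERT instead of CREATE so that re-running it does not fail when the record already exists.

```python
import json


def build_change_batch(resource_record: dict, ttl: int = 300) -> dict:
    """Build a Route53 change batch from an ACM ResourceRecord (Name, Type, Value)."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": resource_record["Name"],
                "Type": resource_record["Type"],
                "TTL": ttl,
                "ResourceRecords": [{"Value": resource_record["Value"]}],
            },
        }]
    }


# Made-up example of one entry from step 1's output
record = {
    "Name": "_a1b2c3.example.com.",
    "Type": "CNAME",
    "Value": "_d4e5f6.acm-validations.aws.",
}
print(json.dumps(build_change_batch(record), indent=2))
```

The printed JSON can be saved to a file and passed to `--change-batch file://...` in step 2.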
✅ Action: Manual Renewal for Self-Signed or Internal CA¶
Do:
# 1. Generate new private key and CSR
openssl genrsa -out new.key 4096
openssl req -new -key new.key -out new.csr \
-subj "/CN=example.com/O=MyOrg/C=US" \
-addext "subjectAltName=DNS:example.com,DNS:www.example.com"
# 2a. Self-signed: sign with your own key
openssl x509 -req -days 365 -in new.csr -signkey new.key -out new.crt \
-extfile <(printf "subjectAltName=DNS:example.com,DNS:www.example.com")
# 2b. Internal CA: send CSR to CA
# Submit new.csr to your internal CA portal or team
# 3. Update Kubernetes Secret
kubectl create secret tls myapp-tls \
--cert=new.crt --key=new.key \
-n production \
--dry-run=client -o yaml | kubectl apply -f -
# 4. Verify expiry
openssl x509 -noout -dates -in new.crt
# 5. Restart pods to load new cert
kubectl rollout restart deployment/myapp -n production
✅ Action: Add to Certificate Inventory¶
Do:
# Add to your cert tracking spreadsheet / database / wiki
# Minimum fields:
# - Domain(s) / SANs
# - Issuer type (Let's Encrypt, ACM, self-signed, CA vendor)
# - Expiry date
# - Owner / team
# - Auto-renewal: yes/no + mechanism
# - Kubernetes namespace and Secret name if applicable
# - Last renewed date
# - Alert: configured at 30 days before expiry
# Set up a monitoring alert for this cert
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry-myapp
  namespace: monitoring
spec:
  groups:
  - name: cert-expiry
    rules:
    - alert: CertExpiringIn30Days
      expr: probe_ssl_earliest_cert_expiry{job="blackbox",instance="https://example.com"} - time() < 30 * 24 * 3600
      labels:
        severity: warning
      annotations:
        summary: "TLS cert for example.com expires in < 30 days"
EOF
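If the inventory lives in code rather than a spreadsheet, the minimum fields listed above could be modeled roughly like this. The class and field names are hypothetical and the sample entry is made up; adapt them to whatever tracking system is in use.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class CertRecord:
    domains: List[str]                 # Domain(s) / SANs
    issuer_type: str                   # Let's Encrypt, ACM, self-signed, CA vendor
    expiry: date
    owner: str                         # Owner / team
    auto_renewal: bool
    renewal_mechanism: str = ""        # e.g. cert-manager, ACM managed renewal
    namespace: Optional[str] = None    # Kubernetes namespace, if applicable
    secret_name: Optional[str] = None  # Kubernetes Secret name, if applicable
    last_renewed: Optional[date] = None
    alert_days_before: int = 30        # alert configured at 30 days before expiry

    def alert_date(self) -> date:
        return self.expiry - timedelta(days=self.alert_days_before)


def due_for_attention(inventory, today: date):
    """Records whose alert window has opened."""
    return [r for r in inventory if today >= r.alert_date()]


inventory = [
    CertRecord(["example.com"], "Let's Encrypt", date(2025, 6, 1), "platform", True,
               "cert-manager", "production", "myapp-tls"),
]
print(due_for_attention(inventory, date(2025, 5, 10)))
```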
⚠️ Warning: OV/EV Certificate Lead Time¶
When: The expiring cert is an OV (Organization Validated) or EV (Extended Validation) certificate, which requires manual verification by the CA.
Risk: OV certs take 1–7 business days; EV certs can take up to 14 business days. If you wait until < 7 days before expiry, you will have a gap.
Mitigation: Initiate OV/EV renewal at the 30-day mark, not the 7-day mark. If < 7 days remain, request emergency CA expediting and simultaneously deploy a Let's Encrypt cert as a stopgap (if DV is acceptable for your trust requirements).
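The lead-time risk can be made concrete by walking back from the expiry date in business days. This sketch counts Monday to Friday only and ignores public holidays, which is an assumption; the function name is illustrative and the date is made up.

```python
from datetime import date, timedelta


def latest_request_date(expiry: date, lead_business_days: int) -> date:
    """Latest day to submit the request so the CA's lead time ends by expiry."""
    day = expiry
    remaining = lead_business_days
    while remaining > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return day


expiry = date(2025, 3, 14)
print("OV (7 business days):", latest_request_date(expiry, 7))
print("EV (14 business days):", latest_request_date(expiry, 14))
```

Because business days skip weekends, a 7-business-day lead time already consumes 9 calendar days in this example, which is why the 7-day mark is too late.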
Edge Cases¶
- CDN-terminated TLS: When a CDN (Cloudflare, Fastly, Akamai) terminates TLS, the CDN's certificate is what users see. The origin certificate (between CDN and your servers) may be different and renew on a separate schedule. Monitor both.
- Mutual TLS (mTLS) client certificates: Client certificates used for mTLS (service-to-service auth) expire just like server certs but are harder to detect because they don't serve browser-visible errors. Add client certs to your inventory.
- Certificate pinning: Some mobile apps and older services pin the certificate or its public key hash. Renewing the cert will break pinned connections. Coordinate with mobile release cycles before renewing pinned certs.
- cert-manager rate limits in staging: Let's Encrypt's production ACME has rate limits (50 certs/registered domain/week). When testing cert-manager configuration changes, always use the staging ACME endpoint (letsencrypt-staging) to avoid burning production rate limits.
- Wildcard renewal requires DNS-01: Let's Encrypt wildcard certs (*.example.com) can only be validated via DNS-01. HTTP-01 does not work for wildcards. Ensure your cert-manager ClusterIssuer uses a DNS solver for wildcard certs.
Cross-References¶
- Topic Packs: Certificates, security, cert-manager
- Runbooks: cert-emergency-renewal.md, cert-manager-troubleshooting.md
- Related trees: should-i-page.md, config-change.md