Decision Tree: Certificate Is Expiring — What Do I Do?

Category: Operational Decisions
Starting Question: "A TLS certificate is expiring — what's the renewal path?"
Estimated traversal: 3-5 minutes
Domains: certificates, TLS, PKI, cert-manager, security, SRE

The Tree

A TLS certificate is expiring — what's the renewal path?
│
├── [Check 1] How much time until expiry?
│   │
│   ├── EXPIRED already (or < 24 hours)
│   │   └── ⚠️ THIS IS AN INCIDENT — treat as P1 if user-facing
│   │       ├── Is the service currently returning TLS errors to users?
│   │       │   ├── YES → ✅ EMERGENCY RENEWAL (see terminal action)
│   │       │   │         Page on-call immediately if not already on it
│   │       │   └── NO (cert expired but service still serving — cached/stapled)
│   │       │       └── Renewal is still URGENT — proceed as < 7 days
│   │
│   ├── < 7 days (critical — act today)
│   │   ├── [Check 2] What type of certificate?
│   │   │   ├── Let's Encrypt (via cert-manager)
│   │   │   │   ├── [Check 3] Is cert-manager Certificate object healthy?
│   │   │   │   │   ├── YES (Ready=True, no ACME errors) → auto-renewal in progress
│   │   │   │   │   │   └── → ✅ MONITOR — check again in 1 hour, act if not renewed
│   │   │   │   │   └── NO (Ready=False, errors present)
│   │   │   │   │       ├── [Check 4] What is the ACME challenge failing on?
│   │   │   │   │       │   ├── HTTP-01 challenge failing
│   │   │   │   │       │   │   ├── Is the ingress correctly routing /.well-known/acme-challenge/?
│   │   │   │   │       │   │   │   └── → ✅ FIX INGRESS ANNOTATION or path routing
│   │   │   │   │       │   ├── DNS-01 challenge failing
│   │   │   │   │       │   │   ├── Is the DNS provider secret correct?
│   │   │   │   │       │   │   │   ├── YES → check DNS propagation (TTL may be slow)
│   │   │   │   │       │   │   │   └── NO → ✅ UPDATE DNS PROVIDER SECRET + retry
│   │   │   │   │       │   ├── Rate limit hit (429 from ACME)
│   │   │   │   │       │   │   └── → ✅ USE STAGING ACME TO TEST; wait for the limit to reset (about 1 hour for failed validations, up to 7 days for certs per domain)
│   │   │   │   │       │   └── Unknown error
│   │   │   │   │       │       └── → ✅ CHECK CERT-MANAGER LOGS + escalate to platform team
│   │   │   │
│   │   │   ├── ACM (AWS Certificate Manager)
│   │   │   │   ├── Is DNS validation CNAME present in Route53?
│   │   │   │   │   ├── YES → ACM auto-renews; check ACM console for status
│   │   │   │   │   └── NO → ✅ ADD VALIDATION CNAME to DNS (see terminal action)
│   │   │   │   └── Is the cert attached to an ALB / CloudFront?
│   │   │   │       ├── YES + DNS valid → auto-renewal should work; check ACM console
│   │   │   │       └── NO → cert won't auto-renew; attach to a service or renew manually
│   │   │   │
│   │   │   ├── Self-signed (internal, dev, or legacy service)
│   │   │   │   └── → ✅ MANUAL RENEWAL — generate new cert + update secret (see action)
│   │   │   │
│   │   │   └── CA-signed (OV/EV, internal PKI, vendor-issued)
│   │   │       ├── Is there an internal CA with automated issuance?
│   │   │       │   ├── YES → ✅ REQUEST NEW CERT from internal CA automation
│   │   │       │   └── NO → ✅ ESCALATE TO CA/VENDOR — OV/EV takes 1–7 business days
│   │   │
│   ├── 7–30 days (urgent — schedule renewal this week)
│   │   ├── [Check 2] What type of certificate?
│   │   │   ├── Let's Encrypt → [Check 3] (same as above, should auto-renew at 2/3 of lifetime)
│   │   │   ├── ACM → verify DNS validation, then auto-renews
│   │   │   ├── Self-signed → ✅ SCHEDULE MANUAL RENEWAL in next sprint
│   │   │   └── CA-signed (OV/EV) → ✅ INITIATE REQUEST NOW — lead time may be > 7 days
│   │   │
│   │   ├── [Check 5] Does the cert cover all required SANs?
│   │   │   ├── YES → proceed with renewal
│   │   │   └── NO (new domains added, old domains removed)
│   │   │       └── ✅ RENEWAL IS A REISSUANCE — coordinate SAN update with domain owners
│   │   │
│   │   └── [Check 6] Does the cert cover all required domains / wildcard?
│   │       ├── Single domain cert → confirm domain still in use
│   │       └── Wildcard cert → confirm all subdomains are still needed
│   │
│   └── > 30 days (planned — no urgency)
│       ├── Is auto-renewal configured and working?
│       │   ├── YES → log it and check again at 30-day mark
│       │   └── NO → ✅ FIX AUTO-RENEWAL NOW while you have time
│       │
│       └── [Check 7] Is this cert in your certificate inventory?
│           ├── YES → schedule renewal reminder for 30-day mark
│           └── NO → ✅ ADD TO CERT INVENTORY (prevent future surprise expiries)

Node Details

Check 1: Time until expiry

Command/method:

# Check cert expiry from command line
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# Check all certs in a namespace
kubectl get certificate -n production -o wide

# Check cert expiry across all namespaces
kubectl get certificate --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}' \
  | sort -k3

# Days until expiry for a specific domain
python3 -c "
import ssl, socket, datetime
ctx = ssl.create_default_context()
conn = ctx.wrap_socket(socket.socket(), server_hostname='example.com')
conn.connect(('example.com', 443))
cert = conn.getpeercert()
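# strptime returns a naive datetime; attach UTC so it can be compared with an aware "now"
expiry = datetime.datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z').replace(tzinfo=datetime.timezone.utc)
print(f'Expires: {expiry} ({(expiry - datetime.datetime.now(datetime.timezone.utc)).days} days)')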
"

What you're looking for: Days until expiry drives the urgency tier. Fewer than 7 days means act today, regardless of other priorities.

Common pitfall: The certificate that appears in your monitoring may be the CDN or load balancer certificate, not the origin certificate. Check both; CDN certs are often managed separately.
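The tier boundaries from the tree can be sketched as a small function. This is a minimal illustration, not part of any tool: the function name and the tier labels ("incident", "critical", "urgent", "planned") are hypothetical names chosen here to mirror the four branches of Check 1.

```python
from datetime import datetime, timedelta, timezone


def urgency_tier(not_after: datetime, now: datetime) -> str:
    """Map remaining certificate lifetime to the tree's four tiers.

    Tier labels are hypothetical; the thresholds come from Check 1.
    Both arguments must be timezone-aware datetimes.
    """
    remaining = not_after - now
    if remaining < timedelta(hours=24):
        return "incident"   # expired already, or less than 24 hours left
    if remaining < timedelta(days=7):
        return "critical"   # act today
    if remaining <= timedelta(days=30):
        return "urgent"     # schedule renewal this week
    return "planned"        # no urgency
```

Feeding it the notAfter value from any of the commands above (parsed as UTC) gives the branch of the tree to follow.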

Check 2: Certificate type identification

Command/method:

# Check cert-manager Certificate object
kubectl get certificate -n production -o yaml | grep -A10 "issuerRef:"
# Look for: issuerRef.kind = ClusterIssuer and issuerRef.name = letsencrypt-prod

# Check if ACM cert (look for ARN)
aws acm list-certificates --query 'CertificateSummaryList[?DomainName==`example.com`]'

# Check certificate issuer from the live cert
echo | openssl s_client -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -issuer
# Let's Encrypt: issuer contains "O = Let's Encrypt" (CN is an intermediate such as R3, R10, R11, E5, E6)
# ACM: issuer contains "O = Amazon" (for example, older certs show "OU = Server CA 1B, CN = Amazon")
# Self-signed: issuer == subject

What you're looking for: An issuerRef on a cert-manager Certificate object means issuance is automated. An ACM ARN means renewal is automated if the DNS validation CNAME is present. Self-signed or private CA usually means manual renewal, unless the internal CA has automated issuance.

Common pitfall: A certificate deployed to Kubernetes as a Secret (TLS type) may have been originally issued by cert-manager but is now manually managed (the cert-manager Certificate object was deleted). Check for both the Secret and the Certificate object.
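The issuer heuristics above can be sketched in code. This is a rough classifier under stated assumptions: the function name and return labels are hypothetical, it only inspects the issuer and subject strings printed by openssl, and an Amazon issuer indicates an Amazon-issued cert but does not by itself prove the cert is managed by ACM in your account.

```python
def classify_issuer(issuer: str, subject: str) -> str:
    """Guess the certificate type from openssl issuer/subject strings.

    Heuristic only: labels are hypothetical and string matching can misfire
    on unusual organization names.
    """
    if issuer.strip() == subject.strip():
        return "self-signed"          # issuer == subject
    lowered = issuer.lower()
    if "let's encrypt" in lowered:
        return "lets-encrypt"
    # Strip spaces so both "O=Amazon" and "O = Amazon" formats match.
    if "o=amazon" in lowered.replace(" ", ""):
        return "acm"
    return "ca-signed"                # OV/EV, internal PKI, vendor-issued
```

Treat the result as a starting point for Check 2 and confirm it against the cert-manager Certificate object or the ACM ARN.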

Check 3: cert-manager Certificate object health

Command/method:

# Check Certificate status
kubectl describe certificate myapp-tls -n production
# Look for: Status: True, Type: Ready
# And: Message: Certificate is up to date and has not expired

# Check CertificateRequest objects
# (cert-manager.io/certificate-name is typically an annotation, not a label, so filter by name prefix)
kubectl get certificaterequest -n production \
  --sort-by=.metadata.creationTimestamp | grep myapp-tls

# Check Order objects (for ACME)
kubectl get order -n production

# Check Challenge objects
kubectl get challenge -n production
kubectl describe challenge -n production | grep -A10 "Status:"

What you're looking for: Ready=True and Not After in the future means healthy. A Ready=False condition or Challenge objects that stay pending mean auto-renewal is failing.

Common pitfall: A Certificate shows Ready=True but the Secret was manually updated with an older cert. The Certificate object status reflects the cert-manager state, not necessarily the actual Secret contents. Verify the Secret's cert expiry independently.
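The health rule above (Ready=True and notAfter in the future) can be checked against the JSON from `kubectl get certificate <name> -o json`. A minimal sketch, assuming the object has already been loaded into a dict; the function name is hypothetical, and per the pitfall it says nothing about what is actually in the Secret.

```python
from datetime import datetime, timezone


def certificate_is_healthy(cert: dict, now: datetime) -> bool:
    """Return True when the Certificate is Ready and status.notAfter is in the future.

    `cert` is the parsed JSON of a cert-manager Certificate object.
    """
    status = cert.get("status", {})
    ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    not_after = status.get("notAfter")
    if not ready or not not_after:
        return False
    # status.notAfter is an RFC 3339 UTC timestamp, e.g. 2025-06-01T12:00:00Z
    expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return expiry > now
```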

Check 4: ACME challenge diagnosis

Command/method:

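# HTTP-01 challenge: test that the challenge path is reachable
# cert-manager places a token at /.well-known/acme-challenge/<token>
# (a made-up token will normally not return 200; use the token from the pending Challenge to expect 200)
curl -v http://example.com/.well-known/acme-challenge/test-token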

# Check Ingress annotations for HTTP-01 solver
kubectl get ingress -n production -o yaml | grep -A5 "acme-challenge"

# DNS-01 challenge: verify TXT record was created
dig _acme-challenge.example.com TXT +short

# Check DNS provider secret is valid
kubectl get secret -n cert-manager | grep "route53\|clouddns\|cloudflare"
kubectl get secret route53-credentials -n cert-manager -o jsonpath='{.data.secret-access-key}' | base64 -d | wc -c

# cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --since=1h | grep -i "error\|fail\|challenge"

# ACME rate limit check (429 responses)
kubectl logs -n cert-manager -l app=cert-manager --since=1h | grep "429\|rate limit"

What you're looking for: HTTP-01: the challenge URL must return 200. DNS-01: the TXT record must be resolvable by ACME servers. Rate limit: check whether you've exceeded Let's Encrypt's limit of 50 certificates per registered domain per week.

Common pitfall: HTTP-01 fails when the service is behind a WAF that blocks the validation request, or when an HTTP-to-HTTPS redirect sends the challenge path somewhere that does not reach the solver. Use DNS-01 for services where the challenge path cannot be routed reliably.
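The rate-limit check is simple window arithmetic. The sketch below counts issuances in the trailing 7 days against the limit of 50 quoted above; the function names are hypothetical, and where the issuance timestamps come from (for example, your inventory or a Certificate Transparency search) is an assumption left to the reader.

```python
from datetime import datetime, timedelta, timezone


def issuances_in_window(issued_at, now, window=timedelta(days=7)):
    """Count issuance timestamps that fall inside the trailing window."""
    return sum(1 for t in issued_at if now - window < t <= now)


def at_rate_limit(issued_at, now, limit=50):
    """True when the number of recent issuances has reached the limit."""
    return issuances_in_window(issued_at, now) >= limit
```

If this returns True for a registered domain, retrying production issuance is unlikely to help until older issuances age out of the window.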

Check 5: SAN coverage verification

Command/method:

# List all SANs in the current certificate
echo | openssl s_client -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -text | grep -A5 "Subject Alternative Name"

# Compare with what's in the cert-manager Certificate spec
kubectl get certificate myapp-tls -n production \
  -o jsonpath='{.spec.dnsNames[*]}'

# Check for recently added virtual hosts / subdomains
kubectl get ingress -n production \
  -o jsonpath='{range .items[*]}{.spec.rules[*].host}{"\n"}{end}' | sort -u

What you're looking for: Every domain the service answers to must be in the cert's SANs. If api.example.com was added to an ingress but not to the certificate's dnsNames, that subdomain will fail TLS.

Common pitfall: Wildcard certs (*.example.com) do NOT cover the apex domain (example.com). If both are needed, both must be listed explicitly.
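The comparison between SANs and ingress hosts, including the wildcard pitfall, can be sketched as follows. The function names are hypothetical; the matching rule assumed here is the usual one, where a wildcard covers exactly one leftmost label and never the apex.

```python
def san_matches(san: str, host: str) -> bool:
    """Return True if a single SAN entry covers the given hostname."""
    san = san.lower().rstrip(".")
    host = host.lower().rstrip(".")
    if san.startswith("*."):
        suffix = san[1:]                      # "*.example.com" -> ".example.com"
        if not host.endswith(suffix):
            return False                      # also rejects the apex domain
        label = host[: -len(suffix)]
        return bool(label) and "." not in label   # exactly one extra label
    return san == host


def uncovered_hosts(sans, hosts):
    """List hostnames that no SAN entry covers, sorted for stable output."""
    return sorted(h for h in set(hosts) if not any(san_matches(s, h) for s in sans))
```

Passing the SAN list from the openssl output and the host list from the ingress query to `uncovered_hosts` gives the names that would fail TLS after renewal.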

Terminal Actions

✅ Action: Emergency Renewal (expired or < 24 hours)

Do:

# 1. Generate a new certificate immediately (manual path)
# For Let's Encrypt via certbot (standalone mode: temporary HTTP server)
# Run this on a host that example.com resolves to, with port 80 reachable from the internet
certbot certonly --standalone \
  -d example.com \
  --email admin@example.com \
  --agree-tos \
  --non-interactive

# 2. Update Kubernetes Secret with new cert
kubectl create secret tls myapp-tls \
  --cert=/etc/letsencrypt/live/example.com/fullchain.pem \
  --key=/etc/letsencrypt/live/example.com/privkey.pem \
  -n production \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Roll pods to pick up new cert (if mounted as Secret volume)
kubectl rollout restart deployment/myapp -n production

# 4. Verify new cert is live
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates

# 5. Page on-call if not already handling — this was an incident

Verify: The notAfter date is in the future, the TLS handshake completes without errors, and the browser shows a valid cert.

Runbook: cert-emergency-renewal.md

✅ Action: Verify cert-manager Certificate Object and Fix Challenge

Do:

# 1. Delete stuck Challenge/Order to force re-attempt
kubectl delete challenge -n production --all
kubectl delete order -n production --all

# 2. Force cert-manager to re-issue the Certificate
# (requires the cmctl CLI; the issue-temporary-certificate annotation does NOT trigger a renewal)
cmctl renew myapp-tls -n production

# OR: delete and recreate the Certificate object from your manifest
# Note: by default the Secret survives deletion, so cert-manager may reuse it instead of re-issuing
kubectl delete certificate myapp-tls -n production
kubectl apply -f kubernetes/production/certificate-myapp.yaml

# 3. Watch the new Order and Challenge
kubectl get challenge -n production -w

# 4. Check cert-manager controller logs in real time
kubectl logs -n cert-manager -l app=cert-manager -f | grep -i "example.com"

# 5. Verify certificate issued
kubectl get certificate myapp-tls -n production -o wide

Verify: Certificate shows Ready=True and Not After is at least 60 days in the future.

Runbook: cert-manager-troubleshooting.md
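The tree notes that cert-manager should auto-renew at 2/3 of the certificate lifetime. That point in time can be computed from the validity window, which helps decide whether a renewal is merely pending or already overdue. A minimal sketch with hypothetical function names, assuming the default renewal timing and no custom renewBefore on the Certificate.

```python
from datetime import datetime, timedelta, timezone


def expected_renewal_time(not_before: datetime, not_after: datetime) -> datetime:
    """Point at which renewal is expected: 2/3 of the way through the lifetime."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


def renewal_overdue(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """True when the expected renewal time has passed but the cert is unchanged."""
    return now > expected_renewal_time(not_before, not_after)
```

For a 90-day certificate this lands 60 days after issuance, so a cert with fewer than 30 days left that has not been replaced is a sign that renewal is stuck.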

✅ Action: Add ACM DNS Validation CNAME

Do:

# 1. Get the CNAME name and value from ACM
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abc123 \
  --query 'Certificate.DomainValidationOptions[*].ResourceRecord'

# 2. Add the CNAME to Route53 (or your DNS provider)
# Use the Name and Value returned in step 1. ACM validation records look like
# _<token>.example.com -> _<token>.<suffix>.acm-validations.aws.
# (they are not _acme-challenge records, which belong to ACME DNS-01)
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "NAME_FROM_STEP_1",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "VALUE_FROM_STEP_1"}]
      }
    }]
  }'

# 3. Wait for DNS propagation (up to 5 minutes for low TTL)
dig NAME_FROM_STEP_1 CNAME +short

# 4. Check ACM console — status should change to "Issued" within 30 minutes
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abc123 \
  --query 'Certificate.Status'

Verify: ACM console shows Status: ISSUED. This CNAME persists and enables future auto-renewals — do not remove it.
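Copying the validation record by hand is error-prone, so the step 1 output can be turned into the step 2 change batch programmatically. This is a sketch under assumptions: the function name is hypothetical, the input is the parsed JSON list printed by the `describe-certificate` query above, and it uses UPSERT rather than CREATE so that re-running it is harmless.

```python
def build_change_batch(resource_records, ttl=300):
    """Build a Route53 change batch from ACM ResourceRecord entries.

    Duplicate names are collapsed, since several domains on one cert
    can share the same validation record.
    """
    seen = set()
    changes = []
    for rr in resource_records:
        if not rr or rr["Name"] in seen:
            continue  # skip empty entries and repeats (defensive)
        seen.add(rr["Name"])
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": rr["Name"],
                "Type": rr["Type"],
                "TTL": ttl,
                "ResourceRecords": [{"Value": rr["Value"]}],
            },
        })
    return {"Changes": changes}
```

Serializing the result with `json.dumps` gives a string suitable for the `--change-batch` argument.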

✅ Action: Manual Renewal for Self-Signed or Internal CA

Do:

# 1. Generate new private key and CSR
openssl genrsa -out new.key 4096
openssl req -new -key new.key -out new.csr \
  -subj "/CN=example.com/O=MyOrg/C=US" \
  -addext "subjectAltName=DNS:example.com,DNS:www.example.com"

# 2a. Self-signed: sign with your own key
openssl x509 -req -days 365 -in new.csr -signkey new.key -out new.crt \
  -extfile <(printf "subjectAltName=DNS:example.com,DNS:www.example.com")

# 2b. Internal CA: send CSR to CA
# Submit new.csr to your internal CA portal or team

# 3. Update Kubernetes Secret
kubectl create secret tls myapp-tls \
  --cert=new.crt --key=new.key \
  -n production \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Verify expiry
openssl x509 -noout -dates -in new.crt

# 5. Restart pods to load new cert
kubectl rollout restart deployment/myapp -n production

Verify: New cert has the expected validity (365 days in this example), all required SANs are present, and pods are serving the new cert.

✅ Action: Add to Certificate Inventory

Do:

# Add to your cert tracking spreadsheet / database / wiki
# Minimum fields:
# - Domain(s) / SANs
# - Issuer type (Let's Encrypt, ACM, self-signed, CA vendor)
# - Expiry date
# - Owner / team
# - Auto-renewal: yes/no + mechanism
# - Kubernetes namespace and Secret name if applicable
# - Last renewed date
# - Alert: configured at 30 days before expiry

# Set up a monitoring alert for this cert
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry-myapp
  namespace: monitoring
spec:
  groups:
  - name: cert-expiry
    rules:
    - alert: CertExpiringIn30Days
      expr: probe_ssl_earliest_cert_expiry{job="blackbox",instance="https://example.com"} - time() < 30 * 24 * 3600
      labels:
        severity: warning
      annotations:
        summary: "TLS cert for example.com expires in < 30 days"
EOF

Verify: Certificate appears in inventory with correct expiry date and owner. Alert fires when tested with a cert that has an artificially short validity.
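If the inventory lives in code rather than a spreadsheet, the minimum fields listed above map onto a small record type. The class and field names below are hypothetical, and `should_alert` simply mirrors the 30-day threshold used in the PrometheusRule expression.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class CertRecord:
    """One row of a certificate inventory (field names are illustrative)."""
    domains: List[str]                 # domain(s) / SANs
    issuer_type: str                   # Let's Encrypt, ACM, self-signed, CA vendor
    not_after: datetime                # expiry date (timezone-aware)
    owner: str                         # owner / team
    auto_renewal: bool
    renewal_mechanism: str = ""        # e.g. cert-manager, ACM managed renewal
    namespace: str = ""                # Kubernetes namespace, if applicable
    secret_name: str = ""              # Kubernetes Secret name, if applicable
    last_renewed: Optional[datetime] = None
    alert_days: int = 30               # alert this many days before expiry

    def should_alert(self, now: datetime) -> bool:
        """True once the remaining lifetime drops below the alert threshold."""
        return self.not_after - now < timedelta(days=self.alert_days)
```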

⚠️ Warning: OV/EV Certificate Lead Time

When: The expiring cert is an OV (Organization Validated) or EV (Extended Validation) certificate, which requires manual verification by the CA.

Risk: OV certs take 1–7 business days; EV certs can take up to 14 business days. If you wait until < 7 days before expiry, you will have a gap.

Mitigation: Initiate OV/EV renewal at the 30-day mark, not the 7-day mark. If < 7 days remain, request emergency CA expediting and simultaneously deploy a Let's Encrypt cert as a stopgap (if DV is acceptable for your trust requirements).
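Because the lead times are quoted in business days, the latest safe start date is earlier than a simple calendar subtraction suggests. A rough sketch with a hypothetical function name; it skips Saturdays and Sundays only and ignores public holidays, which would push the date earlier still.

```python
from datetime import date, timedelta


def latest_start_date(expiry: date, lead_business_days: int) -> date:
    """Walk back from the expiry date by the given number of business days."""
    current = expiry
    remaining = lead_business_days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:      # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

With the worst-case figures above (7 business days for OV, 14 for EV), this gives the last day on which a request can be filed without risking a gap; starting at the 30-day mark leaves margin beyond it.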

Edge Cases

CDN-terminated TLS: When a CDN (Cloudflare, Fastly, Akamai) terminates TLS, the CDN's certificate is what users see. The origin certificate (between CDN and your servers) may be different and renew on a separate schedule. Monitor both.
Mutual TLS (mTLS) client certificates: Client certificates used for mTLS (service-to-service auth) expire just like server certs but are harder to detect because they don't serve browser-visible errors. Add client certs to your inventory.
Certificate pinning: Some mobile apps and older services pin the certificate or its public key hash. Renewing the cert can break pinned connections when the pinned certificate or key changes. Coordinate with mobile release cycles before renewing pinned certs.
Let's Encrypt rate limits during testing: Let's Encrypt's production ACME endpoint has rate limits (50 certificates per registered domain per week). When testing cert-manager configuration changes, use the staging ACME endpoint (letsencrypt-staging) to avoid burning production rate limits.
Wildcard renewal requires DNS-01: Let's Encrypt wildcard certs (*.example.com) can only be validated via DNS-01. HTTP-01 does not work for wildcards. Ensure your cert-manager ClusterIssuer uses a DNS solver for wildcard certs.

Cross-References

Topic Packs: Certificates, security, cert-manager
Runbooks: cert-emergency-renewal.md, cert-manager-troubleshooting.md
Related trees: should-i-page.md, config-change.md
