Ops Archaeology: The Certificate That Works Sometimes
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L3
Estimated time: 40 min
Domains: TLS/PKI, Cert-Manager, Ingress, API Clients
Artifact 1: CLI Output
$ kubectl get certificate -n api
NAME           READY   SECRET    AGE
api-tls-cert   True    api-tls   12d
$ kubectl get certificaterequest -n api
NAME                 APPROVED   DENIED   READY   ISSUER             REQUESTOR                              AGE
api-tls-cert-h8k2m   True                True    letsencrypt-prod   cert-manager/cert-manager-controller   12d
$ kubectl describe certificate api-tls-cert -n api | grep -A3 "Status:"
Status:
  Conditions:
    Last Transition Time:  2024-12-06T03:14:22Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready
$ openssl s_client -connect api.megacorp.io:443 -servername api.megacorp.io 2>/dev/null | head -20
CONNECTED(00000003)
depth=0 CN = api.megacorp.io
verify error:num=21:unable to verify the first certificate
---
Certificate chain
 0 s:CN = api.megacorp.io
   i:C = US, O = Let's Encrypt, CN = R11
---
$ curl -v https://api.megacorp.io/health 2>&1 | grep -E "(SSL|HTTP|issuer)"
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* Server certificate:
* subject: CN=api.megacorp.io
* issuer: C=US; O=Let's Encrypt; CN=R11
< HTTP/2 200
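When comparing what different endpoints present, it can help to reduce `openssl s_client` output to (depth, subject, issuer) tuples. The following is a minimal sketch in Python. `parse_chain` and `SAMPLE` are hypothetical names introduced here, and the parser assumes only the `N s:` / `i:` line layout seen in the output above.

```python
import re
from typing import List, Tuple

# Matches subject lines such as " 0 s:CN = api.megacorp.io".
_SUBJECT = re.compile(r"^\s*(\d+)\s+s:(.*)$")
# Matches issuer lines such as "   i:C = US, O = Let's Encrypt, CN = R11".
_ISSUER = re.compile(r"^\s*i:(.*)$")


def parse_chain(s_client_output: str) -> List[Tuple[int, str, str]]:
    """Return (depth, subject, issuer) for each entry in the 'Certificate chain' section."""
    chain: List[Tuple[int, str, str]] = []
    pending = None      # (depth, subject) waiting for its issuer line
    in_chain = False
    for line in s_client_output.splitlines():
        if line.strip() == "Certificate chain":
            in_chain = True
            continue
        if not in_chain:
            continue
        if line.strip() == "---":
            break       # the section ends at the next separator
        subject = _SUBJECT.match(line)
        if subject:
            pending = (int(subject.group(1)), subject.group(2).strip())
            continue
        issuer = _ISSUER.match(line)
        if issuer and pending is not None:
            chain.append((pending[0], pending[1], issuer.group(1).strip()))
            pending = None
    return chain


# Sample text copied from the s_client output in this artifact.
SAMPLE = """CONNECTED(00000003)
depth=0 CN = api.megacorp.io
verify error:num=21:unable to verify the first certificate
---
Certificate chain
 0 s:CN = api.megacorp.io
   i:C = US, O = Let's Encrypt, CN = R11
---
"""

for depth, subject, issuer in parse_chain(SAMPLE):
    print(f"depth={depth} subject={subject!r} issuer={issuer!r}")
```

Running the same parser over output captured from several connections makes it easy to diff the presented chains rather than eyeballing them.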
Artifact 2: Metrics
# Nginx ingress controller metrics (last 24h)
nginx_ingress_controller_requests{status="200",host="api.megacorp.io"} 284719
nginx_ingress_controller_requests{status="502",host="api.megacorp.io"} 0
# cert-manager metrics
certmanager_certificate_ready_status{name="api-tls-cert",namespace="api",condition="True"} 1
certmanager_certificate_expiration_timestamp_seconds{name="api-tls-cert"} 1741219200
# Application error rate from API clients (reported by client teams)
# Python (requests library): 14% failure rate - "SSLError: certificate verify failed"
# Go (net/http): 14% failure rate - "x509: certificate signed by unknown authority"
# Browser (Chrome 120): 0% failure rate
# Browser (Firefox 121): 0% failure rate
# curl (8.4.0, macOS): 0% failure rate
# Java (HttpClient, JDK 17): 14% failure rate
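The `certmanager_certificate_expiration_timestamp_seconds` value is a Unix timestamp, which is hard to read at a glance. A small Python sketch for converting it follows. The helper names are hypothetical, and the choice of "now" (the newest log timestamp in Artifact 4) is an assumption made for illustration.

```python
from datetime import datetime, timezone


def expiry_utc(epoch_seconds: int) -> datetime:
    """Convert a Unix timestamp, as exported by the metric, to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def days_remaining(epoch_seconds: int, now: datetime) -> int:
    """Whole days between `now` and the expiry; negative once the expiry has passed."""
    return (expiry_utc(epoch_seconds) - now).days


# Value taken from the cert-manager metric above.
not_after = expiry_utc(1741219200)

# Assumed reference time: the newest log line in Artifact 4.
observed_at = datetime(2024, 12, 18, 9, 14, 22, tzinfo=timezone.utc)

print(not_after.isoformat())                    # 2025-03-06T00:00:00+00:00
print(days_remaining(1741219200, observed_at))  # 77
```

The decoded date falls roughly 90 days after the Last Transition Time shown in Artifact 1, so the metric and the `kubectl describe` output can be cross-checked against each other.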
Artifact 3: Infrastructure Code
# From: cert-manager/cluster-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@megacorp.io
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
---
# From: ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: api
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - api.megacorp.io
    secretName: api-tls
  rules:
  - host: api.megacorp.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
Artifact 4: Log Lines
[2024-12-18T09:14:22Z] partner-integration/java | javax.net.ssl.SSLHandshakeException: PKIX path building failed: unable to find valid certification path to requested target
[2024-12-18T09:14:18Z] cert-manager | I1218 09:14:18.123456 controller.go:248] cert-manager/certificates-readiness "msg"="re-queuing item due to optimistic locking" "key"="api/api-tls-cert"
[2024-12-18T09:12:44Z] nginx-ingress | 52.14.88.210 - - "GET /api/v1/orders HTTP/2" 200 1842 "-" "python-requests/2.31.0" 284 0.012 [api-api-service-8080] [] 10.244.3.18:8080 1842 0.011 200
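To slice access-log lines like the nginx-ingress entry above by client or upstream pod, a regex-based parser is one option. The sketch below is in Python and is not a definitive implementation. The field names are labels chosen here, on the assumption that the line follows the controller's default access-log layout with the timestamp moved into the prefix. `parse_access_line` and `LINE` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Assumed layout: addr - user "request" status bytes "referer" "user-agent"
# request_length request_time [upstream_name] [alt_upstream] upstream_addr
# upstream_response_length upstream_response_time upstream_status
ACCESS_LINE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<request_length>\d+) (?P<request_time>[\d.]+) '
    r'\[(?P<upstream_name>[^\]]*)\] \[(?P<alt_upstream>[^\]]*)\] '
    r'(?P<upstream_addr>\S+) (?P<upstream_length>\d+) '
    r'(?P<upstream_time>[\d.]+) (?P<upstream_status>\d{3})'
)


def parse_access_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one access-log line, or None if it does not match."""
    match = ACCESS_LINE.search(line)
    return match.groupdict() if match else None


# The nginx-ingress line from this artifact, copied verbatim.
LINE = (
    '[2024-12-18T09:12:44Z] nginx-ingress | 52.14.88.210 - - '
    '"GET /api/v1/orders HTTP/2" 200 1842 "-" "python-requests/2.31.0" '
    '284 0.012 [api-api-service-8080] [] 10.244.3.18:8080 1842 0.011 200'
)

fields = parse_access_line(LINE)
if fields is not None:
    print(fields["user_agent"], fields["status"], fields["upstream_addr"])
```

An access log records only requests that reached the HTTP layer, so counts derived this way are best compared against the client-reported failure rates in Artifact 2 rather than treated as the full picture.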
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?