Ops Archaeology: The Certificate That Works Sometimes

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L3
Estimated time: 40 min
Domains: TLS/PKI, cert-manager, Ingress, API Clients


Artifact 1: CLI Output

$ kubectl get certificate -n api
NAME            READY   SECRET          AGE
api-tls-cert    True    api-tls         12d

$ kubectl get certificaterequest -n api
NAME                  APPROVED   DENIED   READY   ISSUER             REQUESTOR                              AGE
api-tls-cert-h8k2m    True                True    letsencrypt-prod   cert-manager/cert-manager-controller   12d

$ kubectl describe certificate api-tls-cert -n api | grep -A6 "^Status:"
Status:
  Conditions:
    Last Transition Time:  2024-12-06T03:14:22Z
    Message:               Certificate is up to date and has not expired
    Reason:                Ready
    Status:                True
    Type:                  Ready

$ openssl s_client -connect api.megacorp.io:443 -servername api.megacorp.io 2>/dev/null | head -20
CONNECTED(00000003)
depth=0 CN = api.megacorp.io
verify error:num=21:unable to verify the first certificate
---
Certificate chain
 0 s:CN = api.megacorp.io
   i:C = US, O = Let's Encrypt, CN = R11
---

$ curl -v https://api.megacorp.io/health 2>&1 | grep -E "(SSL connection|Server certificate|subject:|issuer:|< HTTP)"
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* Server certificate:
*  subject: CN=api.megacorp.io
*  issuer: C=US; O=Let's Encrypt; CN=R11
< HTTP/2 200

Artifact 2: Metrics

# Nginx ingress controller metrics (last 24h)
nginx_ingress_controller_requests{status="200",host="api.megacorp.io"} 284719
nginx_ingress_controller_requests{status="502",host="api.megacorp.io"} 0

# cert-manager metrics
certmanager_certificate_ready_status{name="api-tls-cert",namespace="api",condition="True"} 1
certmanager_certificate_expiration_timestamp_seconds{name="api-tls-cert"} 1741219200

# Application error rate from API clients (reported by client teams)
# Python (requests library): 14% failure rate - "SSLError: certificate verify failed"
# Go (net/http): 14% failure rate - "x509: certificate signed by unknown authority"
# Browser (Chrome 120): 0% failure rate
# Browser (Firefox 121): 0% failure rate
# curl (8.4.0, macOS): 0% failure rate
# Java (HttpClient, JDK 17): 14% failure rate
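
The `certmanager_certificate_expiration_timestamp_seconds` value above is a Unix epoch in seconds, which is hard to read at a glance. Here is a minimal sketch for decoding it. It assumes "now" is the newest log timestamp in Artifact 4, and the helper name `describe_expiry` is hypothetical.

```python
from datetime import datetime, timezone


def describe_expiry(expiry_epoch: int, now: datetime) -> tuple[str, int]:
    """Return the expiry as an ISO-8601 UTC string and the whole days left."""
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return expiry.isoformat(), (expiry - now).days


# Value copied from the cert-manager metric above.
EXPIRY_EPOCH = 1741219200
# Assumption: "now" is the newest log timestamp in Artifact 4.
NOW = datetime(2024, 12, 18, 9, 14, 22, tzinfo=timezone.utc)

if __name__ == "__main__":
    iso, days_left = describe_expiry(EXPIRY_EPOCH, NOW)
    print(f"expires {iso}, {days_left} days left")
```

With these inputs the expiry decodes to 2025-03-06T00:00:00+00:00, with 77 whole days left. That is 90 days after the 2024-12-06 transition time shown in Artifact 1.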

Artifact 3: Infrastructure Code

# From: cert-manager/cluster-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@megacorp.io
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
---
# From: ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: api
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - api.megacorp.io
      secretName: api-tls
  rules:
    - host: api.megacorp.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080

Artifact 4: Log Lines

[2024-12-18T09:14:22Z] partner-integration/java | javax.net.ssl.SSLHandshakeException: PKIX path building failed: unable to find valid certification path to requested target
[2024-12-18T09:14:18Z] cert-manager            | I1218 09:14:18.123456 controller.go:248] cert-manager/certificates-readiness "msg"="re-queuing item due to optimistic locking" "key"="api/api-tls-cert"
[2024-12-18T09:12:44Z] nginx-ingress           | 52.14.88.210 - - "GET /api/v1/orders HTTP/2" 200 1842 "-" "python-requests/2.31.0" 284 0.012 [api-api-service-8080] [] 10.244.3.18:8080 1842 0.011 200

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?