Skip to content

Ops Archaeology: The 5% That Can't Resolve

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L2 Estimated time: 25 min Domains: Kubernetes, CoreDNS, Networking


Artifact 1: CLI Output

$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-4g7h2   1/1     Running   0          21d
coredns-5d78c9869d-9k3m5   1/1     Running   0          21d

$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   180d

$ kubectl exec -n default debug-pod -- nslookup payment-api.payments.svc.cluster.local
Server:     10.96.0.10
Address:    10.96.0.10#53

Name:   payment-api.payments.svc.cluster.local
Address: 10.96.88.142

$ kubectl exec -n default debug-pod -- nslookup payment-api.payments
Server:     10.96.0.10
Address:    10.96.0.10#53

** server can't find payment-api.payments: NXDOMAIN

$ kubectl exec -n default debug-pod -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:15

Artifact 2: Metrics

# CoreDNS metrics
coredns_dns_requests_total{type="A",zone="cluster.local."} 4829147
coredns_dns_responses_total{rcode="NOERROR"} 4587291
coredns_dns_responses_total{rcode="NXDOMAIN"} 241856

# NXDOMAIN rate: ~5% of all queries

# Cache hit ratio
coredns_cache_hits_total{type="success"} 3891024
coredns_cache_misses_total 938123

# Upstream DNS queries (to VPC resolver)
coredns_forward_requests_total{to="169.254.169.253:53"} 482914

Artifact 3: Infrastructure Code

# From: helm/app-values.yaml (for the payment-api deployment)
apiVersion: v1
kind: Pod
metadata:
  name: payment-worker
  namespace: payments
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "15"
  containers:
    - name: worker
      image: registry.corp.io/payment-worker:v4.2.1
      env:
        - name: PAYMENT_API_URL
          value: "http://payment-api.payments:8080/api/v1/charge"
        - name: INVENTORY_API_URL
          value: "http://inventory-api.inventory:8080/api/v1/stock"

Artifact 4: Log Lines

[2024-12-15T08:44:12Z] payment-worker | ERROR DNS resolution failed for payment-api.payments: NXDOMAIN
[2024-12-15T08:44:12Z] coredns        | [INFO] 10.244.2.41:38294 - 29381 "A IN payment-api.payments.default.svc.cluster.local. udp 71 false 512" NXDOMAIN qr,aa,rd 164 0.000221s
[2024-12-15T08:44:11Z] payment-worker | INFO  Successfully charged order ord-88291 via payment-api.payments.svc.cluster.local

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?