Ops Archaeology: The 5% That Can't Resolve¶
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L2 Estimated time: 25 min Domains: Kubernetes, CoreDNS, Networking
Artifact 1: CLI Output¶
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-5d78c9869d-4g7h2 1/1 Running 0 21d
coredns-5d78c9869d-9k3m5 1/1 Running 0 21d
$ kubectl get svc -n kube-system kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 180d
$ kubectl exec -n default debug-pod -- nslookup payment-api.payments.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: payment-api.payments.svc.cluster.local
Address: 10.96.88.142
$ kubectl exec -n default debug-pod -- nslookup payment-api.payments
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find payment-api.payments: NXDOMAIN
$ kubectl exec -n default debug-pod -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:15
Artifact 2: Metrics¶
# CoreDNS metrics
coredns_dns_requests_total{type="A",zone="cluster.local."} 4829147
coredns_dns_responses_total{rcode="NOERROR"} 4587291
coredns_dns_responses_total{rcode="NXDOMAIN"} 241856
# NXDOMAIN rate: ~5% of all queries
# Cache hit ratio
coredns_cache_hits_total{type="success"} 3891024
coredns_cache_misses_total 938123
# Upstream DNS queries (to VPC resolver)
coredns_forward_requests_total{to="169.254.169.253:53"} 482914
Artifact 3: Infrastructure Code¶
# From: helm/app-values.yaml (for the payment-api deployment)
apiVersion: v1
kind: Pod
metadata:
name: payment-worker
namespace: payments
spec:
dnsConfig:
options:
- name: ndots
value: "15"
containers:
- name: worker
image: registry.corp.io/payment-worker:v4.2.1
env:
- name: PAYMENT_API_URL
value: "http://payment-api.payments:8080/api/v1/charge"
- name: INVENTORY_API_URL
value: "http://inventory-api.inventory:8080/api/v1/stock"
Artifact 4: Log Lines¶
[2024-12-15T08:44:12Z] payment-worker | ERROR DNS resolution failed for payment-api.payments: NXDOMAIN
[2024-12-15T08:44:12Z] coredns | [INFO] 10.244.2.41:38294 - 29381 "A IN payment-api.payments.default.svc.cluster.local. udp 71 false 512" NXDOMAIN qr,aa,rd 164 0.000221s
[2024-12-15T08:44:11Z] payment-worker | INFO Successfully charged order ord-88291 via payment-api.payments.svc.cluster.local
Your Mission¶
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?