Solution

Triage

  1. Check CoreDNS pod count and resource usage:
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl top pods -n kube-system -l k8s-app=kube-dns
    
  2. Inspect a pod's DNS configuration:
    kubectl exec <pod-name> -- cat /etc/resolv.conf
    
  3. Check CoreDNS logs for errors:
    kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
    
  4. Test DNS from inside a pod:
    kubectl exec <pod-name> -- nslookup api.stripe.com
    kubectl exec <pod-name> -- nslookup kubernetes.default.svc.cluster.local
    
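The triage steps above can be batched from a throwaway debug pod; repeated multi-second lookup times point at DNS rather than the network. The image is the one used in the upstream Kubernetes DNS-debugging docs; the pod name is illustrative:

```shell
# Launch an ephemeral debug pod and time several lookups in a row.
# Consistent latency near your resolver timeout (default 5s) means
# queries are being dropped, not merely slow.
kubectl run dns-debug --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- sh -c 'for i in 1 2 3 4 5; do time nslookup api.stripe.com >/dev/null; done'
```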

Root Cause

Two compounding issues:

  1. Undersized CoreDNS. Only 2 replicas serve 200+ pods. Under peak load, the CoreDNS pods cannot process queries fast enough, leading to timeouts.

  2. ndots:5 query amplification. The default Kubernetes resolv.conf sets ndots:5, meaning any name with fewer than 5 dots is treated as a relative name and tried against each search domain first. For a lookup of api.stripe.com (2 dots), the resolver tries:

    • api.stripe.com.prod.svc.cluster.local
    • api.stripe.com.svc.cluster.local
    • api.stripe.com.cluster.local
    • api.stripe.com (finally, the absolute name)

Each attempt generates both an A and an AAAA query, so one application DNS lookup becomes 8 queries to CoreDNS.
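The amplification above can be reproduced with a small shell sketch that mimics the resolver's search-list logic (the search domains shown are the Kubernetes defaults for a pod in the prod namespace):

```shell
#!/bin/sh
# Names with fewer than $ndots dots are tried against every search domain
# before the literal name itself; each candidate is queried for A and AAAA.
name="api.stripe.com"
ndots=5
search="prod.svc.cluster.local svc.cluster.local cluster.local"

dots=$(echo "$name" | awk -F. '{print NF-1}')
candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for d in $search; do candidates="$candidates $name.$d"; done
fi
candidates="$candidates $name"

queries=0
for c in $candidates; do
  for rr in A AAAA; do          # glibc asks for both record types
    queries=$((queries + 1))
  done
done
echo "$queries queries for one lookup of $name"   # 4 names x 2 types = 8
```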

Fix

Immediate (reduce load):

  1. Scale CoreDNS:
    kubectl scale deployment coredns -n kube-system --replicas=5
    
  2. Increase CoreDNS resource limits:
    kubectl patch deployment coredns -n kube-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"200m","memory":"128Mi"}}}]}}}}'
    
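Both immediate changes can be sanity-checked right away (deployment and Service names above are the kubeadm defaults):

```shell
# Confirm the new replicas are Ready and registered behind the kube-dns
# Service before assuming the query load is actually spread out.
kubectl rollout status deployment/coredns -n kube-system --timeout=120s
kubectl get endpoints kube-dns -n kube-system
```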

Medium-term (reduce query volume):

  1. For pods making heavy external DNS calls, set dnsConfig to reduce ndots:

    spec:
      dnsConfig:
        options:
        - name: ndots
          value: "2"
    

  2. Use trailing dots in application configuration for known external hosts:

    API_HOST=api.stripe.com.
    

Long-term (architectural):

  1. Deploy NodeLocal DNSCache. The upstream manifest ships with __PILLAR__* placeholders, so download it and substitute your cluster's values (kube-dns service IP, cluster domain, and a link-local listen address) before applying; in iptables mode:

    wget -O nodelocaldns.yaml https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
    kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
    sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/,__PILLAR__DNS__SERVER__//g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
    kubectl apply -f nodelocaldns.yaml
    
    This runs a caching DNS agent on every node, dramatically reducing traffic to CoreDNS.

  2. Set up CoreDNS autoscaling with the cluster-proportional-autoscaler.
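A sketch of the tuning ConfigMap the autoscaler reads, following the resource names from the upstream DNS-autoscaling guide; the cluster-proportional-autoscaler deployment itself must be installed separately, and the parameter values here are assumptions to adjust for your cluster:

```shell
# In linear mode, replicas = max(ceil(cores/coresPerReplica),
#                                ceil(nodes/nodesPerReplica)), floored at min.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":2}'
EOF
```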

Rollback / Safety

  • Scaling CoreDNS up is safe and immediate. Scaling down should be done gradually while monitoring error rates.
  • Changing ndots affects how internal service names are resolved. Test thoroughly: my-service (no dots) still needs search domains.
  • NodeLocal DNSCache uses iptables rules to redirect DNS traffic. Test in staging first.

Common Traps

  • Assuming DNS issues are network issues. What presents as packet loss or intermittent connection slowness is often DNS timeouts. Always check DNS first.
  • Setting ndots to 1. This breaks multi-segment service names like my-service.other-namespace. Use ndots:2 as a safe minimum.
  • Ignoring AAAA queries. Even if your cluster is IPv4-only, glibc sends AAAA queries by default. This doubles the DNS load. Consider single-request-reopen in dnsConfig options.
  • Not monitoring CoreDNS. Enable the Prometheus plugin in the Corefile and set up alerts on query latency and error rates.
  • Enabling the log plugin in CoreDNS production. This logs every query and can itself cause performance degradation.
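For reference, the metrics hook mentioned above is a one-line entry in the server block of the coredns ConfigMap. A sketch of a typical kubeadm-style Corefile (your plugin set may differ):

```
.:53 {
    errors
    health
    prometheus :9153   # expose metrics for scraping on every CoreDNS pod
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
```

Key series to alert on include coredns_dns_request_duration_seconds and coredns_dns_responses_total (watch the SERVFAIL rcode).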