Solution¶
Triage¶
- Check CoreDNS pod count and resource usage:
- Inspect a pod's DNS configuration:
- Check CoreDNS logs for errors:
- Test DNS from inside a pod:
Root Cause¶
Two compounding issues:
-
Undersized CoreDNS. Only 2 replicas serve 200+ pods. Under peak load, the CoreDNS pods cannot process queries fast enough, leading to timeouts.
-
ndots:5 query amplification. The default Kubernetes
resolv.confsetsndots:5, meaning any name with fewer than 5 dots is treated as a relative name. For a lookup ofapi.stripe.com(2 dots), the resolver tries: api.stripe.com.prod.svc.cluster.localapi.stripe.com.svc.cluster.localapi.stripe.com.cluster.localapi.stripe.com(finally, the actual FQDN)
Each attempt generates both A and AAAA queries. One application DNS lookup becomes 8 queries to CoreDNS.
Fix¶
Immediate (reduce load):
- Scale CoreDNS:
- Increase CoreDNS resource limits:
Medium-term (reduce query volume):
-
For pods making heavy external DNS calls, set
dnsConfigto reduce ndots: -
Use trailing dots in application configuration for known external hosts:
Long-term (architectural):
-
Deploy NodeLocal DNSCache:
This runs a caching DNS agent on every node, dramatically reducing traffic to CoreDNS. -
Set up CoreDNS autoscaling with the
cluster-proportional-autoscaler.
Rollback / Safety¶
- Scaling CoreDNS up is safe and immediate. Scaling down should be done gradually while monitoring error rates.
- Changing ndots affects how internal service names are resolved. Test thoroughly:
my-service(no dots) still needs search domains. - NodeLocal DNSCache uses iptables rules to redirect DNS traffic. Test in staging first.
Common Traps¶
- Assuming DNS issues are network issues. Packet loss symptoms can be DNS timeout symptoms. Always check DNS first.
- Setting ndots to 1. This breaks multi-segment service names like
my-service.other-namespace. Use ndots:2 as a safe minimum. - Ignoring AAAA queries. Even if your cluster is IPv4-only, glibc sends AAAA queries by default. This doubles the DNS load. Consider
single-request-reopenin dnsConfig options. - Not monitoring CoreDNS. Enable the Prometheus plugin in the Corefile and set up alerts on query latency and error rates.
- Enabling the
logplugin in CoreDNS production. This logs every query and can itself cause performance degradation.