Symptoms¶
- Application pods intermittently fail to resolve DNS names, both internal services and external domains.
- Error logs from the application show:
dial tcp: lookup api.stripe.com: i/o timeout. - The failures are sporadic -- roughly 10-15% of DNS lookups fail.
- The issue worsens during peak traffic hours (9am-11am).
- CoreDNS pods show elevated CPU usage and request queuing.
- The cluster has 200+ pods across 8 nodes and only 2 CoreDNS replicas.
- Pod resolv.conf shows
ndots:5which is the Kubernetes default.