Skip to content

Symptoms

  • Application pods intermittently fail to resolve DNS names, both internal services and external domains.
  • Error logs from the application show: dial tcp: lookup api.stripe.com: i/o timeout.
  • The failures are sporadic -- roughly 10-15% of DNS lookups fail.
  • The issue worsens during peak traffic hours (9am-11am).
  • CoreDNS pods show elevated CPU usage and request queuing.
  • The cluster has 200+ pods across 8 nodes and only 2 CoreDNS replicas.
  • Pod resolv.conf shows ndots:5 which is the Kubernetes default.