It Was Always DNS¶
Category: The Mystery · Domains: dns, networking · Read time: ~5 min
Setting the Scene¶
It was a Tuesday morning in November, and our e-commerce platform was bleeding money. Intermittent 502 errors had been showing up in Datadog for three days -- roughly 4% of requests, no obvious pattern. We had 23 microservices on Kubernetes, a CloudFront CDN in front, an ALB doing TLS termination, and about forty engineers pointing fingers at each other.
I was the on-call SRE, running on my third consecutive night of bad sleep, absolutely convinced the problem was in the application layer.
What Happened¶
The first day, we blamed the app. The checkout service was throwing `ECONNREFUSED` errors when calling the inventory service. We checked connection pools, thread counts, memory. Everything looked healthy. We restarted pods. The errors continued.
Day two, we pivoted to the load balancer. Maybe the ALB health checks were misconfigured. We spent six hours comparing target group settings, draining connections, switching from round-robin to least-connections. I personally reviewed every listener rule. The error rate didn't budge.
By Wednesday afternoon, someone suggested the CDN. "Maybe CloudFront is caching stale origins." We invalidated everything, disabled caching entirely for an hour. Still 502s. We even swapped the CDN out completely using a DNS failover and routed traffic directly to the ALB. The errors persisted -- but something changed. The pattern shifted slightly. I didn't notice at the time.
A junior engineer asked in Slack: "Has anyone checked DNS?" I dismissed it. We had CoreDNS running in the cluster, it had been stable for months, dashboards showed no errors. I literally typed "it's not DNS" in the channel.
Thursday morning, I was staring at tcpdump output on a pod (`tcpdump -i eth0 port 53 -w /tmp/dns.pcap`) when I noticed something odd. Some DNS queries were going to an external resolver instead of CoreDNS. Not all of them -- maybe one in twenty.
I dug into the pod's `/etc/resolv.conf` and found `ndots:5` (the Kubernetes default). The inventory service was being called by a name with four dots -- one short of the `ndots:5` threshold -- so every lookup walked the search-domain list before trying the name as written. Some lookups resolved through the internal search path; others fell through to the upstream resolver, which knew nothing about our internal services.
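For context, a pod's `/etc/resolv.conf` under the Kubernetes defaults looks roughly like this -- the search domains follow the pod's namespace (`prod` here, matching our cluster), and the nameserver IP is illustrative:

```
search prod.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10   # cluster DNS (CoreDNS) -- address varies per cluster
options ndots:5
```

Any name with fewer than five dots gets each search domain appended and tried in order before the literal name is queried as-is.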
The Moment of Truth¶
I ran `nslookup inventory-svc.prod.svc.cluster.local` from inside a pod -- instant response. Then `nslookup inventory-svc.prod.payments` -- it worked, but took 3 seconds: the query walked the search path, fell through to the external resolver, and only then came back with an answer. That 3-second delay was enough to trigger the ALB's 5-second timeout on roughly 4% of requests, depending on how the dots lined up with the search domains.
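The resolver's decision can be sketched in a few lines. This is a simplified model of glibc-style search-list behavior (the function name is hypothetical; it ignores negative caching, retries, and per-family queries), but it shows why the dot count decides the query order:

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the order in which a glibc-style resolver tries queries.

    A trailing dot marks the name absolute: no search expansion at all.
    Otherwise, names with fewer than `ndots` dots walk the search list
    first and try the literal name last -- which is where slow upstream
    fallthroughs come from.
    """
    if name.endswith("."):  # absolute name, queried verbatim
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:  # "dotty" enough: literal name first
        return [name] + expanded
    return expanded + [name]      # search path first, literal name last
```

With our search domains and `ndots:5`, `inventory-svc.prod.payments` (two dots) tries three search-domain expansions before the literal name ever reaches a resolver -- and a trailing dot skips the whole dance.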
The fix was two lines in our Helm chart: setting `ndots:2` in the pod's `dnsConfig` and using FQDNs with trailing dots in service calls.
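In the rendered pod spec, that override looks something like this -- the Deployment name is illustrative, not from our chart:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout          # illustrative workload name
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"    # names with 2+ dots skip the search path
```

With `ndots:2`, any dotted service name is tried as written first, so fully qualified internal names never detour through the search list or the upstream resolver.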
The Aftermath¶
We deployed the fix at 11:42 AM Thursday. By noon, the 502 rate dropped to zero and stayed there. Three days of investigation, four postmortems scheduled, and the answer was a single integer in a DNS config. I went back to that Slack thread and edited my message to: "it was DNS. it's always DNS."
The Lessons¶
- **Check DNS first, every time:** Before you blame the app, the LB, or the CDN, run `dig` and `nslookup` from the actual source. It takes 30 seconds.
- **`ndots` in Kubernetes is a trap:** The default `ndots:5` causes unexpected external lookups for dotted service names. Set it explicitly in your pod spec.
- **Listen to the junior engineer:** The person with the "obvious" question is often the one who hasn't been blinded by your assumptions.
What I'd Do Differently¶
Add a DNS resolution check to our standard troubleshooting runbook as step one, not step forty-seven. Set up alerts on CoreDNS upstream query rates. And the moment someone asks "have we checked DNS?" the answer should always be to actually go check, not wave it off.
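Step one of that runbook can be as small as a timed lookup. A minimal sketch (the helper name and threshold are hypothetical, not from our actual runbook) -- a slow-but-successful answer is the classic search-path smell:

```python
import socket
import time

def timed_lookup(host, warn_after=1.0):
    """Resolve `host`, returning (addresses, seconds elapsed).

    Resolution that succeeds but takes longer than `warn_after` seconds
    suggests the name is walking search domains or falling through to a
    slow upstream resolver -- check ndots and /etc/resolv.conf.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(host, None)
    elapsed = time.monotonic() - start
    addrs = sorted({info[4][0] for info in infos})
    if elapsed > warn_after:
        print(f"SLOW: {host} resolved in {elapsed:.2f}s -- check ndots/search path")
    return addrs, elapsed
```

Run it from inside the pod that is actually making the calls; resolution behavior differs between your laptop, the node, and the pod's own `resolv.conf`.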
The Quote¶
"Three days, four teams, two all-hands, and one junior engineer who asked the question I was too proud to ask myself."
Cross-References¶
- Topic Packs: DNS Deep Dive, DNS Operations, Networking Troubleshooting, K8s Networking
- Case Studies: CoreDNS Timeout, Pod DNS