---
title: Why DNS Is Always the Problem
tags:
  - lesson
  - dns-history
  - resolution-hierarchy
  - caching
  - dnssec
  - debugging
  - kubernetes-dns
---

# Why DNS Is Always the Problem

Topics: DNS history, resolution hierarchy, caching, DNSSEC, debugging, Kubernetes DNS
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
It's 2am. The page says "This site can't be reached — DNS_PROBE_FINISHED_NXDOMAIN." Your app, your database, your server — all fine. The problem is that nobody can find your server. DNS is broken, and nothing works until it's fixed.
"It's always DNS" isn't a joke. It's a statistical observation. DNS is involved in every network connection, it's cached at a dozen layers, and its failure modes are some of the most confusing in computing. This lesson traces DNS from its origin (a single text file) through the modern hierarchy, and teaches you to debug it systematically.
Before DNS: One File for the Entire Internet¶
Before 1983, every hostname-to-IP mapping on the entire internet was in a single file called HOSTS.TXT, maintained by Elizabeth "Jake" Feinler at Stanford Research Institute.
Every computer on the internet periodically downloaded this file via FTP. If you wanted a new hostname, you called SRI on the phone. By 1983, the internet had grown to several hundred hosts and the system was collapsing:
- Adding a host took days (phone call → manual edit → FTP propagation)
- The file was too large to distribute efficiently
- Name collisions were becoming common (no hierarchy, no namespaces)
- A single person was the bottleneck for the entire internet
Paul Mockapetris invented DNS (RFC 882/883, 1983) to replace this. The core insight: distribute the database hierarchically, cache aggressively, and delegate authority.
Trivia:
`/etc/hosts` on your Linux machine is a direct descendant of HOSTS.TXT. It still takes priority over DNS on most systems (controlled by `/etc/nsswitch.conf`). This pre-DNS relic is still actively exploited by malware to hijack domain resolution — and it's still the first thing to check when debugging DNS.
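You can see this precedence on any Linux box. A quick check (paths are the standard glibc locations; `app.example.com` is a stand-in for whatever name you're debugging):

```shell
# Does anything in /etc/hosts shadow the name you're debugging?
grep -i 'app.example.com' /etc/hosts || echo "no override in /etc/hosts"

# In what order does glibc consult sources? "files" means /etc/hosts wins.
grep '^hosts:' /etc/nsswitch.conf || echo "hosts: files dns   # glibc default"
```

If the first `grep` matches anything, your applications will use that IP no matter what DNS says.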
How DNS Actually Works¶
DNS is a hierarchical, distributed database. Resolving app.example.com walks the hierarchy
from the root:
```
Root (.)        → "Who handles .com?"
      ↓
.com (TLD)      → "Who handles example.com?"
      ↓
example.com     → "app.example.com is 203.0.113.50"
(Authoritative)
```
At most four queries resolve any name (stub → recursive resolver, then root, TLD, and authoritative). In practice, root and TLD answers are almost always cached (TTLs up to 48 hours), so most lookups need only 1–2 queries.
```shell
# Try it yourself right now — trace a real DNS resolution
dig +trace example.com | head -25
# You'll see: root servers → .com servers → example.com authoritative
# Each line shows who answered and what TTL they gave
```
The resolver chain¶
When your app calls getaddrinfo("app.example.com"), the request goes through:
```
Application
  → glibc resolver (reads /etc/resolv.conf)
  → Local cache (systemd-resolved, nscd, or none)
  → Recursive resolver (ISP, 8.8.8.8, 1.1.1.1)
  → Root servers → TLD servers → Authoritative servers
```
Each layer caches. Each layer can fail independently. Each layer can return a different answer.
```shell
# What resolver is the system using?
cat /etc/resolv.conf
# → nameserver 10.0.0.2
# → search example.com

# What does the system resolver return?
getent hosts app.example.com
# This goes through nsswitch.conf → /etc/hosts → DNS

# What does DNS directly return? (bypasses hosts file and cache)
dig +short app.example.com

# What does a specific resolver return?
dig +short app.example.com @8.8.8.8

# Trace the full resolution from root
dig +trace app.example.com
```
Gotcha:
`dig` and `nslookup` bypass `/etc/hosts` and go directly to DNS. If `/etc/hosts` has an entry for the hostname, your application uses that entry while `dig` shows the DNS answer. This causes "dig shows the right IP but the app connects to the wrong one" — check `/etc/hosts` and `/etc/nsswitch.conf` first.
TTL: The Root of All DNS Pain¶
Every DNS record has a TTL (Time To Live) — how long resolvers are allowed to cache it. This is the source of most DNS problems:
| TTL | Good for | Bad for |
|---|---|---|
| 60s (1 minute) | Fast failover, blue-green deploys | High query load on authoritative servers |
| 300s (5 minutes) | Balance of speed and load | Still slow for emergencies |
| 3600s (1 hour) | Low resolver load | Slow propagation after changes |
| 86400s (24 hours) | Minimal resolver load | Disaster for migrations |
The migration trap¶
You're moving servers. The old IP is 10.0.0.1, the new IP is 10.0.0.2. The DNS record
has TTL 86400 (24 hours). You update the record at 2pm. What happens?
```
2:00 PM — You update the DNS record to 10.0.0.2
2:01 PM — Some resolvers see the new IP (their cache expired)
6:00 PM — Most resolvers still returning 10.0.0.1 (cached from this morning)
2:00 AM — Some ISP resolvers STILL returning 10.0.0.1 (they cache aggressively)
2:00 PM — 24 hours later, most caches have expired. "Most."
```
Gotcha: Some ISP resolvers and corporate proxies ignore TTL and cache for their own duration. Even with TTL 60, some clients might see the old IP for hours. You can't control this — you can only minimize the impact by lowering TTL well in advance.
The correct migration pattern:
```
48 hours before migration:
    → Lower TTL from 86400 to 60
    → Wait 48 hours for old TTL entries to expire everywhere

At migration time:
    → Update the DNS record
    → Most resolvers pick up the new IP within 60 seconds

After migration stabilizes:
    → Raise TTL back to 3600 or higher
```
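The arithmetic behind the 48-hour buffer, as a sketch (pure shell, no DNS involved): a resolver that cached the record one second before you lowered the TTL holds the old value for the full old TTL, and some resolvers stretch TTLs further, hence the 2× safety factor used here (the factor itself is a rule of thumb, not a standard):

```shell
OLD_TTL=86400          # the 24h TTL you are migrating away from
SAFETY_FACTOR=2        # buffer for resolvers that stretch TTLs
WAIT=$((OLD_TTL * SAFETY_FACTOR))
echo "Lower the TTL, then wait $((WAIT / 3600)) hours before migrating"
# → Lower the TTL, then wait 48 hours before migrating
```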
DNS in Kubernetes: The ndots Trap¶
Kubernetes has its own DNS (CoreDNS) with its own resolution rules. Every pod gets:
```shell
kubectl exec mypod -- cat /etc/resolv.conf
# → nameserver 10.96.0.10
# → search default.svc.cluster.local svc.cluster.local cluster.local
# → options ndots:5
```
The ndots:5 option means: if a name has fewer than 5 dots, try the search domains first.
app.example.com has 2 dots. So the resolver tries:
1. app.example.com.default.svc.cluster.local → NXDOMAIN
2. app.example.com.svc.cluster.local → NXDOMAIN
3. app.example.com.cluster.local → NXDOMAIN
4. app.example.com → found!
That's 3 wasted queries for every external DNS lookup. At 1,000 external calls per second, that's 3,000 wasted NXDOMAIN queries hammering CoreDNS.
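You can predict whether a name triggers search-domain expansion just by counting its dots. A tiny helper to make the rule concrete (`ndots_check` is a hypothetical function for illustration, not a real tool):

```shell
# Predict ndots:5 behavior: fewer than 5 dots -> search domains are tried first.
ndots_check() {
  dots=$(printf '%s' "$1" | tr -cd '.' | wc -c)
  dots=$((dots + 0))   # normalize wc padding on some systems
  if [ "$dots" -lt 5 ]; then
    echo "$1 ($dots dots): search domains tried first"
  else
    echo "$1 ($dots dots): queried as-is"
  fi
}

ndots_check app.example.com                      # 2 dots -> expansion first
ndots_check db.prod.us-east.internal.example.com # 5 dots -> queried as-is
```

A trailing dot short-circuits this entirely: a fully qualified name is never expanded, regardless of its dot count.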
War Story: A team's service had intermittent 3-second delays on external API calls. Everything looked fine — the API was fast, the network was fine. The problem was
`ndots:5`: external names were expanded through the search domains first, and some of those expanded queries fell through to a slow external resolver, which added 3+ seconds of latency. Fix: set `ndots:2` in the pod's `dnsConfig`, or add a trailing dot to external hostnames (`api.example.com.` — the dot makes the name fully qualified, skipping search domains).
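The `dnsConfig` fix from the war story, sketched as a pod spec (the `dnsConfig.options` field is part of the standard Kubernetes Pod API; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod                                # placeholder name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # names with 2+ dots now skip search-domain expansion
```

Note the trade-off: with `ndots:2`, short in-cluster names like `mydb` still resolve via search domains, but two-label names such as `mydb.default` may no longer expand as expected, so prefer full service names (`mydb.default.svc.cluster.local`) after lowering it.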
DNSSEC: The 20-Year Saga¶
DNS has no authentication. A resolver asks "what is app.example.com?" and trusts whatever answer comes back first. An attacker who can race the legitimate answer wins.
DNSSEC (DNS Security Extensions) fixes this by adding cryptographic signatures to DNS records. The resolver can verify that the answer came from the authorized source and wasn't tampered with.
The concept was proposed in 1990. The root zone wasn't signed until July 2010 — a 20-year gap caused by key management complexity, political battles over who controls the root keys, and the classic "who goes first" adoption problem.
Trivia: In 2008, Dan Kaminsky discovered a fundamental cache poisoning vulnerability that affected virtually every DNS implementation. An attacker could forge responses by racing the legitimate answer — and the race was easy to win. The coordinated disclosure was unprecedented: vendors secretly patched before the public announcement. This bug accelerated DNSSEC adoption more than a decade of advocacy had.
Gotcha: If your DNSSEC configuration is wrong — DS record in the parent zone doesn't match the KSK, or signatures expire — DNSSEC-validating resolvers (8.8.8.8, 1.1.1.1) return SERVFAIL. Your domain becomes unreachable to everyone using validating resolvers, while non-validating resolvers still work. This makes debugging maddening: "it works from some networks but not others."
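One way to confirm that a SERVFAIL is DNSSEC-related is dig's `+cd` (checking disabled) flag, which asks the resolver to answer without validating signatures (hostnames here are stand-ins for your domain):

```shell
# Normal query through a validating resolver: SERVFAIL if signatures are broken
dig app.example.com @8.8.8.8

# Same query with validation disabled: an answer here points squarely at DNSSEC
dig +cd app.example.com @8.8.8.8

# Compare the DS record in the parent zone against the child's DNSKEY
dig DS example.com @8.8.8.8
dig DNSKEY example.com @8.8.8.8
```

If `+cd` returns an answer while the plain query fails, the data is reachable but the chain of trust is broken — check for expired signatures or a stale DS record.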
The Debugging Ladder¶
When DNS is broken, work through this:
```
DNS isn't resolving
│
├── Check /etc/hosts first (it overrides DNS)
│     grep hostname /etc/hosts
│
├── Check /etc/resolv.conf
│     What resolver are we using? Is search domain correct?
│
├── Test with dig (bypasses local caches)
│     dig hostname
│     dig hostname @8.8.8.8     ← different resolver
│     dig +trace hostname       ← full resolution path
│
├── Is it cached? Compare resolvers
│     dig @internal-resolver vs dig @8.8.8.8
│     └── Different answers? → Stale cache. Flush it.
│
├── Is the authoritative server responding?
│     dig hostname @ns1.example.com
│     └── No? → Authoritative DNS is down or misconfigured
│
├── Kubernetes-specific
│     kubectl exec pod -- cat /etc/resolv.conf   ← check ndots, search
│     kubectl exec pod -- nslookup kubernetes.default
│     kubectl logs -n kube-system -l k8s-app=kube-dns
│
└── Is it DNSSEC?
      dig +dnssec hostname
      dig +trace +dnssec hostname
      └── SERVFAIL from validating resolvers only? → DNSSEC broken
```
Essential DNS commands¶
```shell
# Basic lookup
dig +short app.example.com

# See TTL (the number before the record type)
dig app.example.com | grep -v '^;'
# → app.example.com.  287  IN  A  203.0.113.50
#                      ↑ seconds remaining in cache

# Full trace from root
dig +trace app.example.com

# Check from multiple resolvers
for r in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo -n "$r: "; dig +short app.example.com @$r
done

# Reverse lookup (IP → hostname)
dig -x 203.0.113.50

# Check specific record types
dig MX example.com    # Mail servers
dig TXT example.com   # SPF, DKIM, verification records
dig NS example.com    # Nameservers
dig SOA example.com   # Zone authority (serial number, refresh times)

# Flush local cache
sudo resolvectl flush-caches   # systemd-resolved
sudo nscd -i hosts             # nscd
```
Flashcard Check¶
Q1: What was HOSTS.TXT?
A single file containing every hostname-to-IP mapping on the internet, maintained by Elizabeth Feinler at SRI until DNS replaced it in 1983.
`/etc/hosts` is its descendant.
Q2: dig shows the right IP but the app connects to the wrong one. Why?
`/etc/hosts` has a different entry for that hostname. `dig` bypasses `/etc/hosts`; applications go through `nsswitch.conf`, which checks `/etc/hosts` first.
Q3: You change a DNS record with TTL 86400. How long until everyone sees it?
Up to 24 hours (plus whatever ISP resolvers add). Some see it immediately (cache was already expired), some take the full TTL. Lower TTL to 60 at least 48 hours before.
Q4: What is ndots:5 in Kubernetes?
If a hostname has fewer than 5 dots, try the search domains first. Causes 3 wasted NXDOMAIN queries for every external lookup. Fix: set
`ndots:2` or use a trailing dot.
Q5: DNSSEC is broken. What do validating resolvers return?
SERVFAIL. Non-validating resolvers still work, making it seem like "DNS works for some people but not others."
Q6: First .com domain ever registered?
symbolics.com, March 15, 1985. Only 6 .com domains were registered in all of 1985. There are now over 160 million.
Exercises¶
Exercise 1: Trace a resolution (hands-on)¶
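Run a full trace yourself (any domain works; `example.com` is used here as a safe default):

```shell
dig +trace example.com
```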
Identify: which root server responded? Which TLD server? What was the TTL on the final A record? How many hops?
Exercise 2: Find a stale cache (hands-on)¶
```shell
# Check what your local resolver thinks
dig +short example.com

# Check what Google thinks
dig +short example.com @8.8.8.8

# Check the TTL remaining
dig example.com | awk '/^example/ {print "TTL:", $2, "seconds"}'
```
Exercise 3: The decision (think)¶
For each scenario, what's the first DNS command you'd run?
- App can't reach `api.partner.com` but can reach `google.com`
- Website works in Chrome but not in `curl`
- Users in Asia can reach your site but users in Europe can't
- Kubernetes pod can reach `10.0.1.5` directly but not `mydb.default.svc.cluster.local`
Answers

1. `dig +short api.partner.com @8.8.8.8` — Is it a DNS problem at all, or can nobody resolve it? Then `dig +trace` to find where it breaks.
2. `cat /etc/hosts` — Chrome uses the OS resolver, but curl might be configured differently. Also check if Chrome has its own DNS cache (`chrome://net-internals/#dns`).
3. `dig +short yoursite.com @resolver-in-europe` vs `@resolver-in-asia` — GeoDNS returning different answers? Check from multiple global resolvers.
4. `kubectl exec pod -- nslookup mydb.default.svc.cluster.local` and check `resolv.conf` — is CoreDNS running? Does the service exist? Check `ndots` and search domains.

Cheat Sheet¶
| Task | Command |
|---|---|
| Basic lookup | dig +short hostname |
| Full trace from root | dig +trace hostname |
| Check specific resolver | dig hostname @8.8.8.8 |
| See TTL | dig hostname \| awk '/^[^;]/ {print $2}' |
| Reverse lookup | dig -x IP |
| Check all record types (ANY is often refused per RFC 8482) | dig ANY hostname |
| Flush systemd-resolved | sudo resolvectl flush-caches |
| K8s DNS debug | kubectl exec pod -- nslookup kubernetes.default |
| K8s CoreDNS logs | kubectl logs -n kube-system -l k8s-app=kube-dns |
Takeaways¶
- Check `/etc/hosts` first. It overrides DNS. This single file has caused more debugging hours than any actual DNS outage.
- TTL is the migration lever. Lower it 48 hours before changes, not at change time.
- `ndots:5` in Kubernetes wastes queries. Every external lookup triggers 3 extra NXDOMAIN queries. Set `ndots:2` or use trailing dots on FQDNs.
- `dig` bypasses local caches. If `dig` and your app disagree, the problem is between the app and DNS (hosts file, nsswitch, local cache) — not DNS itself.
- DNSSEC failures are partial. Validating resolvers return SERVFAIL while non-validating resolvers work fine. This makes the problem look intermittent.
Related Lessons¶
- What Happens When You Click a Link — DNS is step 2 of 9
- Connection Refused — when DNS resolves to the wrong IP