DNS Operations: When nslookup Isn't Enough
Tags: lesson, dns, networking, route-53, kubernetes, coredns, dnssec, debugging
Topics: DNS, networking, Route 53, Kubernetes, CoreDNS, DNSSEC, debugging
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
A developer on your team pings you at 10:15 AM: "The website loads fine for me but three customers in Europe say they get a 'server not found' error. Can you check if the site is down?"
You open the site. It loads instantly. You check the status page — all green. You ping the server — it responds. But the customers are sending screenshots of browser errors, and they're not making it up.
This is a DNS problem. Specifically, it's the kind of DNS problem that nslookup can't
diagnose, that "flush your cache" won't fix, and that makes experienced engineers mutter
"it's always DNS" while reaching for dig.
We'll diagnose this incident step by step. Each section peels back another layer of how DNS
actually works — from dig basics through TTL caching, Route 53 health checks, Kubernetes
CoreDNS traps, and DNSSEC.
Part 1: Your First Move — dig vs nslookup vs host¶
Three tools, wildly different usefulness.
# nslookup — the one everyone knows
nslookup app.example.com
# Server: 127.0.0.53
# Address: 127.0.0.53#53
# Non-authoritative answer:
# Name: app.example.com
# Address: 203.0.113.10
# host — slightly better, same limitations
host app.example.com
# app.example.com has address 203.0.113.10
# dig — the one that actually helps
dig app.example.com
# ;; ANSWER SECTION:
# app.example.com. 247 IN A 203.0.113.10
#
# ;; Query time: 12 msec
# ;; SERVER: 127.0.0.53#53(127.0.0.53)
# ;; WHEN: Tue Mar 23 10:17:32 UTC 2026
# ;; MSG SIZE rcvd: 62
The dig output shows the TTL remaining (247 seconds), the record type (A), the server
that answered, and the query time. nslookup shows almost none of this.
| Feature | nslookup | host | dig |
|---|---|---|---|
| Shows TTL | No | No | Yes |
| Shows authoritative vs cached | Vaguely | No | Yes (AA flag) |
| Query specific server | Yes | Yes | Yes |
| Trace full resolution path | No | No | Yes (+trace) |
| Show DNSSEC info | No | No | Yes (+dnssec) |
| Query specific record type | Clunky | Yes | Yes |
| Scriptable output | No | Somewhat | Yes (+short) |
| Available everywhere | Yes | Usually | Usually |
Name Origin:
dig stands for Domain Information Groper. It was written as part of BIND (Berkeley Internet Name Domain) — the DNS server software that has run the majority of the internet's DNS since the 1980s. BIND itself was written by four UC Berkeley graduate students in the early 1980s as part of a DARPA grant. The daemon is called named — literally "name daemon."
The dig flags you'll use every week¶
dig app.example.com +short # Just the IP, nothing else
dig app.example.com @8.8.8.8 # Ask Google's resolver instead of yours
dig app.example.com +trace # Walk the entire resolution chain
dig -x 203.0.113.10 # Reverse lookup (IP → name)
dig example.com MX +short # Mail servers
dig example.com NS +short # Nameservers
dig example.com SOA # Zone metadata (serial number, TTLs)
dig example.com ANY +noall +answer # All record types (many servers now refuse or minimize ANY)
dig app.example.com +dnssec # Show DNSSEC signatures
dig @ns1.example.com example.com AXFR # Zone transfer (if allowed)
Back to the mission¶
Your first diagnostic move:
# What do YOU see?
dig app.example.com +short
# 203.0.113.10
# What does Google's resolver see?
dig app.example.com @8.8.8.8 +short
# 203.0.113.10
# What does Cloudflare's resolver see?
dig app.example.com @1.1.1.1 +short
# 203.0.113.10
# What do the authoritative nameservers say?
dig app.example.com NS +short
# ns-1234.awsdns-56.org.
# ns-789.awsdns-01.co.uk.
# ns-456.awsdns-23.com.
# ns-012.awsdns-78.net.
# Query each one directly
dig app.example.com @ns-1234.awsdns-56.org +short
# 203.0.113.10
dig app.example.com @ns-789.awsdns-01.co.uk +short
# SERVFAIL
There it is. One of the four authoritative nameservers is returning SERVFAIL. Three work, one doesn't. If you're lucky, your resolver picks a working one. If you're not, you get "server not found."
Mental Model: DNS failure is probabilistic. With four nameservers and one broken, roughly 25% of queries fail. But it's not a clean 25% — resolvers often stick with a nameserver that worked recently, so some users are fine for hours while others fail repeatedly. This is why "it works for me" is the most dangerous sentence in DNS debugging.
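You can watch this probabilistic failure directly. A small loop (a sketch, reusing the nameserver names from this incident) asks every authoritative NS for the record and prints the response code each one returns:

# Query each authoritative NS and report its response code (NOERROR vs SERVFAIL)
for ns in ns-1234.awsdns-56.org ns-789.awsdns-01.co.uk ns-456.awsdns-23.com ns-012.awsdns-78.net; do
  rcode=$(dig app.example.com A @"$ns" +time=2 +tries=1 | awk -F'status: ' '/status:/ {print $2}' | cut -d, -f1)
  answer=$(dig app.example.com A @"$ns" +short +time=2 +tries=1)
  printf '%-30s %-10s %s\n' "$ns" "${rcode:-TIMEOUT}" "$answer"
done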
Flashcard Check #1¶
| Question | Answer |
|---|---|
| What does dig +short do? | Returns only the answer (IP address), no headers or metadata |
| Why is nslookup insufficient for production debugging? | It hides TTL, doesn't trace resolution paths, can't show DNSSEC info |
| If 1 of 4 authoritative nameservers is broken, what percentage of users are affected? | Roughly 25%, but it's probabilistic — some users may be stuck on the broken NS |
| What flag asks dig to use a specific resolver? | @ followed by the resolver IP: dig @8.8.8.8 example.com |
Part 2: The DNS Record Types That Break Things¶
Every DNS record type exists because someone needed to solve a specific problem. Here's every type you'll encounter in production, what it does, and what breaks when it's wrong.
The core records¶
| Type | What it does | Example | What breaks when it's wrong |
|---|---|---|---|
| A | Maps name to IPv4 | app.example.com. 300 IN A 203.0.113.10 | Site unreachable, wrong server |
| AAAA | Maps name to IPv6 | app.example.com. 300 IN AAAA 2001:db8::1 | IPv6 users fail, happy eyeballs delays |
| CNAME | Alias to another name | www.example.com. 300 IN CNAME app.example.com. | Broken chains, apex violations |
| MX | Mail exchange server | example.com. 3600 IN MX 10 mail.example.com. | All email delivery fails |
| TXT | Arbitrary text | "v=spf1 mx -all" | SPF/DKIM/domain verification fails |
| NS | Nameserver delegation | example.com. 86400 IN NS ns1.example.com. | Entire zone unreachable |
| SOA | Start of authority | Serial, refresh, retry, expire, min TTL | Secondaries out of sync |
| PTR | Reverse lookup (IP→name) | 10.113.0.203.in-addr.arpa. PTR app.example.com. | Email rejected, SSH warnings |
| SRV | Service location | _http._tcp.example.com. SRV 10 60 8080 app1.example.com. | Service discovery breaks |
| CAA | Certificate authority auth | example.com. CAA 0 issue "letsencrypt.org" | TLS cert issuance blocked |
CNAME: the record that causes the most grief¶
A CNAME says "this name is an alias for that name." Simple concept, three sharp edges:
Edge 1: CNAME can't coexist with other records.
If app.example.com has a CNAME, it cannot also have an A, MX, TXT, or any other record.
This is per RFC 1034. Some DNS servers enforce it; others silently do unpredictable things.
Edge 2: CNAME at the zone apex is illegal.
You can't do example.com. IN CNAME mycdn.cloudfront.net. because the apex must have SOA
and NS records, and CNAME can't coexist with those. Route 53 invented "alias records" to
solve this. Cloudflare calls it "CNAME flattening." Neither is standard DNS.
Edge 3: CNAME chains slow resolution.
www → app → lb → actual IP means three lookups. Some resolvers follow chains poorly.
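You can make a chain visible with a single query: dig follows the aliases and prints each hop as its own answer line. A sketch, with hypothetical intermediate names matching the chain above:

# Each CNAME hop appears as a separate line in the answer section
dig www.example.com A +noall +answer
# www.example.com.  300 IN CNAME app.example.com.
# app.example.com.  300 IN CNAME lb.example.com.
# lb.example.com.    60 IN A     203.0.113.10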
Gotcha: If you create a CNAME for app.example.com but forget to remove the existing A record, BIND rejects the zone file. But some cloud DNS providers silently accept the conflict and return unpredictable results — sometimes the A, sometimes the CNAME target.
SOA: the record nobody reads until secondaries stop updating¶
example.com. IN SOA ns1.example.com. admin.example.com. (
2026032301 ; Serial — MUST increment on every change
3600 ; Refresh — how often secondaries check for updates (1h)
900 ; Retry — retry interval if refresh fails (15m)
604800 ; Expire — secondaries stop serving after this (7d)
300 ; Minimum TTL — negative caching duration (5m)
)
The serial number is the most operationally important field. If you edit a zone file and forget to increment it, secondary nameservers think they already have the latest version and ignore the update.
Remember: Use the date-based serial format YYYYMMDDNN (e.g., 2026032301 for the first change on March 23, 2026). The NN suffix allows 99 changes per day. Never use arbitrary numbers — you can't go backward, and if you accidentally set the serial to 9999999999, recovering requires manual intervention on every secondary.
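A quick way to catch a forgotten serial increment is to compare the serial that every authoritative nameserver is actually serving. A sketch:

# Field 3 of the SOA answer is the serial; all nameservers should agree
for ns in $(dig example.com NS +short); do
  serial=$(dig example.com SOA @"$ns" +short | awk '{print $3}')
  printf '%-30s serial=%s\n' "$ns" "$serial"
done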
PTR: the invisible record that breaks email¶
PTR records map IPs back to names (the reverse of A records). Email servers check: does the
connecting IP have a PTR? Does it match the sending domain? If not, your email is spam.
Check with dig -x 203.0.113.10.
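Mail operators usually go one step further and check forward-confirmed reverse DNS (FCrDNS): the PTR name must resolve back to the original IP. A sketch:

# FCrDNS: IP -> PTR name -> A record should land back on the same IP
ip=203.0.113.10
ptr=$(dig -x "$ip" +short)
fwd=$(dig "$ptr" A +short)
echo "PTR: $ptr  forward: $fwd"
[ "$fwd" = "$ip" ] && echo "FCrDNS OK" || echo "Mismatch: many mail servers will reject this IP"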
Part 3: TTL, Caching, and the "Works for Me" Problem¶
TTL (Time-To-Live) is the number of seconds a resolver is allowed to cache a DNS answer. It's the single most important operational parameter in DNS, and it's the reason your website works for you but not for customers in Europe.
Where caching happens¶
Your browser → has its own DNS cache (Chrome: chrome://net-internals/#dns)
Your OS → has a stub resolver cache (systemd-resolved, macOS mDNSResponder)
Your local resolver → has a recursive resolver cache (your ISP, corporate DNS, 8.8.8.8)
The TLD servers → cache delegation answers
Each layer → respects TTL independently
When you change a DNS record, every cached copy at every layer has to expire before all users see the new answer. This is what people call "DNS propagation" — but that term is misleading.
Mental Model: DNS doesn't "propagate" like a wave spreading outward. There's no push mechanism. Each cache independently expires and re-fetches. "Propagation time" is really "maximum TTL across all caching layers." A record with TTL 3600 doesn't reach every user in 3600 seconds — it reaches users up to 3600 seconds after the change, depending on when their resolver last cached it.
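You can see independent expiration in action by querying the same name twice against your resolver and once against an authoritative server: the cached TTL counts down, while the authoritative answer always shows the full value. A sketch:

dig app.example.com +noall +answer                            # cached answer, e.g. TTL 247
sleep 30
dig app.example.com +noall +answer                            # same cache entry, TTL now ~217
dig app.example.com +noall +answer @ns-1234.awsdns-56.org     # authoritative, full TTL (e.g. 300)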
The TTL tradeoff¶
High TTL (3600–86400 seconds):
+ Less load on authoritative servers
+ Faster resolution for clients (cached)
- Changes take hours to reach all users
- Can't fail over quickly during incidents
Low TTL (30–300 seconds):
+ Changes take effect within minutes
+ Fast failover during incidents
- More load on authoritative servers
- Some ISP resolvers ignore low TTLs and enforce a minimum (30–60s)
Trivia: Studies have shown that many ISP resolvers cheat on TTL. Some enforce a minimum cache time of 30 seconds regardless of what the authoritative server specifies. A few older ISP resolvers have been observed caching for up to 5 minutes even when TTL is 0. This means "instant failover via low TTL" is a myth for a small but painful tail of users.
The migration TTL dance¶
This is a procedure you'll execute dozens of times in your career:
Normal state: TTL 3600 (1 hour)
48 hours before: Lower TTL to 60 seconds
(wait for all old cached entries to expire)
Migration time: Change the record to the new IP
(with TTL still at 60, propagation takes ~1 minute)
Post-migration: Monitor for 24–48 hours
Raise TTL back to 3600
The 48-hour lead time is critical. If your current TTL is 86400 (24 hours), lowering it to 60 only helps after existing cached entries expire — up to 24 hours later. If you lower the TTL and immediately change the IP, some resolvers still have the old answer cached with the old TTL.
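During and after the cutover, polling a few public resolvers tells you which of them have picked up the new answer. A sketch (the new IP here is hypothetical):

new_ip=198.51.100.20
for r in 8.8.8.8 1.1.1.1 9.9.9.9; do
  got=$(dig app.example.com @"$r" +short | head -1)
  printf '%-10s %-16s %s\n' "$r" "$got" "$([ "$got" = "$new_ip" ] && echo OK || echo STALE)"
done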
War Story: A team migrating from one cloud provider to another forgot to lower their TTL before switching IP addresses. Their A record had a TTL of 86400 — 24 hours. They switched the IP at 9 AM Monday. By noon, internal testing showed success. They declared the migration complete. But ISP resolvers that had cached the old IP at 8:59 AM held onto it until 8:59 AM Tuesday. For 24 hours, approximately 30% of their users reached the old provider's IP, which returned connection refused. The fix was to wait. There is no way to force remote caches to expire early. The lesson: lower TTL before you need it, not during the emergency.
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What does "DNS propagation" actually mean? | Independent cache expiration across resolvers — not a push mechanism |
| Why should you lower TTL 48 hours before a migration? | Existing caches hold the old TTL; you need that time for them to expire |
| Can you force a remote resolver to flush its cache? | No. You can only wait for TTL expiration. |
| What is negative caching? | Resolvers cache NXDOMAIN (name not found) responses, controlled by SOA minimum TTL |
| Why can't CNAME exist at the zone apex? | The apex requires SOA and NS records, and CNAME can't coexist with other record types |
Part 4: dig +trace — Following the Resolution Chain¶
Back to our mission. We know one nameserver is returning SERVFAIL. Let's trace the full resolution path to understand what's happening.
; <<>> DiG 9.18.28 <<>> app.example.com +trace
;; global options: +cmd
. 483012 IN NS a.root-servers.net.
. 483012 IN NS b.root-servers.net.
;; (13 root servers)
;; Received 525 bytes from 127.0.0.53#53 in 0 ms
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
;; (13 TLD servers)
;; Received 1170 bytes from 198.41.0.4#53(a.root-servers.net) in 23 ms
example.com. 172800 IN NS ns-1234.awsdns-56.org.
example.com. 172800 IN NS ns-789.awsdns-01.co.uk.
example.com. 172800 IN NS ns-456.awsdns-23.com.
example.com. 172800 IN NS ns-012.awsdns-78.net.
;; Received 662 bytes from 192.5.6.30#53(a.gtld-servers.net) in 31 ms
app.example.com. 60 IN A 203.0.113.10
;; Received 78 bytes from 205.251.195.234#53(ns-1234.awsdns-56.org) in 15 ms
Read it top to bottom: root servers pointed us to .com TLD servers, which pointed us to
four Route 53 nameservers, one of which returned the actual A record with TTL 60.
Trivia: There are exactly 13 root server identities (A through M) because all 13 addresses had to fit in a single 512-byte DNS UDP response — a protocol constraint from 1987. In reality, those 13 identities are served by over 1,700 physical servers worldwide using anycast routing.
What +trace reveals that +short can't¶
The trace shows you which nameserver answered at each level. If the problem is at the TLD level (wrong NS delegation), or at the authoritative level (one NS returning errors), the trace pinpoints exactly where the chain breaks.
For our mission, the trace succeeded because dig +trace happened to pick a working
nameserver. Let's explicitly test the broken one:
dig app.example.com @ns-789.awsdns-01.co.uk
# ;; Got answer:
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 41285
SERVFAIL from one specific Route 53 nameserver. Time to investigate Route 53.
Part 5: Route 53 — Routing Policies, Health Checks, and Silent Failures¶
Route 53 isn't just a DNS host — it's a traffic router. It supports seven routing policies that control which answer a query gets.
The routing policies, ranked by how often you'll use them¶
| Policy | What it does | When to use it |
|---|---|---|
| Simple | One record, one or more values | Static sites, simple setups |
| Failover | Primary until health check fails, then secondary | Active-passive DR |
| Weighted | Distribute by percentage (90/10, 50/50) | Canary deploys, gradual migrations |
| Latency | Route to lowest-latency region | Multi-region active-active |
| Geolocation | Route by user's country/continent | Compliance, localized content |
| Multivalue | Up to 8 healthy records, random | Poor man's load balancer with health checks |
| Geoproximity | Geographic + adjustable bias | Fine-grained traffic shaping |
Gotcha: Route 53 alias records are an AWS proprietary extension — they don't exist in standard DNS. AWS invented them to solve the zone apex CNAME problem. An alias looks like an A record to resolvers but internally points to an AWS resource (ALB, CloudFront, S3). Alias queries to AWS resources are free. CNAME queries are not.
Failover + health checks: the mission's root cause¶
Let's look at what happened in our incident. The record uses failover routing with health checks:
aws route53 get-health-check-status --health-check-id hc-primary-789 \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}'
# us-east-1: Success, us-west-1: Success, eu-west-1: Failure, ap-southeast-1: Success
The EU health checker is failing. Route 53 health checkers run from multiple regions. When enough checkers report failure, Route 53 stops returning that record to resolvers whose queries arrive via the affected nameserver.
Mental Model: Route 53 health checks are the brain behind routing policies. Without them, failover and weighted routing are just static configurations. With them, Route 53 actively removes unhealthy endpoints from DNS responses. But the health check failing doesn't necessarily mean the endpoint is down — it means the health checker can't reach the endpoint. A firewall, WAF, or rate limiter blocking the health checker IP range causes the same result.
The EU health checker was being blocked by a WAF rate-limiting rule. The service was healthy. The health check was not. Route 53 removed the endpoint from responses to EU resolvers, and European customers got SERVFAIL.
The fix: two steps¶
Step 1: Restore DNS immediately — UPSERT the failover record without the HealthCheckId
field. Route 53 will return the record regardless of health check status.
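A sketch of that UPSERT, assuming a hypothetical hosted zone ID Z123 and the failover PRIMARY record from this incident; re-submitting the record set without the HealthCheckId field detaches the check:

aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "A",
      "SetIdentifier": "primary",
      "Failover": "PRIMARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.10"}]
    }
  }]
}'
# No HealthCheckId field: Route 53 now serves this record unconditionally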
Step 2: Fix the root cause — allowlist Route 53 health checker IPs in the WAF:
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | \
jq '[.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix]'
# Add these CIDRs to your WAF allowlist, then re-attach the health check
One more tool: aws route53 test-dns-answer --hosted-zone-id Z123 --record-name app.example.com --record-type A shows exactly what Route 53 would return, including routing policy evaluation and health check status. It bypasses all caching. Use it first when debugging Route 53.
Part 6: Split-Horizon DNS — Same Name, Different Answers¶
Split-horizon DNS returns different answers depending on who's asking. Internal clients get private IPs; external clients get public IPs. It's essential for hybrid environments and it makes debugging twice as hard.
In BIND, you implement this with "views" — match-clients directives that serve different
zone files based on source IP. In Route 53, you create two hosted zones with the same name
(one public, one private). The private zone is associated with your VPCs. Queries from
inside the VPC hit the private zone; queries from the internet hit the public zone.
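Creating the private side in Route 53 is one command. A sketch with hypothetical VPC and region values:

aws route53 create-hosted-zone \
  --name example.com \
  --caller-reference "split-horizon-$(date +%s)" \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0abc1234 \
  --hosted-zone-config Comment=split-horizon,PrivateZone=true
# Records in this zone are only visible to resolvers inside the associated VPC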
The split-horizon debugging trap¶
When someone reports "I can't reach app.example.com," you must ask: from where? The answer depends on which side of the split they're on.
# Test from inside the network
dig app.example.com @internal-dns +short # Should be 10.0.1.50
# Test from outside
dig app.example.com @8.8.8.8 +short # Should be 203.0.113.50
Gotcha: If internal clients get the external IP, they hit the firewall's public interface instead of the private IP. Traffic hairpins through the firewall — slower at best, blocked at worst. Check the BIND view's match-clients list: if the client's subnet isn't included, they fall through to the external view.
Part 7: CoreDNS in Kubernetes — The ndots:5 Trap¶
CoreDNS is the default DNS server in Kubernetes. It resolves service names, pod names, and external names for all cluster traffic. And it has one default setting that catches almost everyone.
Kubernetes DNS names¶
Service: my-service.default.svc.cluster.local
Pod: 10-0-2-100.default.pod.cluster.local
StatefulSet: web-0.my-service.default.svc.cluster.local
Short form: my-service (within the same namespace)
The ndots:5 problem¶
Every pod's /etc/resolv.conf looks like this:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The ndots:5 setting means: if a name has fewer than 5 dots, try the search domains first
before querying the name as-is. Watch what happens when your pod tries to resolve
api.github.com:
api.github.com has 2 dots (< 5), so Kubernetes tries:
1. api.github.com.default.svc.cluster.local → NXDOMAIN
2. api.github.com.svc.cluster.local → NXDOMAIN
3. api.github.com.cluster.local → NXDOMAIN
4. api.github.com → SUCCESS (finally!)
That's 3 wasted queries before the real one. And each query generates two DNS lookups (A and AAAA records), so it's actually 6 wasted queries plus 2 real ones = 8 total DNS queries for one hostname resolution.
Under the Hood: At 1ms per CoreDNS query, that's 8ms overhead per external lookup. Tolerable. But if CoreDNS is under load or upstream resolution is slow, it compounds to seconds. A microservice making 1,000 external API calls per second generates 8,000 DNS queries per second instead of 2,000. That's the difference between CoreDNS running fine and CoreDNS falling over.
The fix: three options¶
# Option 1: Override ndots per pod
apiVersion: v1
kind: Pod
metadata:
name: external-api-caller
spec:
dnsConfig:
options:
- name: ndots
value: "2"
containers:
- name: app
image: my-app:latest
# Option 2: Append trailing dot in your code
# Instead of: requests.get("https://api.github.com/repos")
# Use: requests.get("https://api.github.com./repos")
# The trailing dot makes it a FQDN — skips search domains entirely
Option 3: Add the autopath plugin to the CoreDNS Corefile (resolves all search
suffixes in a single roundtrip — edit the coredns ConfigMap in kube-system and add
autopath @kubernetes after the kubernetes block).
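Before editing, dump the current Corefile so you know where the kubernetes block ends. This sketch assumes the stock install, where the ConfigMap is named coredns and the config lives under the Corefile data key:

kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# Add "autopath @kubernetes" on its own line after the kubernetes { ... } block,
# then restart CoreDNS (e.g. kubectl -n kube-system rollout restart deployment coredns)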
Quick CoreDNS debugging¶
# Are CoreDNS pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=30
# Test DNS from inside the cluster
kubectl run -it --rm dnstest --image=busybox:1.36 -- nslookup kubernetes.default
# Check a pod's resolv.conf
kubectl exec my-pod -- cat /etc/resolv.conf
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What does Kubernetes ndots:5 do? | Names with fewer than 5 dots try all search domain suffixes before the real lookup |
| How many DNS queries does api.github.com generate with ndots:5? | 8 (4 name attempts x 2 for A + AAAA records) |
| What are two fixes for the ndots problem? | Set ndots: 2 in pod dnsConfig, or append a trailing dot to FQDNs |
| What is the CoreDNS ConfigMap called and where does it live? | coredns in the kube-system namespace |
| What does dnsPolicy: ClusterFirst mean? | Use CoreDNS for resolution (the default for pods) |
Part 8: DNSSEC — Trust but Verify¶
Standard DNS has no authentication. When your resolver asks "what is bank.example.com?" it has no way to verify the answer actually came from the authoritative server. The Kaminsky attack (2008) demonstrated that an attacker could poison a resolver's cache in seconds.
DNSSEC adds cryptographic signatures to DNS responses.
The chain of trust¶
Root Zone (.)
│ KSK signs → DNSKEY RRset
│ ZSK signs → .com DS record (hash of .com's KSK)
▼
.com TLD
│ KSK signs → DNSKEY RRset
│ ZSK signs → example.com DS record (hash of example.com's KSK)
▼
example.com
│ KSK signs → DNSKEY RRset
│ ZSK signs → all other records (A, MX, etc.)
▼
app.example.com A 203.0.113.10 ← verified authentic by RRSIG
KSK (Key Signing Key): Signs only the DNSKEY record set. Its hash (DS record) lives in the parent zone. Changed rarely because rollover requires parent coordination.
ZSK (Zone Signing Key): Signs all actual data records. Rotated more frequently (monthly to yearly). Doesn't require parent zone update.
Remember: "KSK = King Signs Keys, ZSK = Zone Signs Everything." The KSK only signs the keys themselves. The ZSK signs the data.
Checking DNSSEC¶
# Does this domain have DNSSEC?
dig +dnssec example.com A
# Look for the AD flag (Authenticated Data) in the response
# ;; flags: qr rd ra ad; ← AD means your resolver validated the signature
# Full DNSSEC trace
dig +trace +dnssec example.com
# Check DS record at the parent (the link between parent and child)
dig DS example.com @a.gtld-servers.net +short
Gotcha: DNSSEC adds operational risk. If the KMS key backing your Route 53 signing is deleted, or the DS record at your registrar expires, your domain becomes unresolvable for DNSSEC-validating resolvers. That's worse than not having DNSSEC at all. Enable it, but monitor signature expiry aggressively.
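Monitoring signature expiry can be as simple as reading the RRSIG expiration timestamp and alerting when it gets close. A sketch:

# Field 9 of an RRSIG record is its expiration time (YYYYMMDDHHMMSS, UTC)
dig example.com SOA +dnssec +noall +answer | awk '$4 == "RRSIG" {print "signature expires:", $9}'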
DNS over HTTPS and DNS over TLS¶
DNSSEC authenticates answers (they're genuine). DoH/DoT encrypt the query (nobody can see what you looked up). They're complementary, not competing.
| Protocol | Port | Blocks easily? | Who uses it |
|---|---|---|---|
| DNS (plain) | 53 (UDP/TCP) | Yes | Everything by default |
| DoT | 853 (TCP+TLS) | Yes (dedicated port) | systemd-resolved, Android 9+ |
| DoH | 443 (HTTPS) | Hard (same as web traffic) | Firefox, Chrome, Cloudflare |
Interview tip: If asked about DNSSEC vs DoH/DoT: DNSSEC provides authentication (the answer is genuine), DoH/DoT provide privacy (nobody sees your query). They're complementary. Test DoH yourself:
curl -s -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=example.com&type=A' | jq
Part 9: The 3 AM Debugging Ladder¶
When DNS is broken, work through this sequence. Each step rules out one layer:
1. Can you resolve it? dig app.example.com +short
2. Can public resolvers? dig @8.8.8.8 and dig @1.1.1.1 — if yes, your local resolver is the problem
3. What do the authoritative NSes say? Query each one individually: dig @ns1.example.com +short
4. Are secondaries in sync? Compare SOA serials: dig SOA example.com @ns1 vs @ns2
5. Where does the chain break? dig +trace walks root → TLD → authoritative
6. What does Route 53 think? aws route53 test-dns-answer bypasses all caching
7. Are health checks healthy? aws route53 get-health-check-status
8. What's on the wire? tcpdump -n -i any port 53 -c 50
Remember: DNS uses UDP port 53 for queries under 512 bytes and TCP port 53 for zone transfers and large responses (DNSSEC, large TXT records). If your firewall only allows UDP 53, DNSSEC validation and zone transfers will silently fail.
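A quick way to confirm the TCP path works end to end is to force dig over TCP and request a large DNSSEC answer. A sketch:

dig example.com SOA +tcp @8.8.8.8 +noall +comments          # should show NOERROR, not a timeout
dig example.com DNSKEY +dnssec @8.8.8.8 | grep 'MSG SIZE'   # large answers exercise EDNS0 and TCP fallback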
Exercises¶
Exercise 1: dig triage and trace (5 minutes)¶
Pick any domain and run the full diagnostic sequence:
dig example.com +short # What IP?
dig example.com @8.8.8.8 +short # Same from Google's resolver?
dig example.com NS +short # How many nameservers?
dig example.com SOA +short # Serial format?
dig example.com +trace # Walk the full chain
What to look for
- Do your local resolver and 8.8.8.8 return the same IP? If not, suspect caching or split-horizon.
- How many NS records are there? (There should be at least 2.)
- Is the SOA serial date-based (YYYYMMDDNN) or arbitrary?
- In the trace: how many root servers are listed? Which TLD server answered? What's the final TTL?
Exercise 2: Find the ndots problem (10 minutes)¶
If you have access to a Kubernetes cluster:
kubectl run -it --rm dnstest --image=busybox:1.36 -- sh
# Inside the pod:
cat /etc/resolv.conf
nslookup api.github.com
# Note the time
# Now try:
nslookup api.github.com.
# Note the time difference (with trailing dot should be faster)
What you should see
The trailing-dot version should resolve noticeably faster because it skips the search domain suffix attempts. The `resolv.conf` should show `ndots:5` and several search domains.
Exercise 3: Design a failover architecture (15 minutes)¶
You have a web application in us-east-1 (primary) and us-west-2 (DR). Design the Route 53 configuration:
- What routing policy do you use?
- What health check configuration?
- What TTL on the DNS records?
- What happens if the health checker itself is blocked by a firewall?
Solution
1. Failover routing with PRIMARY (us-east-1) and SECONDARY (us-west-2)
2. HTTPS health check on `/health`, 10-second interval, failure threshold of 3
3. TTL 60 seconds (low enough for fast failover, high enough to not overwhelm Route 53)
4. Route 53 removes the primary from DNS responses, causing all traffic to hit the secondary — even if the primary is actually healthy. Always allowlist Route 53 health checker IPs in your firewall/WAF.
Cheat Sheet¶
dig Quick Reference¶
| Command | What it does |
|---|---|
| dig example.com | Full query with all sections |
| dig +short example.com | Just the answer |
| dig @8.8.8.8 example.com | Query a specific resolver |
| dig +trace example.com | Walk the full resolution chain |
| dig -x 203.0.113.10 | Reverse lookup |
| dig example.com MX | Query specific record type |
| dig +dnssec example.com | Show DNSSEC signatures |
| dig example.com SOA +short | Check serial number |
DNS Record Types¶
| Type | Maps | Key fact |
|---|---|---|
| A | name → IPv4 | Most common record |
| AAAA | name → IPv6 | Needed for dual-stack |
| CNAME | name → name | Can't be at zone apex, can't coexist with other types |
| MX | domain → mail server | Priority number: lower = preferred |
| NS | domain → nameserver | Delegation chain |
| SOA | zone metadata | Serial must increment on every change |
| TXT | arbitrary text | SPF, DKIM, domain verification |
| PTR | IP → name | Required for email, reverse DNS |
| SRV | service location | Used by Consul, Kubernetes headless services |
| CAA | authorized CAs | Controls who can issue TLS certs |
Route 53 Routing Policies¶
| Policy | Behavior | Health check? |
|---|---|---|
| Simple | Return all values, random order | No |
| Failover | Primary until unhealthy, then secondary | Required |
| Weighted | Percentage-based distribution | Optional |
| Latency | Lowest latency region | Optional |
| Geolocation | User's country/continent | Optional |
| Multivalue | Up to 8 healthy records | Required |
TTL Strategy¶
| Scenario | TTL |
|---|---|
| Normal operations | 300–3600s |
| 48h before migration | 60s |
| During migration | 60s |
| After migration verified | Raise to 300–3600s |
| Static records (MX, NS) | 3600–86400s |
Takeaways¶
- dig is the DNS debugging tool. nslookup hides the information you need most: TTL, authoritative flags, and the full resolution chain. Reach for dig first, always.
- DNS doesn't propagate — caches expire. There's no push mechanism. "Propagation time" is the maximum TTL across all caching layers. Lower TTL before you need fast changes.
- Test every authoritative nameserver individually. One broken NS out of four means 25% of users fail intermittently — the hardest kind of failure to reproduce.
- Health checks can cause the outages they're designed to prevent. If a firewall or WAF blocks the health checker IPs, Route 53 removes a healthy endpoint from DNS. Always allowlist health checker IP ranges.
- Kubernetes ndots:5 multiplies external DNS queries by 4x. Set ndots: 2 or use trailing dots for pods that primarily call external services.
- DNSSEC authenticates, DoH/DoT encrypts. They solve different problems and are complementary. DNSSEC proves the answer is genuine; DoH/DoT hide what you asked.
Related Lessons¶
- What Happens When You Click a Link — follows a request end-to-end from DNS through TCP, TLS, HTTP, and back
- Connection Refused — differential diagnosis where DNS is one of many possible causes
- The Load Balancer Lied — health checks and routing gone wrong at the infrastructure layer
- Kubernetes Services: How Traffic Finds Your Pod — how CoreDNS fits into the Kubernetes networking model
- What Happens When Your Certificate Expires — TLS and DNS (CAA records) intersect