DNS Operations: When nslookup Isn't Enough
Tags: lesson, dns, networking, route-53, kubernetes, coredns, dnssec, debugging
Topics: DNS, networking, Route 53, Kubernetes, CoreDNS, DNSSEC, debugging
Level: L1–L2 (Foundations → Operations)
Time: 60–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
A developer on your team pings you at 10:15 AM: "The website loads fine for me but three customers in Europe say they get a 'server not found' error. Can you check if the site is down?"
You open the site. It loads instantly. You check the status page — all green. You ping the server — it responds. But the customers are sending screenshots of browser errors, and they're not making it up.
This is a DNS problem. Specifically, it's the kind of DNS problem that nslookup can't
diagnose, that "flush your cache" won't fix, and that makes experienced engineers mutter
"it's always DNS" while reaching for dig.
We'll diagnose this incident step by step. Each section peels back another layer of how DNS
actually works — from dig basics through TTL caching, Route 53 health checks, Kubernetes
CoreDNS traps, and DNSSEC.
Part 1: Your First Move — dig vs nslookup vs host¶
Three tools, wildly different usefulness.
# nslookup — the one everyone knows
nslookup app.example.com
# Server: 127.0.0.53
# Address: 127.0.0.53#53
# Non-authoritative answer:
# Name: app.example.com
# Address: 203.0.113.10
# host — slightly better, same limitations
host app.example.com
# app.example.com has address 203.0.113.10
# dig — the one that actually helps
dig app.example.com
# ;; ANSWER SECTION:
# app.example.com. 247 IN A 203.0.113.10
#
# ;; Query time: 12 msec
# ;; SERVER: 127.0.0.53#53(127.0.0.53)
# ;; WHEN: Tue Mar 23 10:17:32 UTC 2026
# ;; MSG SIZE rcvd: 62
The dig output shows the TTL remaining (247 seconds), the record type (A), the server
that answered, and the query time. nslookup shows almost none of this.
| Feature | nslookup | host | dig |
|---|---|---|---|
| Shows TTL | No | No | Yes |
| Shows authoritative vs cached | Vaguely | No | Yes (AA flag) |
| Query specific server | Yes | Yes | Yes |
| Trace full resolution path | No | No | Yes (+trace) |
| Show DNSSEC info | No | No | Yes (+dnssec) |
| Query specific record type | Clunky | Yes | Yes |
| Scriptable output | No | Somewhat | Yes (+short) |
| Available everywhere | Yes | Usually | Usually |
Name Origin:
dig stands for Domain Information Groper. It was written as part of BIND (Berkeley Internet Name Domain) — the DNS server software that has run the majority of the internet's DNS since the 1980s. BIND itself was written by four UC Berkeley graduate students in the early 1980s as part of a DARPA grant. The daemon is called named — literally "name daemon."
The dig flags you'll use every week¶
dig app.example.com +short # Just the IP, nothing else
dig app.example.com @8.8.8.8 # Ask Google's resolver instead of yours
dig app.example.com +trace # Walk the entire resolution chain
dig -x 203.0.113.10 # Reverse lookup (IP → name)
dig example.com MX +short # Mail servers
dig example.com NS +short # Nameservers
dig example.com SOA # Zone metadata (serial number, TTLs)
dig example.com ANY +noall +answer # All record types (many servers now refuse or minimize ANY)
dig app.example.com +dnssec # Show DNSSEC signatures
dig @ns1.example.com example.com AXFR # Zone transfer (if allowed)
Back to the mission¶
Your first diagnostic move:
# What do YOU see?
dig app.example.com +short
# 203.0.113.10
# What does Google's resolver see?
dig app.example.com @8.8.8.8 +short
# 203.0.113.10
# What does Cloudflare's resolver see?
dig app.example.com @1.1.1.1 +short
# 203.0.113.10
# What do the authoritative nameservers say?
dig app.example.com NS +short
# ns-1234.awsdns-56.org.
# ns-789.awsdns-01.co.uk.
# ns-456.awsdns-23.com.
# ns-012.awsdns-78.net.
# Query each one directly
dig app.example.com @ns-1234.awsdns-56.org +short
# 203.0.113.10
dig app.example.com @ns-789.awsdns-01.co.uk +short
# SERVFAIL
There it is. One of the four authoritative nameservers is returning SERVFAIL. Three work, one doesn't. If you're lucky, your resolver picks a working one. If you're not, you get "server not found."
Mental Model: DNS failure is probabilistic. With four nameservers and one broken, roughly 25% of queries fail. But it's not a clean 25% — resolvers often stick with a nameserver that worked recently, so some users are fine for hours while others fail repeatedly. This is why "it works for me" is the most dangerous sentence in DNS debugging.
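You can watch this probabilistic failure directly. A small loop (a sketch, reusing the nameserver names from this incident) asks every authoritative NS for the record and prints the response code each one returns:

# Query each authoritative NS and report its response code (NOERROR vs SERVFAIL)
for ns in ns-1234.awsdns-56.org ns-789.awsdns-01.co.uk ns-456.awsdns-23.com ns-012.awsdns-78.net; do
  rcode=$(dig app.example.com A @"$ns" +time=2 +tries=1 | awk -F'status: ' '/status:/ {print $2}' | cut -d, -f1)
  answer=$(dig app.example.com A @"$ns" +short +time=2 +tries=1)
  printf '%-30s %-10s %s\n' "$ns" "${rcode:-TIMEOUT}" "$answer"
done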
Flashcard Check #1¶
| Question | Answer |
|---|---|
| What does dig +short do? | Returns only the answer (IP address), no headers or metadata |
| Why is nslookup insufficient for production debugging? | It hides TTL, doesn't trace resolution paths, can't show DNSSEC info |
| If 1 of 4 authoritative nameservers is broken, what percentage of users are affected? | Roughly 25%, but it's probabilistic — some users may be stuck on the broken NS |
| What flag asks dig to use a specific resolver? | @ followed by the resolver IP: dig @8.8.8.8 example.com |
Part 2: The DNS Record Types That Break Things¶
Every DNS record type exists because someone needed to solve a specific problem. Here's every type you'll encounter in production, what it does, and what breaks when it's wrong.
The core records¶
| Type | What it does | Example | What breaks when it's wrong |
|---|---|---|---|
| A | Maps name to IPv4 | app.example.com. 300 IN A 203.0.113.10 | Site unreachable, wrong server |
| AAAA | Maps name to IPv6 | app.example.com. 300 IN AAAA 2001:db8::1 | IPv6 users fail, happy eyeballs delays |
| CNAME | Alias to another name | www.example.com. 300 IN CNAME app.example.com. | Broken chains, apex violations |
| MX | Mail exchange server | example.com. 3600 IN MX 10 mail.example.com. | All email delivery fails |
| TXT | Arbitrary text | "v=spf1 mx -all" | SPF/DKIM/domain verification fails |
| NS | Nameserver delegation | example.com. 86400 IN NS ns1.example.com. | Entire zone unreachable |
| SOA | Start of authority | Serial, refresh, retry, expire, min TTL | Secondaries out of sync |
| PTR | Reverse lookup (IP→name) | 10.113.0.203.in-addr.arpa. PTR app.example.com. | Email rejected, SSH warnings |
| SRV | Service location | _http._tcp.example.com. SRV 10 60 8080 app1.example.com. | Service discovery breaks |
| CAA | Certificate authority auth | example.com. CAA 0 issue "letsencrypt.org" | TLS cert issuance blocked |
CNAME: the record that causes the most grief¶
A CNAME says "this name is an alias for that name." Simple concept, three sharp edges:
Edge 1: CNAME can't coexist with other records.
If app.example.com has a CNAME, it cannot also have an A, MX, TXT, or any other record.
This is per RFC 1034. Some DNS servers enforce it; others silently do unpredictable things.
Edge 2: CNAME at the zone apex is illegal.
You can't do example.com. IN CNAME mycdn.cloudfront.net. because the apex must have SOA
and NS records, and CNAME can't coexist with those. Route 53 invented "alias records" to
solve this. Cloudflare calls it "CNAME flattening." Neither is standard DNS.
Edge 3: CNAME chains slow resolution.
www → app → lb → actual IP means three lookups. Some resolvers follow chains poorly.
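You can make a chain visible with a single query: dig follows the aliases and prints each hop as its own answer line. A sketch, with hypothetical intermediate names matching the chain above:

# Each CNAME hop appears as a separate line in the answer section
dig www.example.com A +noall +answer
# www.example.com.  300 IN CNAME app.example.com.
# app.example.com.  300 IN CNAME lb.example.com.
# lb.example.com.    60 IN A     203.0.113.10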
Gotcha: If you create a CNAME for app.example.com but forget to remove the existing A record, BIND rejects the zone file. But some cloud DNS providers silently accept the conflict and return unpredictable results — sometimes the A, sometimes the CNAME target.
SOA: the record nobody reads until secondaries stop updating¶
example.com. IN SOA ns1.example.com. admin.example.com. (
2026032301 ; Serial — MUST increment on every change
3600 ; Refresh — how often secondaries check for updates (1h)
900 ; Retry — retry interval if refresh fails (15m)
604800 ; Expire — secondaries stop serving after this (7d)
300 ; Minimum TTL — negative caching duration (5m)
)
The serial number is the most operationally important field. If you edit a zone file and forget to increment it, secondary nameservers think they already have the latest version and ignore the update.
Remember: Use the date-based serial format YYYYMMDDNN (e.g., 2026032301 for the first change on March 23, 2026). The NN suffix allows 99 changes per day. Never use arbitrary numbers — you can't go backward, and if you accidentally set the serial to 9999999999, recovering requires manual intervention on every secondary.
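A quick way to catch a forgotten serial increment is to compare the serial that every authoritative nameserver is actually serving. A sketch:

# Field 3 of the SOA answer is the serial; all nameservers should agree
for ns in $(dig example.com NS +short); do
  serial=$(dig example.com SOA @"$ns" +short | awk '{print $3}')
  printf '%-30s serial=%s\n' "$ns" "$serial"
done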
PTR: the invisible record that breaks email¶
PTR records map IPs back to names (the reverse of A records). Email servers check: does the
connecting IP have a PTR? Does it match the sending domain? If not, your email is spam.
Check with dig -x 203.0.113.10.
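Mail operators usually go one step further and check forward-confirmed reverse DNS (FCrDNS): the PTR name must resolve back to the original IP. A sketch:

# FCrDNS: IP -> PTR name -> A record should land back on the same IP
ip=203.0.113.10
ptr=$(dig -x "$ip" +short)
fwd=$(dig "$ptr" A +short)
echo "PTR: $ptr  forward: $fwd"
[ "$fwd" = "$ip" ] && echo "FCrDNS OK" || echo "Mismatch: many mail servers will reject this IP"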
Part 3: TTL, Caching, and the "Works for Me" Problem¶
TTL (Time-To-Live) is the number of seconds a resolver is allowed to cache a DNS answer. It's the single most important operational parameter in DNS, and it's the reason your website works for you but not for customers in Europe.
Where caching happens¶
Your browser → has its own DNS cache (Chrome: chrome://net-internals/#dns)
Your OS → has a stub resolver cache (systemd-resolved, macOS mDNSResponder)
Your local resolver → has a recursive resolver cache (your ISP, corporate DNS, 8.8.8.8)
The TLD servers → cache delegation answers
Each layer → respects TTL independently
When you change a DNS record, every cached copy at every layer has to expire before all users see the new answer. This is what people call "DNS propagation" — but that term is misleading.
Mental Model: DNS doesn't "propagate" like a wave spreading outward. There's no push mechanism. Each cache independently expires and re-fetches. "Propagation time" is really "maximum TTL across all caching layers." A record with TTL 3600 doesn't reach every user in 3600 seconds — it reaches users up to 3600 seconds after the change, depending on when their resolver last cached it.
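You can see independent expiration in action by querying the same name twice against your resolver and once against an authoritative server: the cached TTL counts down, while the authoritative answer always shows the full value. A sketch:

dig app.example.com +noall +answer                            # cached answer, e.g. TTL 247
sleep 30
dig app.example.com +noall +answer                            # same cache entry, TTL now ~217
dig app.example.com +noall +answer @ns-1234.awsdns-56.org     # authoritative, full TTL (e.g. 300)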
The TTL tradeoff¶
High TTL (3600–86400 seconds):
+ Less load on authoritative servers
+ Faster resolution for clients (cached)
- Changes take hours to reach all users
- Can't fail over quickly during incidents
Low TTL (30–300 seconds):
+ Changes take effect within minutes
+ Fast failover during incidents
- More load on authoritative servers
- Some ISP resolvers ignore low TTLs and enforce a minimum (30–60s)
Trivia: Studies have shown that many ISP resolvers cheat on TTL. Some enforce a minimum cache time of 30 seconds regardless of what the authoritative server specifies. A few older ISP resolvers have been observed caching for up to 5 minutes even when TTL is 0. This means "instant failover via low TTL" is a myth for a small but painful tail of users.
The migration TTL dance¶
This is a procedure you'll execute dozens of times in your career:
Normal state: TTL 3600 (1 hour)
48 hours before: Lower TTL to 60 seconds
(wait for all old cached entries to expire)
Migration time: Change the record to the new IP
(with TTL still at 60, propagation takes ~1 minute)
Post-migration: Monitor for 24–48 hours
Raise TTL back to 3600
The 48-hour lead time is critical. If your current TTL is 86400 (24 hours), lowering it to 60 only helps after existing cached entries expire — up to 24 hours later. If you lower the TTL and immediately change the IP, some resolvers still have the old answer cached with the old TTL.
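During and after the cutover, polling a few public resolvers tells you which of them have picked up the new answer. A sketch (the new IP here is hypothetical):

new_ip=198.51.100.20
for r in 8.8.8.8 1.1.1.1 9.9.9.9; do
  got=$(dig app.example.com @"$r" +short | head -1)
  printf '%-10s %-16s %s\n' "$r" "$got" "$([ "$got" = "$new_ip" ] && echo OK || echo STALE)"
done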
War Story: A team migrating from one cloud provider to another forgot to lower their TTL before switching IP addresses. Their A record had a TTL of 86400 — 24 hours. They switched the IP at 9 AM Monday. By noon, internal testing showed success. They declared the migration complete. But ISP resolvers that had cached the old IP at 8:59 AM held onto it until 8:59 AM Tuesday. For 24 hours, approximately 30% of their users reached the old provider's IP, which returned connection refused. The fix was to wait. There is no way to force remote caches to expire early. The lesson: lower TTL before you need it, not during the emergency.
Flashcard Check #2¶
| Question | Answer |
|---|---|
| What does "DNS propagation" actually mean? | Independent cache expiration across resolvers — not a push mechanism |
| Why should you lower TTL 48 hours before a migration? | Existing caches hold the old TTL; you need that time for them to expire |
| Can you force a remote resolver to flush its cache? | No. You can only wait for TTL expiration. |
| What is negative caching? | Resolvers cache NXDOMAIN (name not found) responses, controlled by SOA minimum TTL |
| Why can't CNAME exist at the zone apex? | The apex requires SOA and NS records, and CNAME can't coexist with other record types |
Part 4: dig +trace — Following the Resolution Chain¶
Back to our mission. We know one nameserver is returning SERVFAIL. Let's trace the full resolution path to understand what's happening.
; <<>> DiG 9.18.28 <<>> app.example.com +trace
;; global options: +cmd
. 483012 IN NS a.root-servers.net.
. 483012 IN NS b.root-servers.net.
;; (13 root servers)
;; Received 525 bytes from 127.0.0.53#53 in 0 ms
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
;; (13 TLD servers)
;; Received 1170 bytes from 198.41.0.4#53(a.root-servers.net) in 23 ms
example.com. 172800 IN NS ns-1234.awsdns-56.org.
example.com. 172800 IN NS ns-789.awsdns-01.co.uk.
example.com. 172800 IN NS ns-456.awsdns-23.com.
example.com. 172800 IN NS ns-012.awsdns-78.net.
;; Received 662 bytes from 192.5.6.30#53(a.gtld-servers.net) in 31 ms
app.example.com. 60 IN A 203.0.113.10
;; Received 78 bytes from 205.251.195.234#53(ns-1234.awsdns-56.org) in 15 ms
Read it top to bottom: root servers pointed us to .com TLD servers, which pointed us to
four Route 53 nameservers, one of which returned the actual A record with TTL 60.
Trivia: There are exactly 13 root server identities (A through M) because all 13 addresses had to fit in a single 512-byte DNS UDP response — a protocol constraint from 1987. In reality, those 13 identities are served by over 1,700 physical servers worldwide using anycast routing.
What +trace reveals that +short can't¶
The trace shows you which nameserver answered at each level. If the problem is at the TLD level (wrong NS delegation), or at the authoritative level (one NS returning errors), the trace pinpoints exactly where the chain breaks.
For our mission, the trace succeeded because dig +trace happened to pick a working
nameserver. Let's explicitly test the broken one:
dig app.example.com @ns-789.awsdns-01.co.uk
# ;; Got answer:
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 41285
SERVFAIL from one specific Route 53 nameserver. Time to investigate Route 53.
Part 5: Route 53 — Routing Policies, Health Checks, and Silent Failures¶
Route 53 isn't just a DNS host — it's a traffic router. It supports seven routing policies that control which answer a query gets.
The routing policies, ranked by how often you'll use them¶
| Policy | What it does | When to use it |
|---|---|---|
| Simple | One record, one or more values | Static sites, simple setups |
| Failover | Primary until health check fails, then secondary | Active-passive DR |
| Weighted | Distribute by percentage (90/10, 50/50) | Canary deploys, gradual migrations |
| Latency | Route to lowest-latency region | Multi-region active-active |
| Geolocation | Route by user's country/continent | Compliance, localized content |
| Multivalue | Up to 8 healthy records, random | Poor man's load balancer with health checks |
| Geoproximity | Geographic + adjustable bias | Fine-grained traffic shaping |
Gotcha: Route 53 alias records are an AWS proprietary extension — they don't exist in standard DNS. AWS invented them to solve the zone apex CNAME problem. An alias looks like an A record to resolvers but internally points to an AWS resource (ALB, CloudFront, S3). Alias queries to AWS resources are free. CNAME queries are not.
Failover + health checks: the mission's root cause¶
Let's look at what happened in our incident. The record uses failover routing with health checks:
aws route53 get-health-check-status --health-check-id hc-primary-789 \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}'
# us-east-1: Success, us-west-1: Success, eu-west-1: Failure, ap-southeast-1: Success
The EU health checker is failing. Route 53 health checkers run from multiple regions. When enough checkers report failure, Route 53 stops returning that record to resolvers whose queries arrive via the affected nameserver.
Mental Model: Route 53 health checks are the brain behind routing policies. Without them, failover and weighted routing are just static configurations. With them, Route 53 actively removes unhealthy endpoints from DNS responses. But the health check failing doesn't necessarily mean the endpoint is down — it means the health checker can't reach the endpoint. A firewall, WAF, or rate limiter blocking the health checker IP range causes the same result.
The EU health checker was being blocked by a WAF rate-limiting rule. The service was healthy. The health check was not. Route 53 removed the endpoint from responses to EU resolvers, and European customers got SERVFAIL.
The fix: two steps¶
Step 1: Restore DNS immediately — UPSERT the failover record without the HealthCheckId
field. Route 53 will return the record regardless of health check status.
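A sketch of that UPSERT, assuming a hypothetical hosted zone ID Z123 and the failover PRIMARY record from this incident; re-submitting the record set without the HealthCheckId field detaches the check:

aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "A",
      "SetIdentifier": "primary",
      "Failover": "PRIMARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.10"}]
    }
  }]
}'
# No HealthCheckId field: Route 53 now serves this record unconditionally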
Step 2: Fix the root cause — allowlist Route 53 health checker IPs in the WAF:
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | \
jq '[.prefixes[] | select(.service=="ROUTE53_HEALTHCHECKS") | .ip_prefix]'
# Add these CIDRs to your WAF allowlist, then re-attach the health check
One more tool: aws route53 test-dns-answer --hosted-zone-id Z123 --record-name app.example.com --record-type A shows exactly what Route 53 would return, including routing policy evaluation and health check status. It bypasses all caching. Use it first when debugging Route 53.
Part 6: Split-Horizon DNS — Same Name, Different Answers¶
Split-horizon DNS returns different answers depending on who's asking. Internal clients get private IPs; external clients get public IPs. It's essential for hybrid environments and it makes debugging twice as hard.
In BIND, you implement this with "views" — match-clients directives that serve different
zone files based on source IP. In Route 53, you create two hosted zones with the same name
(one public, one private). The private zone is associated with your VPCs. Queries from
inside the VPC hit the private zone; queries from the internet hit the public zone.
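Creating the private side in Route 53 is one command. A sketch with hypothetical VPC and region values:

aws route53 create-hosted-zone \
  --name example.com \
  --caller-reference "split-horizon-$(date +%s)" \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0abc1234 \
  --hosted-zone-config Comment=split-horizon,PrivateZone=true
# Records in this zone are only visible to resolvers inside the associated VPC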
The split-horizon debugging trap¶
When someone reports "I can't reach app.example.com," you must ask: from where? The answer depends on which side of the split they're on.
# Test from inside the network
dig app.example.com @internal-dns +short # Should be 10.0.1.50
# Test from outside
dig app.example.com @8.8.8.8 +short # Should be 203.0.113.50
Gotcha: If internal clients get the external IP, they hit the firewall's public interface instead of the private IP. Traffic hairpins through the firewall — slower at best, blocked at worst. Check the BIND view's match-clients list: if the client's subnet isn't included, they fall through to the external view.
Part 7: CoreDNS in Kubernetes — The ndots:5 Trap¶
CoreDNS is the default DNS server in Kubernetes. It resolves service names, pod names, and external names for all cluster traffic. And it has one default setting that catches almost everyone.
Kubernetes DNS names¶
Service: my-service.default.svc.cluster.local
Pod: 10-0-2-100.default.pod.cluster.local
StatefulSet: web-0.my-service.default.svc.cluster.local
Short form: my-service (within the same namespace)
The ndots:5 problem¶
Every pod's /etc/resolv.conf looks like this:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The ndots:5 setting means: if a name has fewer than 5 dots, try the search domains first
before querying the name as-is. Watch what happens when your pod tries to resolve
api.github.com:
api.github.com has 2 dots (< 5), so Kubernetes tries:
1. api.github.com.default.svc.cluster.local → NXDOMAIN
2. api.github.com.svc.cluster.local → NXDOMAIN
3. api.github.com.cluster.local → NXDOMAIN
4. api.github.com → SUCCESS (finally!)
That's 3 wasted queries before the real one. And each query generates two DNS lookups (A and AAAA records), so it's actually 6 wasted queries plus 2 real ones = 8 total DNS queries for one hostname resolution.
Under the Hood: At 1ms per CoreDNS query, that's 8ms overhead per external lookup. Tolerable. But if CoreDNS is under load or upstream resolution is slow, it compounds to seconds. A microservice making 1,000 external API calls per second generates 8,000 DNS queries per second instead of 2,000. That's the difference between CoreDNS running fine and CoreDNS falling over.
The fix: three options¶
# Option 1: Override ndots per pod
apiVersion: v1
kind: Pod
metadata:
name: external-api-caller
spec:
dnsConfig:
options:
- name: ndots
value: "2"
containers:
- name: app
image: my-app:latest
# Option 2: Append trailing dot in your code
# Instead of: requests.get("https://api.github.com/repos")
# Use: requests.get("https://api.github.com./repos")
# The trailing dot makes it a FQDN — skips search domains entirely
Option 3: Add the autopath plugin to the CoreDNS Corefile (resolves all search
suffixes in a single roundtrip — edit the coredns ConfigMap in kube-system and add
autopath @kubernetes after the kubernetes block).
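Before editing, dump the current Corefile so you know where the kubernetes block ends. This sketch assumes the stock install, where the ConfigMap is named coredns and the config lives under the Corefile data key:

kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# Add "autopath @kubernetes" on its own line after the kubernetes { ... } block,
# then restart CoreDNS (e.g. kubectl -n kube-system rollout restart deployment coredns)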
Quick CoreDNS debugging¶
# Are CoreDNS pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=30
# Test DNS from inside the cluster
kubectl run -it --rm dnstest --image=busybox:1.36 -- nslookup kubernetes.default
# Check a pod's resolv.conf
kubectl exec my-pod -- cat /etc/resolv.conf
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What does Kubernetes ndots:5 do? | Names with fewer than 5 dots try all search domain suffixes before the real lookup |
| How many DNS queries does api.github.com generate with ndots:5? | 8 (4 name attempts x 2 for A + AAAA records) |
| What are two fixes for the ndots problem? | Set ndots: 2 in pod dnsConfig, or append a trailing dot to FQDNs |
| What is the CoreDNS ConfigMap called and where does it live? | coredns in the kube-system namespace |
| What does dnsPolicy: ClusterFirst mean? | Use CoreDNS for resolution (the default for pods) |
Part 8: DNSSEC — Trust but Verify¶
Standard DNS has no authentication. When your resolver asks "what is bank.example.com?" it has no way to verify the answer actually came from the authoritative server. The Kaminsky attack (2008) demonstrated that an attacker could poison a resolver's cache in seconds.
DNSSEC adds cryptographic signatures to DNS responses.
The chain of trust¶
Root Zone (.)
│ KSK signs → DNSKEY RRset
│ ZSK signs → .com DS record (hash of .com's KSK)
▼
.com TLD
│ KSK signs → DNSKEY RRset
│ ZSK signs → example.com DS record (hash of example.com's KSK)
▼
example.com
│ KSK signs → DNSKEY RRset
│ ZSK signs → all other records (A, MX, etc.)
▼
app.example.com A 203.0.113.10 ← verified authentic by RRSIG
KSK (Key Signing Key): Signs only the DNSKEY record set. Its hash (DS record) lives in the parent zone. Changed rarely because rollover requires parent coordination.
ZSK (Zone Signing Key): Signs all actual data records. Rotated more frequently (monthly to yearly). Doesn't require parent zone update.
Remember: "KSK = King Signs Keys, ZSK = Zone Signs Everything." The KSK only signs the keys themselves. The ZSK signs the data.
Checking DNSSEC¶
# Does this domain have DNSSEC?
dig +dnssec example.com A
# Look for the AD flag (Authenticated Data) in the response
# ;; flags: qr rd ra ad; ← AD means your resolver validated the signature
# Full DNSSEC trace
dig +trace +dnssec example.com
# Check DS record at the parent (the link between parent and child)
dig DS example.com @a.gtld-servers.net +short
Gotcha: DNSSEC adds operational risk. If the KMS key backing your Route 53 signing is deleted, or the DS record at your registrar expires, your domain becomes unresolvable for DNSSEC-validating resolvers. That's worse than not having DNSSEC at all. Enable it, but monitor signature expiry aggressively.
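Monitoring signature expiry can be as simple as reading the RRSIG expiration timestamp and alerting when it gets close. A sketch:

# Field 9 of an RRSIG record is its expiration time (YYYYMMDDHHMMSS, UTC)
dig example.com SOA +dnssec +noall +answer | awk '$4 == "RRSIG" {print "signature expires:", $9}'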
DNS over HTTPS and DNS over TLS¶
DNSSEC authenticates answers (they're genuine). DoH/DoT encrypt the query (nobody can see what you looked up). They're complementary, not competing.
| Protocol | Port | Blocks easily? | Who uses it |
|---|---|---|---|
| DNS (plain) | 53 (UDP/TCP) | Yes | Everything by default |
| DoT | 853 (TCP+TLS) | Yes (dedicated port) | systemd-resolved, Android 9+ |
| DoH | 443 (HTTPS) | Hard (same as web traffic) | Firefox, Chrome, Cloudflare |
Interview tip: If asked about DNSSEC vs DoH/DoT: DNSSEC provides authentication (the answer is genuine), DoH/DoT provide privacy (nobody sees your query). They're complementary. Test DoH yourself:
curl -s -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=example.com&type=A' | jq
Part 9: The 3 AM Debugging Ladder¶
When DNS is broken, work through this sequence. Each step rules out one layer:
1. Can you resolve it? dig app.example.com +short
2. Can public resolvers? dig @8.8.8.8 and dig @1.1.1.1 — if yes, your local resolver is the problem
3. What do the authoritative NSes say? Query each one individually: dig @ns1.example.com +short
4. Are secondaries in sync? Compare SOA serials: dig SOA example.com @ns1 vs @ns2
5. Where does the chain break? dig +trace walks root → TLD → authoritative
6. What does Route 53 think? aws route53 test-dns-answer bypasses all caching
7. Are health checks healthy? aws route53 get-health-check-status
8. What's on the wire? tcpdump -n -i any port 53 -c 50
Remember: DNS uses UDP port 53 for queries under 512 bytes and TCP port 53 for zone transfers and large responses (DNSSEC, large TXT records). If your firewall only allows UDP 53, DNSSEC validation and zone transfers will silently fail.
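A quick way to confirm the TCP path works end to end is to force dig over TCP and request a large DNSSEC answer. A sketch:

dig example.com SOA +tcp @8.8.8.8 +noall +comments          # should show NOERROR, not a timeout
dig example.com DNSKEY +dnssec @8.8.8.8 | grep 'MSG SIZE'   # large answers exercise EDNS0 and TCP fallback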
Exercises¶
Exercise 1: dig triage and trace (5 minutes)¶
Pick any domain and run the full diagnostic sequence:
dig example.com +short # What IP?
dig example.com @8.8.8.8 +short # Same from Google's resolver?
dig example.com NS +short # How many nameservers?
dig example.com SOA +short # Serial format?
dig example.com +trace # Walk the full chain
What to look for
- Do your local resolver and 8.8.8.8 return the same IP? If not, suspect caching or split-horizon.
- How many NS records are there? (There should be at least 2.)
- Is the SOA serial date-based (YYYYMMDDNN) or arbitrary?
- In the trace: how many root servers are listed? Which TLD server answered? What's the final TTL?
Exercise 2: Find the ndots problem (10 minutes)¶
If you have access to a Kubernetes cluster:
kubectl run -it --rm dnstest --image=busybox:1.36 -- sh
# Inside the pod:
cat /etc/resolv.conf
nslookup api.github.com
# Note the time
# Now try:
nslookup api.github.com.
# Note the time difference (with trailing dot should be faster)
What you should see
The trailing-dot version should resolve noticeably faster because it skips the search domain suffix attempts. The `resolv.conf` should show `ndots:5` and several search domains.
Exercise 3: Design a failover architecture (15 minutes)¶
You have a web application in us-east-1 (primary) and us-west-2 (DR). Design the Route 53 configuration:
- What routing policy do you use?
- What health check configuration?
- What TTL on the DNS records?
- What happens if the health checker itself is blocked by a firewall?
Solution
1. Failover routing with PRIMARY (us-east-1) and SECONDARY (us-west-2)
2. HTTPS health check on `/health`, 10-second interval, failure threshold of 3
3. TTL 60 seconds (low enough for fast failover, high enough to not overwhelm Route 53)
4. Route 53 removes the primary from DNS responses, causing all traffic to hit the secondary — even if the primary is actually healthy. Always allowlist Route 53 health checker IPs in your firewall/WAF.
Cheat Sheet¶
dig Quick Reference¶
| Command | What it does |
|---|---|
| dig example.com | Full query with all sections |
| dig +short example.com | Just the answer |
| dig @8.8.8.8 example.com | Query a specific resolver |
| dig +trace example.com | Walk the full resolution chain |
| dig -x 203.0.113.10 | Reverse lookup |
| dig example.com MX | Query specific record type |
| dig +dnssec example.com | Show DNSSEC signatures |
| dig example.com SOA +short | Check serial number |
DNS Record Types¶
| Type | Maps | Key fact |
|---|---|---|
| A | name → IPv4 | Most common record |
| AAAA | name → IPv6 | Needed for dual-stack |
| CNAME | name → name | Can't be at zone apex, can't coexist with other types |
| MX | domain → mail server | Priority number: lower = preferred |
| NS | domain → nameserver | Delegation chain |
| SOA | zone metadata | Serial must increment on every change |
| TXT | arbitrary text | SPF, DKIM, domain verification |
| PTR | IP → name | Required for email, reverse DNS |
| SRV | service location | Used by Consul, Kubernetes headless services |
| CAA | authorized CAs | Controls who can issue TLS certs |
Route 53 Routing Policies¶
| Policy | Behavior | Health check? |
|---|---|---|
| Simple | Return all values, random order | No |
| Failover | Primary until unhealthy, then secondary | Required |
| Weighted | Percentage-based distribution | Optional |
| Latency | Lowest latency region | Optional |
| Geolocation | User's country/continent | Optional |
| Multivalue | Up to 8 healthy records | Required |
TTL Strategy¶
| Scenario | TTL |
|---|---|
| Normal operations | 300–3600s |
| 48h before migration | 60s |
| During migration | 60s |
| After migration verified | Raise to 300–3600s |
| Static records (MX, NS) | 3600–86400s |
Takeaways¶
- dig is the DNS debugging tool. nslookup hides the information you need most: TTL, authoritative flags, and the full resolution chain. Reach for dig first, always.
- DNS doesn't propagate — caches expire. There's no push mechanism. "Propagation time" is the maximum TTL across all caching layers. Lower TTL before you need fast changes.
- Test every authoritative nameserver individually. One broken NS out of four means 25% of users fail intermittently — the hardest kind of failure to reproduce.
- Health checks can cause the outages they're designed to prevent. If a firewall or WAF blocks the health checker IPs, Route 53 removes a healthy endpoint from DNS. Always allowlist health checker IP ranges.
- Kubernetes ndots:5 multiplies external DNS queries by 4x. Set ndots: 2 or use trailing dots for pods that primarily call external services.
- DNSSEC authenticates, DoH/DoT encrypts. They solve different problems and are complementary. DNSSEC proves the answer is genuine; DoH/DoT hide what you asked.
Related Lessons¶
- What Happens When You Click a Link — follows a request end-to-end from DNS through TCP, TLS, HTTP, and back
- Connection Refused — differential diagnosis where DNS is one of many possible causes
- The Load Balancer Lied — health checks and routing gone wrong at the infrastructure layer
- Kubernetes Services: How Traffic Finds Your Pod — how CoreDNS fits into the Kubernetes networking model
- What Happens When Your Certificate Expires — TLS and DNS (CAA records) intersect