DNS Operations - Primer
Why This Matters
DNS is the invisible foundation under every service you run. When DNS works, nobody thinks about it. When it breaks, everything breaks — and the symptoms look like application failures, network issues, or authentication problems until someone finally thinks to check name resolution. I have spent more hours debugging "application issues" that turned out to be DNS problems than I care to admit. Understanding DNS deeply — from BIND zone files to CoreDNS in Kubernetes — is a core ops skill that pays off every single week.
DNS is also one of the most commonly misconfigured pieces of infrastructure. A bad TTL decision can mean hours of stale records after a migration. A missing PTR record can break email delivery. A split-horizon mistake can make internal services unreachable from the wrong network.
Core Concepts
1. DNS Resolution Flow
User types app.example.com
|
v
Local Stub Resolver and Cache (servers listed in /etc/resolv.conf)
| (cache miss)
v
Recursive Resolver (ISP or 8.8.8.8)
|
v
Root Servers (.)
| "Ask .com servers"
v
TLD Servers (.com)
| "Ask example.com nameservers"
v
Authoritative Server (ns1.example.com)
| "app.example.com = 10.0.1.50"
v
Answer cached at each layer (per TTL)
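The referral chain above can be sketched as a toy simulation. Everything in it is hypothetical: the REFERRALS and ANSWERS tables, the server labels, and the resolve function stand in for real network queries, and no DNS traffic is sent.

```python
# Toy model of iterative resolution: each "server" either refers the
# resolver to a more specific zone or returns the final answer.
# All data is made up to mirror the diagram; nothing touches the network.

REFERRALS = {
    "root": {"com.": "tld-com"},
    "tld-com": {"example.com.": "ns1.example.com"},
}
ANSWERS = {
    "ns1.example.com": {"app.example.com.": "10.0.1.50"},
}


def resolve(name, server="root"):
    """Follow referrals from the root until a server answers for `name`."""
    path = [server]
    while True:
        if name in ANSWERS.get(server, {}):
            return ANSWERS[server][name], path
        # Pick a referral whose zone is a suffix of the queried name.
        next_server = None
        for zone, target in REFERRALS.get(server, {}).items():
            if name == zone or name.endswith("." + zone):
                next_server = target
        if next_server is None:
            raise LookupError(f"{server} has no answer or referral for {name}")
        server = next_server
        path.append(server)


address, path = resolve("app.example.com.")
print(address, path)
```

A real recursive resolver also caches each referral and answer for its TTL, so most lookups skip the root and TLD steps entirely.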
Name origin: DNS was invented by Paul Mockapetris in 1983 (RFCs 882 and 883, later superseded by RFCs 1034/1035). Before DNS, hostname-to-IP mappings lived in a single file called HOSTS.TXT maintained by the Stanford Research Institute. Every machine on the ARPANET fetched a fresh copy via FTP. By the early 1980s, the file was changing so frequently that it was already stale by the time it was downloaded. DNS replaced this with a distributed, hierarchical database.
2. Record Types That Matter
| Type | Purpose | Example |
|---|---|---|
| A | IPv4 address | app.example.com. 300 IN A 10.0.1.50 |
| AAAA | IPv6 address | app.example.com. 300 IN AAAA 2001:db8::1 |
| CNAME | Alias to another name | www.example.com. 300 IN CNAME app.example.com. |
| MX | Mail exchange | example.com. 3600 IN MX 10 mail.example.com. |
| NS | Nameserver delegation | example.com. 86400 IN NS ns1.example.com. |
| PTR | Reverse lookup | 50.1.0.10.in-addr.arpa. 3600 IN PTR app.example.com. |
| SOA | Start of authority | Serial, refresh, retry, expire, minimum TTL |
| SRV | Service location | _http._tcp.example.com. 300 IN SRV 10 0 8080 app.example.com. |
| TXT | Arbitrary text | SPF records, DKIM, domain verification |
| CAA | Certificate authority auth | example.com. 3600 IN CAA 0 issue "letsencrypt.org" |
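The owner name of a PTR record is the IP address reversed under in-addr.arpa (or ip6.arpa for IPv6). As a small illustration, Python's standard ipaddress module can build that name; the ptr_name helper is a hypothetical wrapper added here, not part of any DNS tool.

```python
import ipaddress


def ptr_name(address):
    """Return the reverse-lookup owner name (with trailing dot) for an IP."""
    return ipaddress.ip_address(address).reverse_pointer + "."


print(ptr_name("10.0.1.50"))    # 50.1.0.10.in-addr.arpa.
print(ptr_name("2001:db8::1"))  # nibble-reversed name under ip6.arpa.
```

This is the same name that dig -x constructs before sending the query.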
3. BIND Configuration
BIND (named) is one of the most widely deployed authoritative DNS servers. It has been around since the 1980s and runs a significant portion of the internet's DNS infrastructure.
Name origin: BIND stands for Berkeley Internet Name Domain. It was written by four UC Berkeley graduate students in the early 1980s as part of a DARPA grant. The daemon is called named, literally "name daemon." BIND is now maintained by ISC (Internet Systems Consortium).
# /etc/named.conf (main config)
options {
    listen-on port 53 { 127.0.0.1; 10.0.1.10; };
    directory "/var/named";
    allow-query { localhost; 10.0.0.0/8; };
    allow-transfer { 10.0.1.11; };    # Secondary DNS
    recursion no;                     # Authoritative only
    dnssec-validation auto;
};
zone "example.com" IN {
    type master;
    file "example.com.zone";
    allow-update { none; };
    notify yes;
};
zone "1.0.10.in-addr.arpa" IN {
    type master;
    file "10.0.1.rev";
};
; /var/named/example.com.zone
; Records with no owner name must start with whitespace so they inherit
; the previous owner (here "@", the zone apex).
$TTL 300
@        IN  SOA  ns1.example.com. admin.example.com. (
                  2026031501  ; Serial (YYYYMMDDNN)
                  3600        ; Refresh (1 hour)
                  900         ; Retry (15 min)
                  604800      ; Expire (1 week)
                  300         ; Negative-cache TTL (5 min)
                  )
         IN  NS     ns1.example.com.
         IN  NS     ns2.example.com.
         IN  MX     10 mail.example.com.
ns1      IN  A      10.0.1.10
ns2      IN  A      10.0.1.11
app      IN  A      10.0.1.50
app      IN  A      10.0.1.51   ; Round-robin
mail     IN  A      10.0.1.60
www      IN  CNAME  app.example.com.
staging  IN  A      10.0.2.50
Critical: The serial number MUST increase with every change. If it does not, secondary servers will not pick up the update.
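One way to make the serial bump mechanical is to compute it instead of typing it. The sketch below assumes the YYYYMMDDNN convention used in the zone file, with NN starting at 01; next_serial is a hypothetical helper, not a BIND tool.

```python
from datetime import date


def next_serial(current, today):
    """Return the next YYYYMMDDNN serial, always greater than `current`."""
    # First revision of the day, e.g. 2026031601 for 16 March 2026.
    base = int(today.strftime("%Y%m%d")) * 100 + 1
    # If today's serial is already in use (or the current serial is ahead
    # of today's date), increment instead so the value never goes backwards.
    return max(base, current + 1)


print(next_serial(2026031501, date(2026, 3, 15)))  # 2026031502
print(next_serial(2026031501, date(2026, 3, 16)))  # 2026031601
```

After more than 99 changes in one day the counter spills into the date digits. The serial no longer matches the calendar, but it still increases, which is the only property secondaries depend on.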
4. Split-Horizon DNS
Split-horizon DNS returns different answers to internal and external clients. It is essential in environments where internal services use private IPs but external clients need public IPs.
# /etc/named.conf with views
view "internal" {
    match-clients { 10.0.0.0/8; 172.16.0.0/12; 192.168.0.0/16; };
    zone "example.com" {
        type master;
        file "example.com.internal.zone";
    };
};
view "external" {
    match-clients { any; };
    zone "example.com" {
        type master;
        file "example.com.external.zone";
    };
};
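BIND checks views in the order they appear and uses the first one whose match-clients list contains the client, which is why the catch-all "external" view must come last. The sketch below models that first-match rule. The VIEWS table and select_view function are illustrative only, and 0.0.0.0/0 plus ::/0 stand in for BIND's "any".

```python
import ipaddress

# Views in the order they appear in named.conf; the first match wins.
VIEWS = [
    ("internal", ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]),
    ("external", ["0.0.0.0/0", "::/0"]),  # stands in for "any"
]


def select_view(client_ip):
    """Return the name of the first view whose match list contains the client."""
    client = ipaddress.ip_address(client_ip)
    for name, networks in VIEWS:
        for net in networks:
            network = ipaddress.ip_network(net)
            # Only compare addresses of the same IP version.
            if client.version == network.version and client in network:
                return name
    return None


print(select_view("10.0.1.50"))    # internal
print(select_view("203.0.113.7"))  # external
```

Swapping the two entries in VIEWS would send every client to "external", which is the classic split-horizon ordering mistake.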
5. DNS Debugging
# dig - the primary DNS debugging tool
dig app.example.com # Basic A record lookup
dig app.example.com @8.8.8.8 # Query specific server
dig app.example.com +short # Just the answer
dig app.example.com +trace # Full resolution path
dig -x 10.0.1.50 # Reverse lookup
dig example.com MX # MX records
dig example.com NS # Nameservers
dig example.com SOA # SOA record (serial)
dig example.com ANY +noall +answer # All records (many servers refuse or minimize ANY, see RFC 8482)
dig example.com +dnssec # Show DNSSEC info
# Check zone transfer
dig @ns1.example.com example.com AXFR
# nslookup (simpler, available everywhere)
nslookup app.example.com
nslookup -type=MX example.com
nslookup app.example.com 10.0.1.10 # Query specific server
# DNS traffic capture
tcpdump -n -i eth0 port 53 -w dns.pcap
tcpdump -n -i eth0 port 53 -l # Live output
# Check /etc/resolv.conf
cat /etc/resolv.conf
# nameserver 10.0.1.10
# nameserver 10.0.1.11
# search example.com
# options timeout:2 attempts:3
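When resolution behaves differently on two hosts, comparing their resolver settings is often the quickest check. The following is a minimal sketch of a resolv.conf parser that handles only the nameserver, search, and options keywords shown above; parse_resolv_conf is a hypothetical helper and does not cover every directive in resolv.conf(5).

```python
def parse_resolv_conf(text):
    """Parse resolv.conf text into nameservers, search domains, and options."""
    config = {"nameservers": [], "search": [], "options": {}}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # the last search line wins
        elif keyword == "options":
            for opt in values:
                key, _, val = opt.partition(":")
                # "timeout:2" becomes 2; a bare flag such as "rotate" becomes True.
                config["options"][key] = int(val) if val.isdigit() else True
    return config


SAMPLE = """\
nameserver 10.0.1.10
nameserver 10.0.1.11
search example.com
options timeout:2 attempts:3
"""
print(parse_resolv_conf(SAMPLE))
```

On a real host, the input would come from reading /etc/resolv.conf instead of the SAMPLE string.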
6. TTL Strategy
TTL (Time-To-Live) = how long resolvers cache the answer
High TTL (3600-86400):
+ Less load on authoritative servers
+ Faster resolution for clients (cached)
- Slow propagation of changes
- Long outage if you need to change an IP quickly
Low TTL (30-300):
+ Fast propagation of changes
+ Quick failover during incidents
- More load on authoritative servers
- More cache misses, so more lookups pay full resolution latency
Strategy:
Normal operations: 300-3600 seconds
Before a migration: Lower to 60 seconds 48 hours before
During migration: Keep at 60 seconds
After migration verified: Raise back to 300-3600
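The timing behind this strategy is simple arithmetic: a resolver that cached the record just before the TTL was lowered may keep it for the full old TTL. The sketch below works out the deadlines under that assumption. Both function names are hypothetical, and real resolvers may cap or extend TTLs.

```python
from datetime import datetime, timedelta


def latest_ttl_drop(cutover, old_ttl_seconds):
    """Latest time to publish the lowered TTL so old cached copies expire by cutover."""
    return cutover - timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(change_time, ttl_seconds):
    """Latest time a well-behaved resolver may still serve the old record."""
    return change_time + timedelta(seconds=ttl_seconds)


cutover = datetime(2026, 3, 15, 12, 0)
print(latest_ttl_drop(cutover, 86400))      # 2026-03-14 12:00:00
print(worst_case_stale_until(cutover, 60))  # 2026-03-15 12:01:00
```

With an 86400-second TTL the lowered value must be live at least 24 hours before cutover, so the 48-hour lead time above leaves a full day of margin.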
7. CoreDNS in Kubernetes
CoreDNS is the default DNS server in Kubernetes. It resolves service names, pod names, and external names for all cluster traffic.
Kubernetes DNS resolution:
my-service → my-service.default.svc.cluster.local
my-service.other-ns → my-service.other-ns.svc.cluster.local
my-service.other-ns.svc → my-service.other-ns.svc.cluster.local
Pod DNS policy:
dnsPolicy: ClusterFirst → Use CoreDNS (default)
dnsPolicy: Default → Use node's /etc/resolv.conf
dnsPolicy: None → Use custom dnsConfig
# CoreDNS Corefile (from ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
8. DNSSEC Basics
DNSSEC adds cryptographic signatures to DNS responses.
Without DNSSEC:
Client asks: "What is app.example.com?"
Attacker intercepts and returns: "192.168.1.1" (malicious IP)
Client has no way to verify the answer
With DNSSEC:
Zone records are signed with the zone's private key (signatures published as RRSIG records)
Resolver verifies signature using published public key (DNSKEY)
Forged responses fail signature verification
Key types and related records:
KSK (Key Signing Key) → signs the zone's DNSKEY records
ZSK (Zone Signing Key) → signs all other records in the zone
DS (Delegation Signer) → hash of the child zone's KSK, published in the parent zone to chain trust
Common Pitfalls
Remember: Mnemonic for DNS record types: A CNAME MX NS SOA = "A Cat Might Nap Soundly On Anything." A (address), CNAME (alias), MX (mail), NS (nameserver), SOA (start of authority). These are among the record types you will look up most often.
Debug clue: When dig returns NXDOMAIN, the name does not exist at all. When it returns NOERROR with an empty answer section, the name exists but has no record of the type you asked for (e.g., querying AAAA for a name that only has an A record). These two conditions look identical to applications but have very different causes.
War story: One of the most infamous DNS outages was the 2016 Dyn DDoS attack. The Mirai botnet flooded Dyn's managed DNS infrastructure with traffic from compromised IoT devices, disrupting Twitter, Reddit, Netflix, and GitHub for hours. The root cause was not a DNS misconfiguration; it was that too many major services depended on a single DNS provider without a secondary.
- Forgetting the trailing dot: In zone files, app.example.com (no dot) becomes app.example.com.example.com. The trailing dot means "fully qualified."
- Not incrementing the SOA serial: You edit the zone file but forget to increment the serial. Secondary servers never pick up the change. Use YYYYMMDDNN format.
- CNAME at the zone apex: example.com. IN CNAME other.com. is illegal per RFC 1034, because a CNAME cannot coexist with the SOA and NS records the apex must carry. Use ALIAS or ANAME if your DNS provider supports it, or use A records.
- TTL too high before a migration: Your records have 86400 TTL. You change the IP. Some clients cache the old IP for 24 hours. Lower TTL before you need fast changes.
- Missing PTR records: Forward lookup works but reverse does not. This breaks email delivery, SSH host verification, and some logging systems.
- Allowing zone transfers to anyone: allow-transfer { any; } lets anyone dump your entire zone. Restrict to secondary nameservers only.
- Search domain appended unexpectedly: /etc/resolv.conf has search example.com. A lookup for app first tries app.example.com. This causes confusion when internal and external names collide.
- ndots causing slow external lookups in Kubernetes: The default ndots: 5 in a pod's /etc/resolv.conf means api.github.com is tried against every search suffix (three cluster suffixes, plus any inherited from the node) before the real lookup. Override with dnsConfig for pods that make many external calls.
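The search-list and ndots behavior can be made concrete with a simplified model of a glibc-style stub resolver. The candidate_queries function and the POD_SEARCH list (what a pod in the default namespace would typically see) are assumptions for illustration, and other resolver implementations may order their attempts differently.

```python
# Typical search list for a pod in the "default" namespace (assumed here).
POD_SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_queries(name, search, ndots):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is skipped
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = name + "."
    # Names with at least `ndots` dots are tried as-is first;
    # shorter names go through the search list first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


for query in candidate_queries("api.github.com", POD_SEARCH, ndots=5):
    print(query)
```

With ndots set to 5, the real name is tried last, after three failed cluster lookups. Writing the name with a trailing dot (api.github.com.) or lowering ndots through dnsConfig moves it to the front.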