DNS Operations - Street-Level Ops¶
What experienced DNS operators know from years of "it's always DNS" incidents.
Quick Diagnosis Commands¶
# Quick A record check
dig +short app.example.com
# Compare answers from two nameservers
dig +short app.example.com @ns1.example.com
dig +short app.example.com @ns2.example.com
# Full trace to see the resolution path
dig +trace app.example.com
# Check SOA serial (are secondaries in sync?)
dig SOA example.com @ns1.example.com +short
dig SOA example.com @ns2.example.com +short
# Reverse lookup
dig -x 10.0.1.50
# Check records for a name (note: many servers answer ANY with a
# minimal response per RFC 8482 — query each type individually if so)
dig example.com ANY +noall +answer
# Test from public resolver (bypasses internal DNS)
dig app.example.com @8.8.8.8 +short
# Check DNS resolution time
dig app.example.com | grep "Query time"
# DNS traffic on the wire
tcpdump -n -i any port 53 -c 50
# Kubernetes CoreDNS check
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=30
kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup kubernetes.default
# Check /etc/resolv.conf inside a pod
kubectl exec <pod> -- cat /etc/resolv.conf
# BIND zone file syntax check
named-checkzone example.com /var/named/example.com.zone
named-checkconf /etc/named.conf
Gotcha: DNS Change Not Propagating¶
You updated the zone file on the primary nameserver. Secondary servers still serve the old record. dig @ns2 shows stale data.
Fix:
# 1. Check if you incremented the SOA serial
dig SOA example.com @ns1.example.com +short
# ns1.example.com. admin.example.com. 2026031502 3600 900 604800 300
dig SOA example.com @ns2.example.com +short
# ns1.example.com. admin.example.com. 2026031501 3600 900 604800 300 — stale! serial is older
# 2. If serial wasn't incremented, fix and reload
# Edit zone file, increment serial
# Then reload
rndc reload example.com
# 3. Force zone transfer on secondary
rndc retransfer example.com # On the secondary server
# 4. If using dynamic DNS, check for journal conflicts
ls -la /var/named/example.com.zone.jnl
# Delete journal and reload if corrupted
rndc freeze example.com
rm /var/named/example.com.zone.jnl
rndc thaw example.com
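The serial check in step 1 is worth scripting so drift is caught before users notice; a minimal sketch, assuming dig is installed and using this page's example nameserver names:

```shell
#!/bin/sh
# Sketch: compare SOA serials between primary and secondary for a zone.
# dig +short SOA prints: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM,
# so the serial is field 3.
get_serial() {  # $1 = nameserver, $2 = zone
    dig SOA "$2" @"$1" +short | awk '{print $3}'
}

# Serials match only when both are non-empty and equal
# (an empty answer means the query itself failed).
serials_match() {  # $1, $2 = serial numbers
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" = "$2" ]
}

check_zone() {  # $1 = zone; ns1/ns2 names are illustrative
    if serials_match "$(get_serial ns1.example.com "$1")" \
                     "$(get_serial ns2.example.com "$1")"; then
        echo "$1: in sync"
    else
        echo "$1: OUT OF SYNC — consider rndc retransfer on the secondary"
    fi
}
```

Run `check_zone example.com` from cron or a monitoring check; a forgotten serial bump then shows up as an alert rather than a stale-record incident.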
Gotcha: CNAME and Other Records Conflict¶
You add a CNAME for app.example.com but there is already an A record. BIND rejects the zone. Or worse, some resolvers return unpredictable results.
Fix: A CNAME cannot coexist with any other record type at the same name (DNSSEC metadata like RRSIG is the only exception). This also means a CNAME can never sit at the zone apex, which always carries SOA and NS records. If app.example.com has an A record, you cannot add a CNAME. Choose one:
# Option A: Use only A records
app IN A 10.0.1.50
app IN A 10.0.1.51
# Option B: Use CNAME (remove all other records for that name)
app IN CNAME loadbalancer.example.com.
# Now you CANNOT have app IN MX, app IN TXT, etc.
Gotcha: Kubernetes DNS Lookups Are Slow for External Domains¶
Your pods take 2-5 seconds to resolve api.github.com. The default ndots: 5 in Kubernetes means any name with fewer than 5 dots is tried against every search-domain suffix before being queried as-is.
Under the hood: with ndots: 5, resolving api.github.com generates 8 failed DNS queries (4 search suffixes x 2 for A + AAAA) before the final successful pair. At ~1ms each from CoreDNS that only adds 8-10ms per lookup, but when CoreDNS is under load or the upstream is slow it compounds to seconds.
Fix:
# Resolution attempts for "api.github.com" with ndots:5:
1. api.github.com.default.svc.cluster.local → NXDOMAIN
2. api.github.com.svc.cluster.local → NXDOMAIN
3. api.github.com.cluster.local → NXDOMAIN
4. api.github.com.example.com → NXDOMAIN (search domain)
5. api.github.com. → SUCCESS (finally!)
# Fix: Override dnsConfig for pods that make external calls
apiVersion: v1
kind: Pod
spec:
dnsConfig:
options:
- name: ndots
value: "2"
# Or append a trailing dot in your code: "api.github.com."
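The query amplification can be computed directly; a rough model, assuming the 4 search suffixes shown in the trace above and one A plus one AAAA query per attempt (ignoring negative caching):

```shell
#!/bin/sh
# Model how many DNS queries one lookup generates under ndots.
# A relative name with fewer dots than ndots is tried against every
# search suffix first; each attempt is doubled for A + AAAA.
count_dots() {
    echo "$1" | awk -F. '{print NF - 1}'
}

queries_for() {  # $1 = name, $2 = ndots, $3 = number of search suffixes
    case "$1" in
        *.) echo 2; return 0 ;;   # trailing dot = absolute name, no search
    esac
    if [ "$(count_dots "$1")" -lt "$2" ]; then
        echo $(( ($3 + 1) * 2 ))  # every suffix fails, then the bare name
    else
        echo 2                    # bare name is tried first and succeeds
    fi
}

queries_for api.github.com 5 4    # ndots:5 -> 10 queries, 8 wasted
queries_for api.github.com 2 4    # ndots:2 -> 2 queries
queries_for api.github.com. 5 4   # trailing dot -> 2 queries
```

The last line is why the "append a trailing dot in your code" trick works: an absolute name skips the search list entirely, no dnsConfig change required.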
Gotcha: resolv.conf Gets Overwritten¶
You manually edit /etc/resolv.conf and it gets overwritten by NetworkManager, systemd-resolved, or DHCP on next renewal.
Fix:
# On systemd-resolved systems
# Edit /etc/systemd/resolved.conf instead:
[Resolve]
DNS=10.0.1.10 10.0.1.11
FallbackDNS=8.8.8.8
Domains=example.com
systemctl restart systemd-resolved
# And make sure /etc/resolv.conf is the managed stub symlink:
ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
# On NetworkManager systems
# Edit /etc/NetworkManager/conf.d/dns.conf:
[main]
dns=none
# Then manage /etc/resolv.conf manually
# On DHCP systems
# Use dhclient hooks or cloud-init to set DNS
Gotcha: Split-Horizon Returns Wrong Answer¶
Internal users get the external IP for an internal service. They hit the firewall's external interface instead of the private IP. Traffic hairpins through the firewall, or worse, gets blocked.
Fix:
# 1. Verify which view the client is hitting
dig app.example.com @internal-dns +short # Should be 10.0.1.50
dig app.example.com @external-dns +short # Should be 203.0.113.50
# 2. Check BIND view match-clients
# Internal view must match the client's source IP
# If client is on 172.16.0.0/12 but view only matches 10.0.0.0/8,
# client falls through to external view
# 3. Fix: add all internal ranges to internal view
view "internal" {
match-clients {
10.0.0.0/8;
172.16.0.0/12;
192.168.0.0/16;
127.0.0.0/8;
};
# ...
};
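The "which view does this client hit" question from step 2 can be answered offline; a pure-shell IPv4 CIDR sketch mirroring the internal view's match-clients (illustrative, no input validation, no leading zeros in octets):

```shell
#!/bin/sh
# Does a client IP fall inside the internal view's match-clients?
ip_to_int() {  # dotted quad -> 32-bit integer
    OLDIFS=$IFS; IFS=.
    set -- $1
    IFS=$OLDIFS
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

in_cidr() {  # $1 = ip, $2 = network, $3 = prefix length
    ip=$(ip_to_int "$1"); net=$(ip_to_int "$2")
    mask=$(( (0xffffffff << (32 - $3)) & 0xffffffff ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

matches_internal_view() {  # ranges mirror the view config above
    in_cidr "$1" 10.0.0.0 8     || in_cidr "$1" 172.16.0.0 12 || \
    in_cidr "$1" 192.168.0.0 16 || in_cidr "$1" 127.0.0.0 8
}

matches_internal_view 172.16.5.9  && echo internal || echo external  # internal
matches_internal_view 203.0.113.4 && echo internal || echo external  # external
```

Before the 172.16.0.0/12 line was added, the first example would have printed "external" — exactly the fall-through that sends internal clients to the public IP.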
Pattern: Pre-Migration DNS Checklist¶
Gotcha: Lowering TTL only works if you do it at least one full TTL period before the migration. If your current TTL is 3600 (1 hour), caches worldwide already have records with that TTL. You need to wait at least 1 hour after publishing the lower TTL before any resolvers will respect it.
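That waiting-period rule is simple arithmetic; a small sketch (the helper names are illustrative) that also shows where the current cached TTL sits in a dig answer line:

```shell
#!/bin/sh
# Caches that fetched the record just before you published the lower
# TTL keep it for the full OLD TTL, so the earliest safe moment is:
earliest_migration() {  # $1 = epoch when lower TTL was published, $2 = old TTL
    echo $(( $1 + $2 ))
}

# Pull the TTL (second field) out of a dig answer line, e.g.
#   app.example.com. 3600 IN A 10.0.1.50
answer_ttl() {
    echo "$1" | awk '{print $2}'
}

now=$(date +%s)
echo "safe to migrate after epoch: $(earliest_migration "$now" 3600)"
echo "cached TTL: $(answer_ttl "app.example.com. 3600 IN A 10.0.1.50")"
```

Polling `dig app.example.com @8.8.8.8 +noall +answer` and watching that second field count down to 60 is a quick way to confirm public caches have picked up the lower TTL.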
# 48 hours before migration:
# 1. Lower TTL on records that will change
# Edit zone file: change TTL from 3600 to 60
app 60 IN A 10.0.1.50
# Increment serial, reload zone
rndc reload example.com
# 2. Verify TTL change propagated
dig app.example.com +noall +answer
# Second field of the answer line is the TTL — should show 60
# At migration time:
# 3. Update record to new IP
app 60 IN A 10.0.2.50
# Increment serial, reload
# 4. Verify new IP is served
dig +short app.example.com @ns1.example.com
dig +short app.example.com @ns2.example.com
dig +short app.example.com @8.8.8.8
# 5. Monitor for clients still hitting old IP
# On old server:
tcpdump -n -i eth0 port 443 -c 10
# After migration verified (24-48 hours):
# 6. Raise TTL back to normal
app 3600 IN A 10.0.2.50
Pattern: CoreDNS Custom Forwarding¶
# Forward specific domains to internal DNS
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
forward . 8.8.8.8 8.8.4.4
cache 30
loop
reload
}
example.com:53 {
forward . 10.0.1.10 10.0.1.11
cache 30
}
internal.corp:53 {
forward . 10.0.1.10 10.0.1.11
cache 30
}
Emergency: DNS Server Down, Everything Failing¶
Remember: DNS uses UDP port 53 for ordinary queries and TCP port 53 for zone transfers and for responses too large for UDP (classically over 512 bytes; EDNS0 raises the limit, but truncated answers still retry over TCP). If your firewall only allows UDP 53, zone transfers will fail and large DNSSEC or TXT responses will silently break.
# 1. Identify scope
# Can the server resolve anything?
dig +short google.com @127.0.0.1
dig +short google.com @8.8.8.8
# 2. Check named/bind status
systemctl status named
journalctl -u named --since "10 minutes ago"
# 3. Common causes:
# - Zone file syntax error after edit
named-checkconf /etc/named.conf
named-checkzone example.com /var/named/example.com.zone
# - Permission issue on zone files
ls -lZ /var/named/
# - Port 53 already in use
ss -tlnp | grep :53
# - Out of memory (large zones)
free -h
journalctl -u named | grep -i "out of memory"
# 4. Quick recovery: restart with config check
named-checkconf /etc/named.conf && systemctl restart named
# 5. If bind is down, point clients to secondary
# Update DHCP to serve ns2 as primary
# Or manually: echo "nameserver 10.0.1.11" > /etc/resolv.conf
Emergency: DNS Cache Poisoning Suspected¶
# 1. Check if answers are correct
dig app.example.com @your-resolver +short
# Compare with authoritative answer:
dig app.example.com @ns1.example.com +short
# 2. If answers differ, flush resolver cache
# BIND:
rndc flush
# systemd-resolved:
resolvectl flush-caches
# dnsmasq:
systemctl restart dnsmasq
# 3. Check for DNSSEC validation failures
dig app.example.com +dnssec @your-resolver
# 4. If you run a recursive resolver, restrict access
# Only allow queries from your networks
# Disable recursion on authoritative servers
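Step 1 above can be scripted so suspect names are checked in bulk; a minimal sketch, with `your-resolver` and the ns1 name as placeholders for your own servers:

```shell
#!/bin/sh
# Flag a resolver whose answer set differs from the authoritative one.
answers() {  # $1 = name, $2 = server; sorted for a stable comparison
    dig "$1" @"$2" +short | sort
}

# Answer sets match only when both are non-empty and identical
# (an empty set means the query failed, which also deserves a look).
same_answers() {  # $1, $2 = newline-separated answer sets
    [ -n "$1" ] && [ "$1" = "$2" ]
}

check() {  # $1 = name to audit
    if same_answers "$(answers "$1" your-resolver)" \
                    "$(answers "$1" ns1.example.com)"; then
        echo "$1: answers match"
    else
        echo "$1: MISMATCH — flush the resolver cache and investigate"
    fi
}
```

Loop `check` over your most sensitive names (login, payment, mail hosts); those are the records poisoning attempts target first.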
Quick Reference¶
- Runbook: DNS Resolution