
DNS Operations - Street-Level Ops

What experienced DNS operators know from years of "it's always DNS" incidents.

Quick Diagnosis Commands

# Quick A record check
dig +short app.example.com

# Compare answers from two nameservers
dig +short app.example.com @ns1.example.com
dig +short app.example.com @ns2.example.com

# Full trace to see the resolution path
dig +trace app.example.com

# Check SOA serial (are secondaries in sync?)
dig SOA example.com @ns1.example.com +short
dig SOA example.com @ns2.example.com +short

# Reverse lookup
dig -x 10.0.1.50

# Check all records for a name (many servers now return minimal answers
# to ANY per RFC 8482; query specific types if the output looks truncated)
dig example.com ANY +noall +answer

# Test from public resolver (bypasses internal DNS)
dig app.example.com @8.8.8.8 +short

# Check DNS resolution time
dig app.example.com | grep "Query time"

# DNS traffic on the wire
tcpdump -n -i any port 53 -c 50

# Kubernetes CoreDNS check
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=30
kubectl run -it --rm dnstest --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
# (pin busybox:1.28; nslookup in newer busybox builds is unreliable)

# Check /etc/resolv.conf inside a pod
kubectl exec <pod> -- cat /etc/resolv.conf

# BIND zone file syntax check
named-checkzone example.com /var/named/example.com.zone
named-checkconf /etc/named.conf

Gotcha: DNS Change Not Propagating

You updated the zone file on the primary nameserver. Secondary servers still serve the old record. dig @ns2 shows stale data.

Fix:

# 1. Check if you incremented the SOA serial
dig SOA example.com @ns1.example.com +short
# ns1.example.com. admin.example.com. 2026031502 3600 900 604800 300

dig SOA example.com @ns2.example.com +short
# ns1.example.com. admin.example.com. 2026031501 3600 900 604800 300   <- stale! serial is older

# 2. If serial wasn't incremented, fix and reload
# Edit zone file, increment serial
# Then reload
rndc reload example.com

# 3. Force zone transfer on secondary
rndc retransfer example.com   # On the secondary server

# 4. If using dynamic DNS, check for journal conflicts
ls -la /var/named/example.com.zone.jnl
# Delete journal and reload if corrupted
rndc freeze example.com
rm /var/named/example.com.zone.jnl
rndc thaw example.com
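The date-based serial convention in step 1 (YYYYMMDDNN) can be scripted. A minimal sketch, seeding "current" with a same-day serial so the increment branch is exercised; in real use, read the current serial from the zone file:

```shell
# YYYYMMDDNN convention: bump the two-digit revision for a same-day edit,
# start over at revision 00 on a new day.
today=$(date -u +%Y%m%d)
current="${today}01"                        # e.g. a serial from an edit earlier today
prefix=$(printf '%s' "$current" | cut -c1-8)
if [ "$prefix" = "$today" ]; then
    next=$((current + 1))                   # same day: increment the revision
else
    next="${today}00"                       # new day: fresh date prefix, revision 00
fi
echo "next serial: $next"
```

Secondaries only transfer when the serial increases, so forgetting this step is the single most common cause of "propagation" failures.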

Gotcha: CNAME and Other Records Conflict

You add a CNAME for app.example.com but there is already an A record. BIND rejects the zone. Or worse, some resolvers return unpredictable results.

Fix: A CNAME cannot coexist with any other record type for the same name. If app.example.com has an A record, you cannot add a CNAME. Choose one:

# Option A: Use only A records
app     IN  A   10.0.1.50
app     IN  A   10.0.1.51

# Option B: Use CNAME (remove all other records for that name)
app     IN  CNAME  loadbalancer.example.com.
# Now you CANNOT have app IN MX, app IN TXT, etc.
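A rough lint for this conflict: flag any owner name that carries a CNAME alongside other record types. It only understands the simple "name IN TYPE data" layout shown above (no $TTL lines or TTL columns), and the zone contents here are a hypothetical example:

```shell
# Build a demo zone with a deliberate CNAME/A conflict on "app"
cat > /tmp/demo.zone <<'EOF'
app     IN  A      10.0.1.50
app     IN  CNAME  loadbalancer.example.com.
www     IN  CNAME  app.example.com.
EOF

# Any name that has a CNAME plus at least one other record is a conflict
conflicts=$(awk '$3 == "CNAME" { cname[$1] = 1 } { count[$1]++ }
    END { for (n in cname) if (count[n] > 1) print n ": CNAME mixed with other records" }' /tmp/demo.zone)
echo "$conflicts"
```

For anything beyond trivial zones, named-checkzone remains the authoritative check.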

Gotcha: Kubernetes DNS Lookups Are Slow for External Domains

Your pods take 2-5 seconds to resolve api.github.com. The default ndots: 5 in Kubernetes means any name with fewer than five dots is first tried against the cluster search suffixes.

Under the hood: with ndots: 5, resolving api.github.com generates 8 failed DNS queries (4 search suffixes x 2 for A and AAAA) before the final successful lookup. At ~1ms each from CoreDNS that only adds 8-10ms, but when CoreDNS is under load or the upstream is slow, it compounds to seconds.

Fix:

# Resolution attempts for "api.github.com" with ndots:5:
1. api.github.com.default.svc.cluster.local  → NXDOMAIN
2. api.github.com.svc.cluster.local          → NXDOMAIN
3. api.github.com.cluster.local              → NXDOMAIN
4. api.github.com.example.com                → NXDOMAIN (search domain)
5. api.github.com.                           → SUCCESS (finally!)

# Fix: Override dnsConfig for pods that make external calls
apiVersion: v1
kind: Pod
metadata:
  name: external-caller
spec:
  containers:
    - name: app
      image: myapp:latest   # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  # Or append a trailing dot in your code: "api.github.com."
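The expansion above can be sketched directly: a name whose dot count is below ndots is tried against every search suffix before the absolute name. The search list here is the typical one for a pod in the "default" namespace, with example.com standing in for a node-level search domain:

```shell
name="api.github.com"
ndots=5
search="default.svc.cluster.local svc.cluster.local cluster.local example.com"

# Count the dots; below the ndots threshold, search suffixes go first
dots=$(printf '%s' "$name" | awk -F. '{print NF - 1}')
attempts=""
if [ "$dots" -lt "$ndots" ]; then
    for s in $search; do
        attempts="$attempts $name.$s"      # each of these returns NXDOMAIN
    done
fi
attempts="$attempts $name."                # absolute name, tried last
echo "$attempts" | wc -w                   # 5 lookups per address family
```

Lowering ndots to 2 makes the dot count (2) no longer below the threshold, so the absolute name is tried first and the four wasted round trips disappear.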

Gotcha: resolv.conf Gets Overwritten

You manually edit /etc/resolv.conf and it gets overwritten by NetworkManager, systemd-resolved, or DHCP on next renewal.

Fix:

# On systemd-resolved systems
# Edit /etc/systemd/resolved.conf instead:
[Resolve]
DNS=10.0.1.10 10.0.1.11
FallbackDNS=8.8.8.8
Domains=example.com
systemctl restart systemd-resolved

# On NetworkManager systems
# Edit /etc/NetworkManager/conf.d/dns.conf:
[main]
dns=none
# Then manage /etc/resolv.conf manually

# On DHCP systems
# Use dhclient hooks or cloud-init to set DNS
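A safe pre-flight before picking one of the fixes above: find out what currently owns resolv.conf. Sketched here on a scratch copy so it is side-effect free; point "target" at /etc/resolv.conf on a real host:

```shell
target=/tmp/resolv.conf.demo
printf '# Generated by NetworkManager\nnameserver 10.0.1.10\n' > "$target"

if [ -L "$target" ]; then
    owner="symlink -> $(readlink "$target")"   # /run/systemd/resolve/... means systemd-resolved
else
    owner=$(head -n 1 "$target")               # generators usually leave a comment header
fi
echo "$owner"
```

Edit the owner's config, not the file itself, or your change lasts exactly one DHCP lease.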

Gotcha: Split-Horizon Returns Wrong Answer

Internal users get the external IP for an internal service. They hit the firewall's external interface instead of the private IP. Traffic hairpins through the firewall, or worse, gets blocked.

Fix:

# 1. Verify which view the client is hitting
dig app.example.com @internal-dns +short   # Should be 10.0.1.50
dig app.example.com @external-dns +short   # Should be 203.0.113.50

# 2. Check BIND view match-clients
# Internal view must match the client's source IP
# If client is on 172.16.0.0/12 but view only matches 10.0.0.0/8,
# client falls through to external view

# 3. Fix: add all internal ranges to internal view
view "internal" {
    match-clients {
        10.0.0.0/8;
        172.16.0.0/12;
        192.168.0.0/16;
        127.0.0.0/8;
    };
    # ...
};
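A quick sanity check mirroring the match-clients list above: does a given client IP fall inside the internal view's ranges? Pure string prefix matching, good enough for eyeballing during an incident; the test IPs are examples:

```shell
# Returns 0 for RFC 1918 ranges plus loopback, 1 otherwise
in_internal() {
    case "$1" in
        10.*|192.168.*|127.*) return 0 ;;
        172.*)
            o2=$(printf '%s' "$1" | cut -d. -f2)
            [ "$o2" -ge 16 ] && [ "$o2" -le 31 ] && return 0 ;;
    esac
    return 1
}

in_internal 172.16.0.5  && echo "172.16.0.5: internal view"
in_internal 203.0.113.9 || echo "203.0.113.9: external view"
```

The 172.16.0.0/12 case is the one people forget: it spans 172.16 through 172.31 only, so 172.32.x.x correctly falls through to the external view.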

Pattern: Pre-Migration DNS Checklist

Gotcha: Lowering TTL only works if you do it at least one full TTL period before the migration. If your current TTL is 3600 (1 hour), caches worldwide already have records with that TTL. You need to wait at least 1 hour after publishing the lower TTL before any resolvers will respect it.

# 48 hours before migration:
# 1. Lower TTL on records that will change
# Edit zone file: change TTL from 3600 to 60
app     60  IN  A   10.0.1.50
# Increment serial, reload zone
rndc reload example.com

# 2. Verify TTL change propagated
dig +noall +answer app.example.com
# Second field of the answer line is the TTL; should now show 60

# At migration time:
# 3. Update record to new IP
app     60  IN  A   10.0.2.50
# Increment serial, reload

# 4. Verify new IP is served
dig +short app.example.com @ns1.example.com
dig +short app.example.com @ns2.example.com
dig +short app.example.com @8.8.8.8

# 5. Monitor for clients still hitting old IP
# On old server:
tcpdump -n -i eth0 port 443 -c 10

# After migration verified (24-48 hours):
# 6. Raise TTL back to normal
app     3600  IN  A   10.0.2.50
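The TTL verification in step 2 can be scripted: the second field of a dig answer line is the remaining TTL. The answer line below is a captured example; in practice feed it from `dig +noall +answer app.example.com @8.8.8.8`:

```shell
answer="app.example.com.   60   IN   A   10.0.2.50"

# Pull the TTL column out of the answer line
ttl=$(printf '%s\n' "$answer" | awk '{print $2}')
echo "remaining TTL: ${ttl}s"
[ "$ttl" -le 60 ] && echo "lowered TTL is live"
```

Against a caching resolver the value counts down toward zero between queries, so anything at or below the new TTL confirms the change has landed.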

Pattern: CoreDNS Custom Forwarding

# Forward specific domains to internal DNS
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        forward . 8.8.8.8 8.8.4.4
        cache 30
        loop
        reload
    }
    example.com:53 {
        forward . 10.0.1.10 10.0.1.11
        cache 30
    }
    internal.corp:53 {
        forward . 10.0.1.10 10.0.1.11
        cache 30
    }

Emergency: DNS Server Down, Everything Failing

Remember: DNS uses UDP port 53 for ordinary queries and TCP port 53 for zone transfers and for responses too large for UDP (classically over 512 bytes; EDNS0 raises the limit, but truncated replies still retry over TCP). If you only allow UDP 53 through your firewall, zone transfers and large DNSSEC responses will silently fail.
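To avoid that failure mode, firewall rules should admit both transports. A minimal nftables sketch, assuming a simple inet filter table; merge into your existing ruleset rather than replacing it:

```
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        udp dport 53 accept comment "DNS queries"
        tcp dport 53 accept comment "zone transfers, DNSSEC, truncated replies"
    }
}
```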

# 1. Identify scope
# Can the server resolve anything?
dig +short google.com @127.0.0.1
dig +short google.com @8.8.8.8

# 2. Check named/bind status
systemctl status named
journalctl -u named --since "10 minutes ago"

# 3. Common causes:
# - Zone file syntax error after edit
named-checkconf /etc/named.conf
named-checkzone example.com /var/named/example.com.zone

# - Permission issue on zone files
ls -lZ /var/named/

# - Port 53 already in use
ss -tlnp | grep :53

# - Out of memory (large zones)
free -h
journalctl -u named | grep -i "out of memory"

# 4. Quick recovery: restart with config check
named-checkconf /etc/named.conf && systemctl restart named

# 5. If bind is down, point clients to secondary
# Update DHCP to serve ns2 as primary
# Or manually: echo "nameserver 10.0.1.11" > /etc/resolv.conf

Emergency: DNS Cache Poisoning Suspected

# 1. Check if answers are correct
dig app.example.com @your-resolver +short
# Compare with authoritative answer:
dig app.example.com @ns1.example.com +short

# 2. If answers differ, flush resolver cache
# BIND:
rndc flush
# systemd-resolved:
resolvectl flush-caches
# dnsmasq:
systemctl restart dnsmasq

# 3. Check for DNSSEC validation failures
dig app.example.com +dnssec @your-resolver

# 4. If you run a recursive resolver, restrict access
# Only allow queries from your networks
# Disable recursion on authoritative servers

Quick Reference