DNS Deep Dive - Street-Level Ops¶
What experienced operators know from years of debugging "it's always DNS." These are the commands you run at 2 AM when resolution is broken and everything is on fire.
dig Deep Dive¶
dig is the primary DNS debugging tool. Learn it well — everything else is a wrapper or a subset.
Basic Queries¶
# Simple A record lookup
dig app.example.com
# Just the answer, no noise
dig +short app.example.com
# Query a specific nameserver
dig app.example.com @8.8.8.8
# Query a specific record type
dig -t MX example.com
dig -t AAAA app.example.com
dig -t SRV _http._tcp.example.com
dig -t TXT example.com
dig -t NS example.com
dig -t SOA example.com
dig -t CAA example.com
# Reverse lookup
dig -x 10.0.2.100
Tracing Resolution¶
# Full trace from root to answer — shows every step of the resolution chain
dig +trace app.example.com
# Output shows:
# . 518400 IN NS a.root-servers.net.
# com. 172800 IN NS a.gtld-servers.net.
# example.com. 172800 IN NS ns1.example.com.
# app.example.com. 300 IN A 10.0.2.100
# Trace with DNSSEC validation
dig +trace +dnssec app.example.com
Advanced dig Flags¶
# Show answer, authority, and additional sections (add +question to include the question too)
dig app.example.com +noall +answer +authority +additional
# Show only the answer section
dig app.example.com +noall +answer
# Show query time and server used
dig app.example.com +stats
# Request TCP instead of UDP
dig +tcp app.example.com
# Set EDNS buffer size
dig +bufsize=4096 app.example.com
# Attempt zone transfer (most servers refuse unless you're an allowed secondary)
dig @ns1.example.com example.com AXFR
# Check if recursion is available (+recurse is dig's default; shown here for clarity)
dig +recurse app.example.com @target-server
# Look for the "ra" flag in the response header (recursion available)
# Query with specific source IP (useful on multi-homed servers)
dig -b 10.0.1.5 app.example.com @10.0.1.10
nslookup vs dig vs host¶
# nslookup — basic, available everywhere, interactive mode
nslookup app.example.com
nslookup -type=MX example.com
nslookup app.example.com 10.0.1.10 # specific server
# host — simple, one-line output
host app.example.com
host -t MX example.com
host -t NS example.com
# dig — full control, scriptable, shows TTL and flags
dig +short app.example.com
Use dig for debugging. nslookup and host are fine for quick checks, but dig shows TTLs, flags, authority sections, and supports tracing. In incident response, dig is the only tool that gives you the full picture.
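Because dig's header is structured text, "scriptable" means you can pull the rcode and flags out with standard tools. A minimal sketch (the two canned lines stand in for real output; in practice pipe `dig app.example.com | parse_dig_header`):

```shell
# parse rcode and flags out of dig's header lines
parse_dig_header() {
  sed -n \
    -e 's/.*status: \([A-Z]*\),.*/rcode=\1/p' \
    -e 's/^;; flags: \([a-z ]*\);.*/flags=\1/p'
}

# demo against a canned header (the format dig prints for a successful lookup)
printf '%s\n' \
  ';; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31337' \
  ';; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1' \
  | parse_dig_header
```

Checking for `ra` in the extracted flags is how you script the recursion-available test shown earlier.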
Debugging DNS Resolution Failures¶
Step 1: Identify Which Resolver the Client Uses¶
# What resolver is configured?
cat /etc/resolv.conf
# On systemd-resolved systems
resolvectl status
resolvectl dns
# What is the host actually using? (nsswitch order)
grep hosts /etc/nsswitch.conf
# hosts: files dns myhostname
# → checks /etc/hosts first, then DNS
# Is there a local override in /etc/hosts?
grep app.example.com /etc/hosts
Step 2: Test From the Client's Perspective¶
# Use the same resolver the client uses
dig app.example.com @$(awk '/^nameserver/{print $2; exit}' /etc/resolv.conf)
# Compare with a known-good public resolver
dig app.example.com @8.8.8.8 +short
# Compare with the authoritative server directly
dig app.example.com @ns1.example.com +short
Step 3: Check for Caching Issues¶
# Query with CD flag (checking disabled) to bypass DNSSEC validation
dig +cd app.example.com
# Check the TTL — if it's decreasing with repeated queries, you're hitting cache
dig app.example.com | grep -E "^app"
# Wait a few seconds
dig app.example.com | grep -E "^app"
# If TTL decreased, the answer is cached
# Flush caches
# systemd-resolved:
resolvectl flush-caches
# BIND:
rndc flush
# dnsmasq (SIGHUP clears its cache and reloads /etc/hosts):
kill -HUP $(pidof dnsmasq)
# Chrome browser: chrome://net-internals/#dns → Clear host cache
# macOS:
# sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Windows:
# ipconfig /flushdns
Step 4: Check the Authoritative Server¶
# Is the authoritative server responding?
dig app.example.com @ns1.example.com +short
# Is the zone loaded?
dig SOA example.com @ns1.example.com
# Are primary and secondary in sync? (compare serials)
dig SOA example.com @ns1.example.com +short
dig SOA example.com @ns2.example.com +short
# Check zone file syntax
named-checkzone example.com /var/named/zones/example.com
named-checkconf /etc/named.conf
DNS Propagation Checking¶
After making a DNS change, verify it has propagated:
# Check from multiple public resolvers
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
  echo -n "$resolver: "
  dig +short app.example.com @$resolver
done
# Check from authoritative servers
for ns in $(dig +short NS example.com); do
  echo -n "$ns: "
  dig +short app.example.com @$ns
done
# Check current TTL (how long until cache expires)
dig app.example.com @8.8.8.8 | awk '/^app/{print "TTL:", $2}'
# Monitor propagation over time
watch -n 10 'dig +short app.example.com @8.8.8.8'
TTL Strategy for Migrations¶
# Current state: Check existing TTL
dig app.example.com @ns1.example.com | awk '/^app/{print "Current TTL:", $2}'
# Step 1: Lower TTL to 60 seconds (48 hours before migration)
# Edit zone file:
# app 60 IN A 10.0.2.100
# Increment serial, then:
rndc reload example.com
# Step 2: Verify TTL change propagated
# Wait up to old-TTL seconds, then:
dig app.example.com @8.8.8.8 | awk '/^app/{print "TTL:", $2}'
# Should show 60 or less
# Step 3: At migration time — change the IP
# Edit zone file:
# app 60 IN A 10.0.3.200
# Increment serial, then:
rndc reload example.com
# Step 4: Verify from multiple vantage points
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo -n "$resolver: "
  dig +short app.example.com @$resolver
done
# Step 5: Monitor old server for straggler connections
# On old server:
ss -tn | grep :443 | wc -l
tcpdump -n -i eth0 port 443 -c 20
# Step 6: After 48 hours stable — raise TTL back
# Edit zone file:
# app 3600 IN A 10.0.3.200
# Increment serial, reload
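The "verify TTL" steps lend themselves to scripting. A minimal sketch that extracts the answer-section TTL so a script can gate the cutover on "TTL <= 60" (the record owner `app.example.com.` matches the walkthrough above):

```shell
# extract the A-record TTL from dig output on stdin
answer_ttl() {
  awk '$1 == "app.example.com." && $4 == "A" { print $2; exit }'
}

# real-world gate before migrating:
#   while [ "$(dig app.example.com @8.8.8.8 | answer_ttl)" -gt 60 ]; do sleep 30; done
# demo with a canned answer line:
printf 'app.example.com.\t42\tIN\tA\t10.0.2.100\n' | answer_ttl
```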
Route 53 Health Checks and Failover¶
# List health checks
aws route53 list-health-checks --query 'HealthChecks[*].{Id:Id,Config:HealthCheckConfig.FullyQualifiedDomainName}' --output table
# Get health check status
aws route53 get-health-check-status --health-check-id HC-ABC123
# Create a health check
aws route53 create-health-check \
  --caller-reference "app-$(date +%s)" \
  --health-check-config '{
    "FullyQualifiedDomainName": "app.example.com",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'
# Create failover records
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "10.0.2.100"}],
        "HealthCheckId": "HC-ABC123"
      }
    }]
  }'
CoreDNS in Kubernetes¶
Debugging Pod DNS¶
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Test DNS from inside a debug pod
kubectl run dnstest --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# More thorough test with dig
kubectl run dnstest --image=nicolaka/netshoot --rm -it --restart=Never -- \
  dig kubernetes.default.svc.cluster.local
# Check what a pod sees as its resolver
kubectl exec <pod> -- cat /etc/resolv.conf
# Output:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
# Test service resolution from a specific namespace
kubectl run dnstest -n my-namespace --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup my-service.my-namespace.svc.cluster.local
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Check CoreDNS metrics
kubectl port-forward -n kube-system svc/kube-dns 9153:9153
curl -s localhost:9153/metrics | grep coredns_dns_requests_total
Customizing CoreDNS¶
# Edit the Corefile to add custom forwarding
kubectl edit configmap coredns -n kube-system
# Add a server block for internal domains:
# internal.example.com:53 {
# errors
# cache 30
# forward . 10.0.1.10 10.0.1.11
# }
# Restart CoreDNS to pick up changes
kubectl rollout restart deployment coredns -n kube-system
/etc/resolv.conf Debugging¶
Search Domain and ndots Interaction¶
# Given this resolv.conf:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
# Lookup "api.github.com" (2 dots, less than ndots:5)
# Resolution attempts:
# 1. api.github.com.default.svc.cluster.local → NXDOMAIN
# 2. api.github.com.svc.cluster.local → NXDOMAIN
# 3. api.github.com.cluster.local → NXDOMAIN
# 4. api.github.com. → SUCCESS
# That is 3 wasted queries before the real one (doubled if the client
# also asks for AAAA at each step)
# Lookup "app" (0 dots, less than ndots:5)
# 1. app.default.svc.cluster.local → SUCCESS (if service exists)
# This is the intended behavior for short service names
# Lookup "app.example.com." (trailing dot = FQDN)
# Resolved exactly as given — no search domains appended
# The trailing dot bypasses the ndots/search logic entirely
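You can watch the real attempt sequence with `dig +search +showsearch api.github.com`. The expansion logic itself is simple enough to model by hand — a minimal sketch (`expand_search` is a hypothetical helper, not a real tool; the search list and ndots are the ones from the resolv.conf above):

```shell
# model glibc's search-list expansion: name, ndots, then the search domains
expand_search() {
  name=$1; ndots=$2; shift 2
  case $name in
    *.) printf '%s\n' "$name"; return ;;  # trailing dot: FQDN, search list skipped
  esac
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  # enough dots: the literal name is tried first
  [ "$dots" -ge "$ndots" ] && printf '%s.\n' "$name"
  for d in "$@"; do printf '%s.%s.\n' "$name" "$d"; done
  # too few dots: the literal name is tried last
  [ "$dots" -lt "$ndots" ] && printf '%s.\n' "$name"
}

expand_search api.github.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Running it prints the four attempts listed above, in order; `expand_search app.example.com. 5` prints only the FQDN.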
Fixing resolv.conf Being Overwritten¶
# Check who manages resolv.conf
ls -la /etc/resolv.conf
# If it's a symlink → systemd-resolved or NetworkManager controls it
# systemd-resolved: configure via resolved.conf
cat /etc/systemd/resolved.conf
# [Resolve]
# DNS=10.0.1.10 10.0.1.11
# Domains=example.com
systemctl restart systemd-resolved
# NetworkManager: use nmcli
nmcli con mod "Wired connection 1" ipv4.dns "10.0.1.10 10.0.1.11"
nmcli con mod "Wired connection 1" ipv4.dns-search "example.com"
nmcli con up "Wired connection 1"
# DHCP client: use dhclient hooks
# /etc/dhcp/dhclient.conf
# prepend domain-name-servers 10.0.1.10, 10.0.1.11;
# supersede domain-search "example.com";
# Make resolv.conf immutable (last resort, fragile)
chattr +i /etc/resolv.conf
# WARNING: This breaks NetworkManager and systemd-resolved
DNS Round-Robin and Its Limitations¶
# Set up round-robin
# In zone file:
# app 300 IN A 10.0.2.100
# app 300 IN A 10.0.2.101
# app 300 IN A 10.0.2.102
# Verify rotation
for i in $(seq 1 5); do dig +short app.example.com; echo "---"; done
# Limitations:
# 1. No health checking — dead backends still get traffic
# 2. Caching skews distribution — resolvers cache one ordering
# 3. Some clients always pick the first address
# 4. Connection count varies wildly (not true load balancing)
# 5. Session persistence is impossible
# Round-robin is acceptable for:
# - Distributing across healthy, equivalent backends
# - Low-stakes internal services
# - As a complement to (not replacement for) a load balancer
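Limitation 2 is easy to demonstrate: sample the first returned address repeatedly and count the skew. A sketch (the canned results stand in for real dig answers):

```shell
# count how often each address comes back first; feed it real data with:
#   for i in $(seq 1 50); do dig +short app.example.com | head -1; done | first_addr_skew
first_addr_skew() { sort | uniq -c | sort -rn; }

# canned sample showing the skew a caching resolver produces
printf '10.0.2.100\n10.0.2.100\n10.0.2.100\n10.0.2.101\n' | first_addr_skew
```

An even split across the three records would show roughly equal counts; a caching resolver in the path typically does not.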
Clearing DNS Caches¶
When you need changes to take effect immediately, you need to clear caches at every layer:
# 1. Local resolver cache
# systemd-resolved:
sudo resolvectl flush-caches
resolvectl statistics # verify cache size dropped to 0
# BIND (if running a local resolver):
sudo rndc flush
# Unbound:
sudo unbound-control flush_zone example.com
# Or flush everything:
sudo unbound-control flush_zone .
# dnsmasq (SIGHUP clears the cache; a full restart also works):
sudo kill -HUP $(pidof dnsmasq)
# 2. Application-level caches
# Chrome: chrome://net-internals/#dns → Clear host cache
# Firefox: about:networking#dns → Clear DNS Cache
# curl: has no persistent cache (each invocation is fresh)
# Java: JVM caches DNS (networkaddress.cache.ttl in java.security)
# Python: no built-in DNS cache, but some libraries cache (requests-cache)
# 3. nscd (Name Service Cache Daemon, if running)
sudo nscd -i hosts
# Or restart it:
sudo systemctl restart nscd
# 4. OS-level (macOS)
# sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# 5. OS-level (Windows)
# ipconfig /flushdns
Emergency: Total DNS Failure¶
# 1. Can the box reach any DNS server?
dig +short google.com @127.0.0.1 # Local resolver
dig +short google.com @10.0.1.10 # Internal resolver
dig +short google.com @8.8.8.8 # Public resolver
# 2. Is it a network problem or a DNS problem?
ping -c 1 8.8.8.8 # Can we reach Google's IP?
# If ping works but DNS doesn't → DNS problem
# If ping fails → network problem
# 3. Check if DNS service is running
systemctl status systemd-resolved
systemctl status named
systemctl status unbound
# 4. Check firewall rules (is port 53 blocked?)
iptables -L -n | grep 53
ss -ulnp | grep :53
ss -tlnp | grep :53
# 5. Temporary workaround: point at public DNS
echo "nameserver 8.8.8.8" > /etc/resolv.conf
# WARNING: This bypasses internal DNS — internal names won't resolve
# 6. Check for DNS amplification / DDoS
tcpdump -n -i any port 53 -c 100 | head -30
# Look for unusually high query volumes from unexpected sources
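The checks above collapse into one quick pass. A sketch (resolver IPs 127.0.0.1 and 10.0.1.10 are placeholders for your local and internal resolvers; every check tolerates failure, so it runs to completion even on a badly broken box):

```shell
# one-pass DNS triage: which resolvers answer, and is the network itself up?
dns_triage() {
  for srv in 127.0.0.1 10.0.1.10 8.8.8.8; do  # local, internal, public (placeholders)
    if [ -n "$(dig +short +time=2 +tries=1 google.com @"$srv" 2>/dev/null)" ]; then
      echo "OK:   DNS via $srv"
    else
      echo "FAIL: DNS via $srv"
    fi
  done
  if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo "OK:   raw IP reachability (lookups failing => DNS problem)"
  else
    echo "FAIL: raw IP reachability (network problem, not just DNS)"
  fi
}
dns_triage
```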
Pattern: DNS Monitoring¶
# Prometheus metrics from CoreDNS
# coredns_dns_requests_total — total queries
# coredns_dns_responses_total — total responses by rcode
# coredns_dns_request_duration_seconds — latency histogram
# coredns_cache_hits_total — cache hit rate
# coredns_cache_misses_total — cache miss rate
# Alert on DNS errors
# expr: rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) > 0.1
# External DNS check script
#!/bin/bash
DOMAINS="app.example.com api.example.com www.example.com"
for domain in $DOMAINS; do
  result=$(dig +short +time=2 +tries=1 "$domain")
  if [ -z "$result" ]; then
    echo "ALERT: $domain resolution failed"
  fi
done
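A stricter variant also compares each answer to the IP you expect, which catches stale records after a migration — a sketch where the EXPECTED pairs and the canned input are illustrative:

```shell
# expected name=IP pairs (illustrative values)
EXPECTED="app.example.com=10.0.3.200 www.example.com=10.0.3.201"

# reads "name=resolved_ip" lines on stdin; feed it real data with:
#   for d in $DOMAINS; do echo "$d=$(dig +short "$d" | head -1)"; done | check_expected
check_expected() {
  while IFS='=' read -r name got; do
    case " $EXPECTED " in
      *" $name=$got "*) echo "OK: $name -> $got" ;;
      *) echo "ALERT: $name resolved to '$got'" ;;
    esac
  done
}

printf 'app.example.com=10.0.3.200\nwww.example.com=10.0.9.9\n' | check_expected
```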