Solution: DNS Resolution Taking 5+ Seconds Intermittently

Triage

  1. Check current DNS configuration:

    cat /etc/resolv.conf
    
    Note the order of nameservers.

  2. Test each nameserver individually with timing:

    dig @10.0.0.10 example.com +stats    # primary
    dig @10.0.0.11 example.com +stats    # secondary
    

  3. Check if the primary is reachable at all:

    ping -c 3 10.0.0.10       # may fail if ICMP is filtered, even when DNS works
    nc -zvw3 10.0.0.10 53     # tests TCP/53 only; most queries use UDP/53
    

  4. Capture DNS traffic to see the fallback pattern:

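    tcpdump -i eth0 -nn port 53 -w /tmp/dns.pcap &
    sleep 1; dig example.com        # give tcpdump a moment to start
    kill $!                         # stop the background capture
    tcpdump -nn -r /tmp/dns.pcap    # read the capture back
    
    Expected pattern: a query to the primary with no response, a pause for the timeout, then a query to the secondary and its response.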

  5. Test with explicit timeout to confirm:

    dig @10.0.0.10 example.com +timeout=2 +tries=1
    
    dig should report a timeout, for example "connection timed out; no servers could be reached" (the exact wording varies by dig version).
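
Steps 1 and 2 can be combined into a small helper. The following is a minimal sketch, not a definitive tool: the function names list_nameservers and probe_nameserver are hypothetical, and the 2-second timeout simply mirrors step 5.

```shell
# Print the nameserver addresses from a resolv.conf-style file, in file order.
list_nameservers() {
    # $1: path to the file (defaults to /etc/resolv.conf)
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query one server once with a short timeout and report the outcome.
probe_nameserver() {
    # $1: server IP, $2: name to look up (example.com is a placeholder)
    if out=$(dig "@$1" "${2:-example.com}" +timeout=2 +tries=1 2>&1); then
        printf '%s\t%s\n' "$1" \
            "$(printf '%s\n' "$out" | awk '/Query time:/ { print $4, $5 }')"
    else
        printf '%s\tno response\n' "$1"
    fi
}

# Usage: probe every configured server in the order the resolver tries them.
# for ns in $(list_nameservers); do probe_nameserver "$ns"; done
```

A server that prints "no response" while the others answer in a few milliseconds is the likely source of the delay.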

Root Cause

The primary DNS server (10.0.0.10), listed first in /etc/resolv.conf, is unreachable: it was either decommissioned or is blocked by a network or firewall problem. The resolver library sends each query to the primary first, waits for the default timeout (5 seconds in glibc), then falls back to the secondary (10.0.0.11), which responds normally.

This adds a 5-second penalty to every DNS lookup that is not served from a cache. Cached lookups appear fast because they are answered (by the application, nscd, or a local caching resolver) without contacting any nameserver. As cache entries expire, more queries hit the timeout path, which explains why the delay looks intermittent.
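
The size of the penalty can be worked out from the configuration. The sketch below assumes glibc's sequential behavior (no rotate option) and its 5-second default when no "options timeout:N" line is present; resolver_timeout and fallback_delay are hypothetical helper names.

```shell
# Effective per-server timeout from a resolv.conf-style file.
# Falls back to 5 seconds when no "options timeout:N" is present.
resolver_timeout() {
    # $1: path to the file (defaults to /etc/resolv.conf)
    awk '
        $1 == "options" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^timeout:[0-9]+$/) { split($i, kv, ":"); t = kv[2] }
        }
        END { print (t == "" ? 5 : t) }
    ' "${1:-/etc/resolv.conf}"
}

# Seconds spent waiting before the first live server is asked.
fallback_delay() {
    # $1: number of dead servers listed before the first live one
    # $2: per-server timeout in seconds
    echo $(( $1 * $2 ))
}
```

With one dead server ahead of the live one and the default timeout, fallback_delay 1 "$(resolver_timeout)" gives the 5 seconds seen in this incident.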

Fix

Immediate -- Update resolv.conf:

    # Remove or comment out the dead primary, promote secondary
    nameserver 10.0.0.11
    nameserver 10.0.0.12

If managed by DHCP or a configuration management tool, update the DHCP server options or Puppet/Ansible/Chef configuration to distribute corrected nameservers.
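
For a hand-managed file, the edit can be scripted so that a backup is always taken first. This is a minimal sketch under that assumption; disable_nameserver is a hypothetical helper, and it comments the dead entry out rather than deleting it.

```shell
# Comment out one nameserver in a resolv.conf-style file, keeping a backup.
disable_nameserver() {
    # $1: file to edit, $2: IP address of the dead server
    file=$1
    dead=$2
    # Keep the original next to the file as <file>.bak.
    cp -p "$file" "$file.bak" || return 1
    # Rewrite from the backup so the file is never read and written at once.
    awk -v ip="$dead" '
        $1 == "nameserver" && $2 == ip { print "# " $0; next }
        { print }
    ' "$file.bak" > "$file"
}
```

Run as root, disable_nameserver /etc/resolv.conf 10.0.0.10 leaves the remaining nameserver lines in their original order, and copying /etc/resolv.conf.bak back restores the previous state.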

Long-term:

  1. Investigate the primary DNS server -- restore it or decommission it properly.
  2. Deploy a local caching resolver (systemd-resolved, dnsmasq, unbound) on each host so that a single upstream failure only causes a one-time penalty per TTL.
  3. Add monitoring/alerting for DNS resolution latency:

    dig @10.0.0.10 health.check +timeout=2 +tries=1
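
To turn that command into an alert, its output can be classified against a threshold. The sketch below is one possible shape: check_dns_latency is a hypothetical name, the 500 ms default is an assumed value, and the exit codes follow the common Nagios convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# Read dig output on stdin and classify the query time against a threshold.
check_dns_latency() {
    # $1: warning threshold in milliseconds (assumed default: 500)
    warn_ms=${1:-500}
    # dig prints a line such as ";; Query time: 23 msec" when it gets an answer.
    ms=$(awk '/Query time:/ { print $4; exit }')
    if [ -z "$ms" ]; then
        echo "CRITICAL: no answer"
        return 2
    elif [ "$ms" -gt "$warn_ms" ]; then
        echo "WARNING: ${ms} ms"
        return 1
    fi
    echo "OK: ${ms} ms"
}
```

Usage: dig @10.0.0.11 health.check +timeout=2 +tries=1 | check_dns_latency 500, repeated once per configured nameserver.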

Rollback / Safety

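  • Editing /etc/resolv.conf is non-disruptive. New processes use the change on their next lookup; long-running processes on glibc older than 2.26 read the file only once and may need a restart.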
  • If resolv.conf is managed by DHCP (check for "generated by" comments), direct edits will be overwritten on next DHCP renewal. Fix the DHCP server instead.
  • Keep a backup of the original resolv.conf before editing.
  • Verify resolution works after the change: dig example.com +short
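
Before editing, it helps to check whether something else owns the file. The following is a heuristic sketch only: resolv_conf_owner is a hypothetical helper, and it relies on the "generated by" comment convention mentioned above plus the fact that managed files are often symlinks.

```shell
# Guess what manages a resolv.conf-style file.
resolv_conf_owner() {
    # $1: path to inspect (defaults to /etc/resolv.conf)
    f=${1:-/etc/resolv.conf}
    if [ -L "$f" ]; then
        # A symlink usually means another service writes the real file.
        echo "symlink to $(readlink "$f")"
    elif grep -qi 'generated by' "$f"; then
        # Show the first marker comment so the owner is visible.
        grep -i 'generated by' "$f" | head -n 1
    else
        echo "no management marker found"
    fi
}
```

If the result is anything other than "no management marker found", a direct edit is likely to be overwritten and the owning service should be reconfigured instead.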

Common Traps

  • Editing /etc/resolv.conf when it is managed by NetworkManager or systemd-resolved; the edit gets overwritten. Configure the managing service instead.
  • Assuming the problem is with DNS itself rather than reachability of a specific server -- the secondary resolves fine, proving DNS data is correct.
  • Using the rotate option in resolv.conf as a "fix" -- this only distributes the pain across queries rather than eliminating it.
  • Not checking for IPv6 AAAA query issues that can also cause similar delays (dual-stack resolution problems).
  • Overlooking that /etc/nsswitch.conf controls whether DNS is even used (vs. files, mDNS, etc.).
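
The last trap is quick to rule out by reading the hosts line of /etc/nsswitch.conf. A minimal sketch, with hosts_sources and uses_dns as hypothetical helper names:

```shell
# Print the lookup sources on the "hosts:" line of an nsswitch.conf-style file.
hosts_sources() {
    # $1: path to inspect (defaults to /etc/nsswitch.conf)
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed only if "dns" is one of those sources.
uses_dns() {
    case " $(hosts_sources "$1") " in
        *" dns "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Sources are consulted left to right, so entries listed before dns (files, mDNS) can answer or delay a lookup before any nameserver is contacted.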