
Portal | Level: L2: Operations | Topics: DNS, TLS & PKI | Domain: Networking

Scenario: DNS Resolves Correctly but Application Fails to Connect

Situation

At 09:15 UTC, the development team reports that their Node.js API service, running in a Kubernetes pod, cannot connect to orders-db.internal.example.com. The on-call engineer runs dig from inside the pod and gets the correct IP address back. Yet the application logs show ENOTFOUND errors or connection timeouts to an unexpected address. The service was working yesterday, before a routine cluster DNS configuration change.

What You Know

  • dig orders-db.internal.example.com returns the correct A record (10.50.1.20) from inside the pod
  • The application logs show connections failing to an unexpected IPv6 address or timing out entirely
  • The issue started after a CoreDNS ConfigMap update was applied to the cluster
  • Other services in the same namespace can reach their databases without issues
  • The pod runs an Alpine-based Node.js 18 image

Investigation Steps

1. Compare dig output with actual application resolution behavior

Command(s):

# dig queries DNS directly and bypasses system resolver logic
dig orders-db.internal.example.com
dig AAAA orders-db.internal.example.com

# getent uses the system resolver (respects /etc/nsswitch.conf and /etc/hosts)
getent hosts orders-db.internal.example.com

# Check what the system resolver actually returns
getent ahosts orders-db.internal.example.com
What to look for: dig may return the correct A record, but getent ahosts might return an AAAA (IPv6) record first. Many applications use getaddrinfo() which follows nsswitch.conf ordering and may prefer IPv6. If an AAAA record exists that points to a stale or unreachable IPv6 address, the app tries that first and fails.
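The difference can be made concrete from application-level code. The sketch below uses Python's socket.getaddrinfo(), the same libc call most applications (including Node.js via libuv) go through; localhost stands in for the scenario's hostname, which is not resolvable outside that cluster.

```python
import socket

def resolution_order(host, port=443):
    """Return (address family, address) pairs in the order getaddrinfo()
    yields them. This ordering is what the application actually uses,
    unlike dig, which queries the DNS server directly."""
    results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [(family.name, sockaddr[0]) for family, _, _, _, sockaddr in results]

if __name__ == "__main__":
    # localhost stands in for orders-db.internal.example.com here.
    for family, addr in resolution_order("localhost"):
        print(family, addr)
```

If AF_INET6 entries are printed before AF_INET ones, the application will attempt the IPv6 address first, exactly the failure mode described above.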

2. Check resolver configuration files inside the pod

Command(s):

cat /etc/resolv.conf
cat /etc/nsswitch.conf
cat /etc/hosts
cat /etc/gai.conf 2>/dev/null
What to look for: In Kubernetes, /etc/resolv.conf contains ndots:5 by default. Any name with fewer than 5 dots gets the search domains appended first, so orders-db.internal.example.com (3 dots) is first tried as orders-db.internal.example.com.default.svc.cluster.local, then with .svc.cluster.local, then .cluster.local, before the absolute name is finally tried. Check whether /etc/hosts has a stale entry overriding DNS, whether nsswitch.conf lists files before dns (hosts file consulted before DNS), and whether /etc/gai.conf sets IPv6 preference rules. One caveat for this scenario: Alpine images use musl libc, which ignores /etc/nsswitch.conf and /etc/gai.conf entirely; musl always checks /etc/hosts first and queries A and AAAA in parallel, while still honoring the search and ndots options in resolv.conf.
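The expansion order described above can be modeled in a few lines. This is an illustrative sketch of glibc-style search-list behavior, not the actual resolver:

```python
def query_order(name, search_domains, ndots=5):
    """Model the order in which a glibc-style resolver tries names.

    A trailing dot marks the name as absolute: no search expansion.
    Otherwise, names with fewer than `ndots` dots get each search
    domain appended before the name is tried as-is.
    """
    if name.endswith("."):
        return [name]
    if name.count(".") >= ndots:
        # Enough dots: try the absolute name first, then search domains.
        return [name] + [f"{name}.{d}" for d in search_domains]
    # Too few dots: search domains first, absolute name last.
    return [f"{name}.{d}" for d in search_domains] + [name]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in query_order("orders-db.internal.example.com", search):
    print(candidate)
```

Running this shows the three cluster-suffixed candidates fired before the absolute name, which is exactly the query sequence tcpdump captures in the next step.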

3. Trace the actual DNS queries the system makes

Command(s):

# Watch real DNS queries leaving the pod
tcpdump -nn -i eth0 port 53

# In another shell, trigger a resolution the way the app does it
python3 -c "import socket; print(socket.getaddrinfo('orders-db.internal.example.com', 443))"

# Or with curl verbose to see resolution
curl -v --connect-timeout 5 http://orders-db.internal.example.com:5432/
What to look for: In the tcpdump output, count how many DNS queries fire before the correct one. With ndots:5, you will see queries for orders-db.internal.example.com.default.svc.cluster.local and the other search domain suffixes first. If any of those accidentally resolves (because a matching record exists in cluster DNS), the app connects to the wrong destination. Also look for both A and AAAA queries: if an AAAA record exists, getaddrinfo() may order the IPv6 address first regardless of which answer arrives sooner.
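To make the captured bytes easier to read, the sketch below builds the minimal RFC 1035 wire-format query that a resolver sends for an A or AAAA lookup. It is illustrative only and is never sent anywhere; real resolvers add EDNS and other options.

```python
import struct

# DNS record type codes (RFC 1035 for A, RFC 3596 for AAAA)
TYPE_A, TYPE_AAAA = 1, 28

def build_query(name, qtype, txid=0x1234):
    """Build a minimal DNS query packet for `name` in RFC 1035 wire format."""
    # Header: id, flags (RD bit set), qdcount=1, an/ns/ar counts zero
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question name: each label length-prefixed, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # qclass 1 = IN
    return header + question

pkt = build_query("orders-db.internal.example.com", TYPE_A)
print(len(pkt), pkt.hex())
```

Seeing the length-prefixed labels in the hex dump helps when reading tcpdump -X output: each search-expanded candidate appears as a distinct, longer qname.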

4. Test with a fully qualified domain name (trailing dot)

Command(s):

# FQDN with trailing dot bypasses search domain expansion
dig orders-db.internal.example.com.
python3 -c "import socket; print(socket.getaddrinfo('orders-db.internal.example.com.', 5432))"

# Check if ndots is the culprit by temporarily testing with ndots:1
# (Do not modify resolv.conf in prod; this is diagnostic only)
cat /etc/resolv.conf
What to look for: If the trailing-dot version works immediately and returns the correct address, then ndots search domain expansion is intercepting the lookup: one of the expanded names (for example, via a wildcard record under a search domain) resolves before the absolute name is ever tried.

Root Cause

The CoreDNS ConfigMap change adjusted search domains or ndots settings. With ndots:5 in /etc/resolv.conf, the pod's resolver appended the Kubernetes search domains to orders-db.internal.example.com before trying the absolute name. A record under one of the cluster search domains (such as a wildcard) matched the expanded query orders-db.internal.example.com.svc.cluster.local and returned a cluster IP (DNS has no partial matching; an expanded name either resolves fully or not at all). Simultaneously, an AAAA record from a previous IPv6 migration test was still present in the authoritative DNS, and getaddrinfo() preferred the IPv6 address, which was unreachable. The combination of search domain hijacking and the stale AAAA record meant the application never reached the real database at 10.50.1.20.

Fix

Immediate:

# Option 1: Use FQDN with trailing dot in application config
# Change connection string to: orders-db.internal.example.com.

# Option 2: Lower ndots for this specific pod in the deployment spec
# In the pod spec:
#   dnsConfig:
#     options:
#       - name: ndots
#         value: "2"

# Option 3: Remove the stale AAAA record from authoritative DNS
# (Coordinate with DNS team)

# Option 4: Force IPv4 in application code or environment
# NODE_OPTIONS="--dns-result-order=ipv4first"
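Option 4 can also be applied at the call site. The scenario's service is Node.js (hence the NODE_OPTIONS flag above), but the underlying idea of pinning resolution to IPv4 is language-agnostic; here is a hedged Python sketch of the pattern:

```python
import socket

def connect_ipv4_only(host, port, timeout=5.0):
    """Open a TCP connection using only A records.

    Pinning family to AF_INET sidesteps a stale AAAA record entirely:
    getaddrinfo() is asked for IPv4 results only, so an unreachable
    IPv6 address is never attempted.
    """
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    last_err = None
    for family, socktype, proto, _, sockaddr in infos:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError(f"no IPv4 address for {host} accepted a connection")
```

Treat this as a stopgap, not a fix: the stale AAAA record should still be removed from authoritative DNS (Option 3).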

Preventive:

  • Always use fully qualified domain names (with a trailing dot) for external services referenced from Kubernetes pods.
  • Set ndots to a lower value (1 or 2) in pod specs when pods primarily talk to external services, reducing unnecessary search domain queries.
  • Audit and clean up stale AAAA records regularly. If IPv6 is not in use, do not leave test AAAA records in production DNS.
  • Add a liveness check that verifies actual TCP connectivity to the database, not just DNS resolution.
  • Document the interaction between nsswitch.conf, ndots, and IPv6 preference for the team.
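The liveness check mentioned in the preventive list can be sketched as a probe that requires both a successful resolution and a completed TCP handshake. This is an illustration to adapt to your health-check framework, not a drop-in probe:

```python
import socket

def tcp_liveness(host, port, timeout=3.0):
    """Liveness probe: resolve AND complete a TCP handshake.

    Returns (ok, detail). A name that resolves but cannot be connected
    to (e.g. via a stale AAAA record) fails here, whereas a DNS-only
    check would pass.
    """
    try:
        addrs = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        return False, f"DNS failed: {err}"
    for family, socktype, proto, _, sockaddr in addrs:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True, f"connected to {sockaddr[0]}:{sockaddr[1]}"
        except OSError:
            continue
    return False, f"resolved but no address of {host} accepted a connection"
```

Wiring this into a Kubernetes livenessProbe (e.g. via an exec probe) would have caught this incident before users did, since resolution succeeded while connectivity did not.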

Common Mistakes

  • Trusting dig output as proof that "DNS is fine." dig bypasses nsswitch.conf, /etc/hosts, ndots search domain logic, and getaddrinfo() ordering. It is not what your application uses.
  • Ignoring AAAA records. Even if you think your infrastructure is IPv4-only, a stale AAAA record can cause getaddrinfo() to try IPv6 first, adding timeouts or connecting to the wrong host.
  • Not understanding ndots in Kubernetes. This is one of the most common DNS debugging blind spots. Names with fewer dots than the ndots value get search domains appended first.
  • Changing /etc/resolv.conf directly in a pod. Kubernetes will overwrite it on pod restart. Use dnsConfig in the pod spec instead.

Interview Angle

Q: dig returns the right answer but the application cannot connect. What is happening?

Good answer shape: Explain that dig queries the DNS server directly using its own resolver logic and does not go through the system's getaddrinfo() call. Applications use getaddrinfo(), which consults /etc/nsswitch.conf (checking /etc/hosts first if configured), respects /etc/resolv.conf search domains and ndots, and may prefer IPv6 (AAAA) over IPv4 (A). In Kubernetes, ndots:5 causes search domain expansion that can accidentally match cluster-internal services. The debugging approach is to use getent hosts instead of dig to see what the application actually resolves, check for AAAA records, inspect ndots and search domains, and test with a trailing-dot FQDN.

