Portal | Level: L2: Operations | Topics: DNS, TLS & PKI | Domain: Networking
Scenario: DNS Resolves Correctly but Application Fails to Connect¶
Situation¶
At 09:15 UTC, the development team reports that their Node.js API service running in a Kubernetes pod cannot connect to orders-db.internal.example.com. The engineer on call runs dig from inside the pod and gets the correct IP address back. Yet the application logs show ENOTFOUND or connection timeouts to what appears to be a wrong address. The service was working yesterday before a routine cluster DNS configuration change.
What You Know¶
- dig orders-db.internal.example.com returns the correct A record (10.50.1.20) from inside the pod
- The application logs show connections failing to an unexpected IPv6 address or timing out entirely
- The issue started after a CoreDNS ConfigMap update was applied to the cluster
- Other services in the same namespace can reach their databases without issues
- The pod is running Alpine-based Node.js 18 image
Investigation Steps¶
1. Compare dig output with actual application resolution behavior¶
Command(s):
# dig queries DNS directly and bypasses system resolver logic
dig orders-db.internal.example.com
dig AAAA orders-db.internal.example.com
# getent uses the system resolver (respects /etc/nsswitch.conf and /etc/hosts)
getent hosts orders-db.internal.example.com
# Check what the system resolver actually returns
getent ahosts orders-db.internal.example.com
What to look for: dig may return the correct A record, but getent ahosts might return an AAAA (IPv6) record first. Many applications use getaddrinfo(), which follows nsswitch.conf ordering and may prefer IPv6. If an AAAA record exists that points to a stale or unreachable IPv6 address, the app tries that address first and fails.
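The ordering question can be checked from any Python-capable shell: socket.getaddrinfo() is the same libc call most applications (including Node's dns.lookup) go through, so printing its results shows the address-family order the app will actually attempt. A minimal sketch, using localhost as a stand-in for the database hostname:

```python
import socket

def resolution_order(host, port):
    """Return (family, address) pairs in the order getaddrinfo yields them.

    This is the order a typical application attempts connections in:
    if an AF_INET6 entry comes first, the app tries IPv6 before IPv4.
    """
    results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [(fam.name, sockaddr[0]) for fam, _, _, _, sockaddr in results]

# "localhost" stands in for orders-db.internal.example.com here; on a
# host with IPv6 enabled you may see ::1 listed before 127.0.0.1.
for family, addr in resolution_order("localhost", 5432):
    print(family, addr)
```

If an AAAA result appears first for the real hostname, you have found the address the application tries (and fails on) before it ever reaches the A record that dig showed you.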
2. Check resolver configuration files inside the pod¶
Command(s):
# Inspect the resolver configuration the application actually uses
cat /etc/resolv.conf
cat /etc/hosts
cat /etc/nsswitch.conf
cat /etc/gai.conf
What to look for: In Kubernetes, /etc/resolv.conf contains ndots:5 by default. This means any name with fewer than 5 dots gets the search domains appended first. So orders-db.internal.example.com (3 dots) will first be tried as orders-db.internal.example.com.default.svc.cluster.local, then with .svc.cluster.local, then .cluster.local, before the absolute name is finally tried. Check whether /etc/hosts has a stale entry overriding DNS, whether nsswitch.conf has files dns ordering (hosts file checked before DNS), and whether /etc/gai.conf contains IPv6 preference rules. (Note: Alpine's musl libc does not read nsswitch.conf or gai.conf, so those two files only matter on glibc-based images.)
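The candidate list the resolver builds can be simulated directly. The function below is a sketch of the standard search rule (names with fewer dots than ndots get each search domain appended before the absolute name is tried); the search path shown is the Kubernetes default for a pod in the default namespace:

```python
def expansion_order(name, search_domains, ndots=5):
    """Simulate resolver search-list expansion for a hostname.

    A name ending in a dot is fully qualified and never expanded.
    A name with >= ndots dots is tried as-is first, then with the
    search domains; otherwise the search domains are tried first
    and the absolute name last.
    """
    if name.endswith("."):
        return [name]
    candidates = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + candidates
    return candidates + [name]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in expansion_order("orders-db.internal.example.com", search):
    print(candidate)
# With ndots:5, the three cluster-local names are queried before the real one.
```

Running it shows why a 3-dot external name generates three cluster-internal queries first, and why either a trailing dot or ndots:2 makes the absolute name win.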
3. Trace the actual DNS queries the system makes¶
Command(s):
# Watch real DNS queries leaving the pod
tcpdump -nn -i eth0 port 53
# In another shell, trigger a resolution the way the app does it
python3 -c "import socket; print(socket.getaddrinfo('orders-db.internal.example.com', 5432))"
# Or with curl verbose to see resolution
curl -v --connect-timeout 5 http://orders-db.internal.example.com:5432/
What to look for: With ndots:5, you will see queries for orders-db.internal.example.com.default.svc.cluster.local and other search domain suffixes first. If any of those accidentally resolves (because a matching record exists in cluster DNS), the app connects to the wrong destination. Also look for both A and AAAA queries — if the AAAA responds faster with a routable address, the app may prefer it.
4. Test with a fully qualified domain name (trailing dot)¶
Command(s):
# FQDN with trailing dot bypasses search domain expansion
dig orders-db.internal.example.com.
python3 -c "import socket; print(socket.getaddrinfo('orders-db.internal.example.com.', 5432))"
# Check if ndots is the culprit by temporarily testing with ndots:1
# (Do not modify resolv.conf in prod; this is diagnostic only)
cat /etc/resolv.conf
What to look for: If the trailing-dot FQDN resolves and connects correctly while the bare name does not, ndots search domain expansion is causing a conflict. A service named orders-db might exist in cluster DNS under one of the search domains, intercepting the lookup.
Root Cause¶
The CoreDNS ConfigMap change adjusted search domains and ndots settings. With ndots:5 in /etc/resolv.conf, the pod's resolver appended the Kubernetes search domains to orders-db.internal.example.com before trying the absolute name. A Kubernetes service named orders-db happened to exist in another namespace, so one of the search-expanded queries matched in cluster DNS and returned a cluster IP instead of the database address. At the same time, an AAAA record from a previous IPv6 migration test was still present in the authoritative DNS, and getaddrinfo() preferred the IPv6 address, which was unreachable. The combination of search domain hijacking and a stale AAAA record meant the application never reached the real database at 10.50.1.20.
Fix¶
Immediate:
# Option 1: Use FQDN with trailing dot in application config
# Change connection string to: orders-db.internal.example.com.
# Option 2: Lower ndots for this specific pod in the deployment spec
# In the pod spec:
#   dnsConfig:
#     options:
#       - name: ndots
#         value: "2"
# Option 3: Remove the stale AAAA record from authoritative DNS
# (Coordinate with DNS team)
# Option 4: Force IPv4 in application code or environment
# NODE_OPTIONS="--dns-result-order=ipv4first"
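Option 4 can also be enforced in application code rather than via environment flags: restricting getaddrinfo() to AF_INET guarantees only A records are used, so a stale AAAA record can never be attempted. A hedged Python sketch of that approach (the Node.js equivalent is the NODE_OPTIONS flag above; the hostname and port here are from this scenario):

```python
import socket

def connect_ipv4_only(host, port, timeout=5):
    """Open a TCP connection using only IPv4 (A record) addresses.

    Filtering getaddrinfo to AF_INET sidesteps stale AAAA records
    entirely; the application never attempts an IPv6 address.
    """
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError(f"no IPv4 address for {host}")

# Usage (placeholder values from this scenario):
# conn = connect_ipv4_only("orders-db.internal.example.com.", 5432)
```

This is a workaround, not a substitute for removing the stale AAAA record at the source.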
Preventive:
- Always use fully qualified domain names (with trailing dot) for external services referenced from Kubernetes pods.
- Set ndots to a lower value (1 or 2) in pod specs when pods primarily talk to external services, reducing unnecessary search domain queries.
- Audit and clean up stale AAAA records regularly. If IPv6 is not in use, do not leave test AAAA records in production DNS.
- Add a liveness check that verifies actual TCP connectivity to the database, not just DNS resolution.
- Document the interaction between nsswitch.conf, ndots, and IPv6 preference for the team.
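The liveness-check point above can be implemented as a probe that exercises the full resolve-and-connect path rather than DNS alone. A minimal sketch (the hostname and port in the comment are placeholders from this scenario):

```python
import socket

def tcp_liveness(host, port, timeout=3.0):
    """Return True only if a TCP connection to host:port succeeds.

    Unlike a DNS-only check, this goes through getaddrinfo() and a
    real connect(), so it catches stale AAAA records and search-domain
    mix-ups that dig alone would never surface.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Wired into a Kubernetes exec probe, a small script would call
# tcp_liveness("orders-db.internal.example.com.", 5432) and exit
# non-zero on failure so the kubelet restarts the pod.
```

Because create_connection() iterates over every address getaddrinfo() returns, the probe fails only when no returned address is reachable, which is exactly the condition the application experienced.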
Common Mistakes¶
- Trusting dig output as proof that "DNS is fine." dig bypasses nsswitch.conf, /etc/hosts, ndots search domain logic, and getaddrinfo() ordering. It is not what your application uses.
- Ignoring AAAA records. Even if you think your infrastructure is IPv4-only, a stale AAAA record can cause getaddrinfo() to try IPv6 first, adding timeouts or connecting to the wrong host.
- Not understanding ndots in Kubernetes. This is one of the most common DNS debugging blind spots. Names with fewer dots than the ndots value get search domains appended first.
- Changing /etc/resolv.conf directly in a pod. Kubernetes will overwrite it on pod restart. Use dnsConfig in the pod spec instead.
Interview Angle¶
Q: dig returns the right answer but the application cannot connect. What is happening?
Good answer shape: Explain that dig queries the DNS server directly using its own resolver logic and does not go through the system's getaddrinfo() call. Applications use getaddrinfo(), which consults /etc/nsswitch.conf (checking /etc/hosts first if configured), respects /etc/resolv.conf search domains and ndots, and may prefer IPv6 (AAAA) over IPv4 (A). In Kubernetes, ndots:5 causes search domain expansion that can accidentally match cluster-internal services. The debugging approach is to use getent hosts instead of dig to see what the application actually resolves, check for AAAA records, inspect ndots and search domains, and test with a trailing-dot FQDN.
Wiki Navigation¶
Prerequisites¶
- Networking Deep Dive (Topic Pack, L1)
Related Content¶
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — DNS, TLS & PKI
- Networking Deep Dive (Topic Pack, L1) — DNS, TLS & PKI
- AWS Route 53 (Topic Pack, L2) — DNS
- Case Study: BMC Clock Skew Cert Failure (Case Study, L2) — TLS & PKI
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — DNS
- Case Study: DNS Resolution Slow (Case Study, L1) — DNS
- Case Study: DNS Split Horizon Confusion (Case Study, L2) — DNS
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — TLS & PKI
- Case Study: SSL Cert Chain Incomplete (Case Study, L1) — TLS & PKI
- Case Study: User Auth Failing — OIDC Cert Expired, Cloud KMS Rotation (Case Study, L2) — TLS & PKI
Pages that link here¶
- DHCP & IP Address Management
- DNS Split-Horizon Confusion
- HTTP Protocol - Primer
- Networking Domain
- Primer
- Runbook: TLS Certificate Expiry
- Symptoms: DNS Looks Broken, TLS Is Expired, Fix Is in Cert-Manager
- Symptoms: User Auth Failing, OIDC Cert Expired, Fix Is Cloud KMS Rotation
- TLS & Certificates Ops - Primer
- TLS Works From Some Clients But Fails From Others