Answer Key: The 5% That Can't Resolve
The System
A payment processing platform with multiple microservices communicating over Kubernetes internal DNS:
            [payment-worker pods (namespace: payments)]
                 |                               |
    http://payment-api.payments:8080    http://inventory-api.inventory:8080
                 |                               |
       [payment-api service]            [inventory-api service]
       (namespace: payments)            (namespace: inventory)
All inter-service communication uses Kubernetes DNS short names (e.g., service.namespace). The resolver inside each pod expands these by appending search domains from /etc/resolv.conf, then sends the resulting queries to CoreDNS.
What's Broken
Root cause: The pod spec sets ndots: 15, far above the Kubernetes default of 5. The ndots option in /etc/resolv.conf controls when the resolver treats a name as absolute rather than relative: if a name has fewer dots than ndots, the resolver tries it with each search domain appended before trying the name as-is.
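The dot-counting rule can be sketched in a few lines of shell. This is a simplified model of a glibc-style resolver, not its actual implementation; `count_dots` and `lookup_mode` are hypothetical helper names.

```shell
# count_dots NAME: print the number of "." characters in NAME.
count_dots() {
    dots=$(printf '%s' "$1" | tr -cd '.')
    printf '%s\n' "${#dots}"
}

# lookup_mode NAME NDOTS: print which form of NAME is tried first.
lookup_mode() {
    name=$1 ndots=$2
    case $name in
        *.) echo "absolute-only"; return ;;   # trailing dot: search list is never used
    esac
    if [ "$(count_dots "$name")" -lt "$ndots" ]; then
        echo "search-first"                   # fewer dots than ndots
    else
        echo "as-is-first"                    # enough dots to be tried as written
    fi
}

lookup_mode payment-api.payments 15                     # search-first
lookup_mode payment-api.payments 2                      # search-first (still only 1 dot)
lookup_mode payment-api.payments.svc.cluster.local 2    # as-is-first (4 dots)
lookup_mode payment-api.payments.svc.cluster.local. 15  # absolute-only
```

Only the trailing-dot form is independent of the ndots setting.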
For the name payment-api.payments (1 dot, which is < 15):
1. Try payment-api.payments.default.svc.cluster.local -- NXDOMAIN (wrong namespace)
2. Try payment-api.payments.svc.cluster.local -- SUCCESS
3. Try payment-api.payments.cluster.local -- (skipped, already resolved)
4. Try payment-api.payments.ec2.internal -- (skipped)
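The candidate order in the steps above can be reproduced with a small sketch. It assumes the search list implied by those steps and models only the ordering, not timeouts or retries; `candidates` is a hypothetical helper.

```shell
# candidates NAME NDOTS SEARCH...: print the names a resolver would try, in order.
candidates() {
    name=$1 ndots=$2
    shift 2
    case $name in
        *.) printf '%s\n' "$name"; return ;;   # already absolute: one candidate only
    esac
    dots=$(printf '%s' "$name" | tr -cd '.')
    if [ "${#dots}" -ge "$ndots" ]; then
        printf '%s.\n' "$name"                 # enough dots: as-is goes first
    fi
    for domain in "$@"; do
        printf '%s.%s.\n' "$name" "$domain"    # one candidate per search domain
    done
    if [ "${#dots}" -lt "$ndots" ]; then
        printf '%s.\n' "$name"                 # too few dots: as-is goes last
    fi
}

candidates payment-api.payments 15 \
    default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
```

The first line printed is the default.svc.cluster.local expansion, and the bare name comes last as a fifth candidate.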
The problem: step 1 generates an NXDOMAIN response. In some cases, depending on the DNS client implementation, UDP packet ordering, and concurrency, the NXDOMAIN from step 1 arrives before the success from step 2, and the client treats it as a failure. This creates intermittent ~5% DNS failures.
Additionally, every lookup can fan out into several queries (one per search domain, plus the name as-is), loading CoreDNS with up to 5x the necessary queries and contributing to the 482,914 upstream forwards.
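A rough way to see where the 5x figure comes from is to count how many candidates are sent before one answers. The sketch below assumes the four search domains from the steps above; `queries_sent` and `expand` are hypothetical helpers, and example.com is a placeholder for any external name.

```shell
# queries_sent HIT: read candidate names on stdin in resolver order and print
# how many queries go out before HIT answers.
queries_sent() {
    awk -v hit="$1" '{ n++ } $0 == hit { exit } END { print n + 0 }'
}

# expand NAME: print NAME under each search domain, then as-is (the ndots: 15 order).
expand() {
    for domain in default.svc.cluster.local svc.cluster.local cluster.local ec2.internal; do
        printf '%s.%s.\n' "$1" "$domain"
    done
    printf '%s.\n' "$1"
}

# Internal name: the second candidate answers, so one query is wasted.
expand payment-api.payments | queries_sent payment-api.payments.svc.cluster.local.   # 2

# External name: only the as-is form exists, so four attempts are wasted first.
expand example.com | queries_sent example.com.                                       # 5
```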
Key clue: The CoreDNS log shows the query was for payment-api.payments.default.svc.cluster.local — note the default namespace insertion. This is the search domain expansion of payment-api.payments with the first search path (default.svc.cluster.local).
The Fix
Immediate
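Set a sane ndots value. This stops names with two or more dots from being expanded through the search list first; payment-api.payments has only one dot and is still expanded, so pair this with the FQDN option that follows the command: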
    kubectl patch deployment payment-worker -n payments --type='json' -p='[
      {"op":"replace","path":"/spec/template/spec/dnsConfig/options/0/value","value":"2"}
    ]'
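The patch above assumes that ndots is the first entry in dnsConfig.options. As a hedged sketch, a JSON Patch `test` operation (RFC 6902) can guard that assumption so the patch is rejected instead of overwriting a different option; `ndots_patch` is a hypothetical helper that only builds the patch text.

```shell
# ndots_patch INDEX VALUE: print a JSON Patch that first asserts that
# dnsConfig option INDEX is really "ndots", then replaces its value.
ndots_patch() {
    base="/spec/template/spec/dnsConfig/options/$1"
    printf '[{"op":"test","path":"%s/name","value":"ndots"},' "$base"
    printf '{"op":"replace","path":"%s/value","value":"%s"}]\n' "$base" "$2"
}

ndots_patch 0 2

# Applying it needs cluster access, so it is left commented out here:
# kubectl patch deployment payment-worker -n payments --type=json -p "$(ndots_patch 0 2)"
```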
Or use fully qualified domain names in the environment variables (with trailing dot):
    env:
      - name: PAYMENT_API_URL
        value: "http://payment-api.payments.svc.cluster.local.:8080/api/v1/charge"
Permanent
Fix the pod spec in the Helm values:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"  # was 15; names with fewer than 2 dots still try search domains first
Or better, use FQDNs in all service URLs. Without a trailing dot, these four-dot names are tried as-is first only when ndots is 4 or lower:
    env:
      - name: PAYMENT_API_URL
        value: "http://payment-api.payments.svc.cluster.local:8080/api/v1/charge"
      - name: INVENTORY_API_URL
        value: "http://inventory-api.inventory.svc.cluster.local:8080/api/v1/stock"
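Converting the short names to absolute ones is mechanical. A minimal sketch, assuming the cluster domain is cluster.local as in the URLs above; `to_fqdn` is a hypothetical helper, and its trailing dot makes the result independent of ndots.

```shell
# to_fqdn SERVICE.NAMESPACE [CLUSTER_DOMAIN]: print the absolute service name,
# including the trailing dot that disables search-list expansion.
to_fqdn() {
    printf '%s.svc.%s.\n' "$1" "${2:-cluster.local}"
}

to_fqdn payment-api.payments       # payment-api.payments.svc.cluster.local.
to_fqdn inventory-api.inventory    # inventory-api.inventory.svc.cluster.local.

# Building a full URL from it:
echo "http://$(to_fqdn payment-api.payments):8080/api/v1/charge"
```

Some HTTP clients pass the trailing dot through in the Host header, so it is worth checking that the receiving service accepts it before rolling this out.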
Verification
    # Verify ndots is updated
    kubectl exec -n payments deploy/payment-worker -- cat /etc/resolv.conf
    # Test resolution of short names
    kubectl exec -n payments deploy/payment-worker -- nslookup payment-api.payments
    # Monitor NXDOMAIN rate (should drop to near zero).
    # The stock CoreDNS image typically ships no shell or curl, so port-forward
    # the metrics port and query it from the workstation instead of using exec.
    kubectl port-forward -n kube-system deploy/coredns 9153:9153 &
    curl -s http://localhost:9153/metrics | grep NXDOMAIN
    # Check payment-worker error logs
    kubectl logs -n payments deploy/payment-worker --tail=50 | grep "DNS resolution failed"
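The first and third checks can be turned into numbers rather than eyeballed. The sketch below parses text on stdin, so it runs without a cluster; `ndots_of` and `nxdomain_percent` are hypothetical helpers, and the nameserver address and counter values in the samples are made up for illustration. It assumes the `coredns_dns_responses_total` counter with its `rcode` label, as exposed by the CoreDNS prometheus plugin.

```shell
# ndots_of: read resolv.conf text on stdin and print the ndots value
# (1, the resolver default, if no ndots option is present).
ndots_of() {
    awk '$1 == "options" { for (i = 2; i <= NF; i++) if ($i ~ /^ndots:/) v = substr($i, 7) }
         END { print (v == "" ? 1 : v) }'
}

# nxdomain_percent: read Prometheus text on stdin and print the NXDOMAIN share
# of coredns_dns_responses_total as a percentage.
nxdomain_percent() {
    awk 'index($0, "coredns_dns_responses_total{") == 1 {
             total += $NF
             if ($0 ~ /rcode="NXDOMAIN"/) nx += $NF
         }
         END { if (total > 0) printf "%.1f\n", 100 * nx / total; else print "0.0" }'
}

# Sample resolv.conf (illustrative values only).
printf '%s\n' \
    'search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal' \
    'nameserver 10.96.0.10' \
    'options ndots:2' | ndots_of            # 2

# Sample metrics (illustrative values only).
printf '%s\n' \
    'coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 950' \
    'coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 50' |
    nxdomain_percent                        # 5.0
```

Against a live cluster, the output of the `cat /etc/resolv.conf` and metrics commands above can be piped into the same functions. The counters are cumulative since CoreDNS started, so compare two samples taken some time apart rather than reading a single value.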
Artifact Decoder
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | FQDN resolves, short name fails; /etc/resolv.conf shows ndots:15 | CoreDNS pods are healthy and nslookup with FQDN works, making DNS look fine |
| Metrics | 5% NXDOMAIN rate and high upstream forward count = excessive DNS queries | Cache hit ratio looks acceptable, hiding the unnecessary query amplification |
| IaC Snippet | ndots: "15" in pod spec is the root cause; short service names in env vars trigger the bug | The service URLs look like standard Kubernetes patterns (service.namespace) |
| Log Lines | CoreDNS log shows the query was expanded to wrong namespace (.default.svc.cluster.local) | payment-worker success log shows FQDN worked; intermittent failures are hard to catch |
Skills Demonstrated
- Understanding Kubernetes DNS resolution and /etc/resolv.conf semantics
- Knowing how ndots and search domains interact
- Interpreting CoreDNS logs and NXDOMAIN responses
- Recognizing intermittent failures caused by DNS client behavior
- Understanding the performance impact of DNS query amplification