
Answer Key: The 5% That Can't Resolve

The System

A payment processing platform with multiple microservices communicating over Kubernetes internal DNS:

[payment-worker pods (namespace: payments)]
        |                       |
   http://payment-api.payments:8080     http://inventory-api.inventory:8080
        |                       |
  [payment-api service]    [inventory-api service]
   (namespace: payments)    (namespace: inventory)

All inter-service communication uses Kubernetes DNS short names (e.g., service.namespace). The resolver inside each pod expands these by appending the search domains from /etc/resolv.conf, and CoreDNS answers the resulting queries.

What's Broken

Root cause: The pod spec sets ndots: 15, which is extremely high. The ndots option in /etc/resolv.conf controls when the resolver treats a name as "fully qualified" vs "relative." If a name has fewer dots than ndots, the resolver appends each search domain before trying the name as-is.

For the name payment-api.payments (1 dot, which is < 15), the resolver tries:

  1. payment-api.payments.default.svc.cluster.local -- NXDOMAIN (wrong namespace)
  2. payment-api.payments.svc.cluster.local -- SUCCESS
  3. payment-api.payments.cluster.local -- (skipped, already resolved)
  4. payment-api.payments.ec2.internal -- (skipped)
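
The expansion order above can be reproduced with a small model of the ndots rule. This is a minimal sketch of glibc-style behavior, not the resolver's actual code; the function name candidate_names is hypothetical, and the search list is taken from the four domains listed above.

```python
def candidate_names(name, search, ndots):
    """Return the names a glibc-style resolver would try, in order."""
    # A trailing dot marks the name as absolute: no search expansion at all.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    # With at least ndots dots, the name is tried as-is first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    # Otherwise every search domain is tried before the bare name.
    return expanded + [absolute]


# Search list assumed from the expansion steps in this answer key.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

if __name__ == "__main__":
    for candidate in candidate_names("payment-api.payments", SEARCH, ndots=15):
        print(candidate)
```

With ndots=15 the first candidate is the default-namespace expansion that returns NXDOMAIN; a resolver stops at the first name that resolves, so later candidates are only reached on failure.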

The problem: step 1 generates an NXDOMAIN response. In some cases, depending on the DNS client implementation, UDP packet ordering, and concurrency, the NXDOMAIN from step 1 arrives before the success from step 2, and the client treats it as a failure. This creates intermittent ~5% DNS failures.

Additionally, every lookup can fan out into multiple queries (one per search domain, plus the bare name), loading CoreDNS with up to 5x the necessary queries and contributing to the 482,914 upstream forwards.
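
The amplification can be estimated by counting how many candidates a resolver sends before one resolves. The sketch below is a simplified model under stated assumptions: count_queries is a hypothetical helper, api.example.com stands in for any external name, and the existing set plays the role of the DNS records that actually exist.

```python
def count_queries(name, search, ndots, existing):
    """Count candidate names sent until one resolves (simplified model)."""
    if name.endswith("."):
        candidates = [name]
    else:
        expanded = [f"{name}.{domain}." for domain in search]
        absolute = f"{name}."
        if name.count(".") >= ndots:
            candidates = [absolute] + expanded
        else:
            candidates = expanded + [absolute]
    for sent, candidate in enumerate(candidates, start=1):
        if candidate in existing:
            return sent
    # Nothing resolved: every candidate was sent.
    return len(candidates)


SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]
# Hypothetical record set: one in-cluster service and one external host.
EXISTING = {
    "payment-api.payments.svc.cluster.local.",
    "api.example.com.",
}

if __name__ == "__main__":
    for ndots in (15, 2):
        n = count_queries("api.example.com", SEARCH, ndots, EXISTING)
        print(f"ndots={ndots}: {n} candidate(s) for api.example.com")
```

In this model an external name costs five candidates at ndots=15 and one at ndots=2. Real resolvers typically send an A and an AAAA query per candidate, so the absolute counts are likely double, but the ratio is the same.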

Key clue: The CoreDNS log shows the query was for payment-api.payments.default.svc.cluster.local — note the default namespace insertion. This is the search domain expansion of payment-api.payments with the first search path (default.svc.cluster.local).

The Fix

Immediate

Set a sane ndots value:

kubectl patch deployment payment-worker -n payments --type='json' -p='[
  {"op":"replace","path":"/spec/template/spec/dnsConfig/options/0/value","value":"2"}
]'

Or use fully qualified domain names in the environment variables (with trailing dot):

env:
  - name: PAYMENT_API_URL
    value: "http://payment-api.payments.svc.cluster.local.:8080/api/v1/charge"

Permanent

Fix the pod spec in the Helm values:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"    # was 15; names with fewer than 2 dots try the search domains first

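Or better, use FQDNs in all service URLs. These four-dot names are tried as-is first once ndots is 4 or lower; adding a trailing dot, as in the immediate fix, bypasses the search list regardless of ndots: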

env:
  - name: PAYMENT_API_URL
    value: "http://payment-api.payments.svc.cluster.local:8080/api/v1/charge"
  - name: INVENTORY_API_URL
    value: "http://inventory-api.inventory.svc.cluster.local:8080/api/v1/stock"
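
Whether a given service URL still goes through the search list depends on both its dot count and any trailing dot. The following is a rough sketch of that check; searched_first is a hypothetical helper, and it models only the ndots rule, not a full resolver.

```python
from urllib.parse import urlsplit


def searched_first(url, ndots):
    """Return True if the URL's host would be search-expanded before being tried as-is."""
    host = urlsplit(url).hostname or ""
    # A trailing dot makes the name absolute, whatever ndots is.
    if host.endswith("."):
        return False
    return host.count(".") < ndots


if __name__ == "__main__":
    urls = [
        "http://payment-api.payments:8080/api/v1/charge",
        "http://payment-api.payments.svc.cluster.local:8080/api/v1/charge",
        "http://payment-api.payments.svc.cluster.local.:8080/api/v1/charge",
    ]
    for url in urls:
        for ndots in (15, 2):
            print(ndots, searched_first(url, ndots), url)
```

Running it against the URLs from this answer key shows that only the trailing-dot form is independent of the ndots setting.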

Verification

# Verify ndots is updated
kubectl exec -n payments deploy/payment-worker -- cat /etc/resolv.conf

# Test resolution of short names
kubectl exec -n payments deploy/payment-worker -- nslookup payment-api.payments

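# Monitor NXDOMAIN rate (should drop to near zero)
# The CoreDNS image typically ships without curl, so scrape the metrics port through a port-forward
kubectl port-forward -n kube-system deploy/coredns 9153:9153 &
curl -s http://localhost:9153/metrics | grep NXDOMAIN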

# Check payment-worker error logs
kubectl logs -n payments deploy/payment-worker --tail=50 | grep "DNS resolution failed"
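
To turn the raw metrics output into a single number, the NXDOMAIN share can be computed from the coredns_dns_responses_total counter. This is a minimal sketch that assumes the metrics text has already been fetched; the helper names and the SAMPLE values are illustrative, not real measurements from this incident.

```python
import re
from collections import defaultdict

# Matches lines such as: coredns_dns_responses_total{rcode="NXDOMAIN",...} 50
LINE = re.compile(r'^coredns_dns_responses_total\{([^}]*)\}\s+(\S+)')
RCODE = re.compile(r'rcode="([^"]+)"')


def rcode_totals(metrics_text):
    """Sum the response counter per rcode across all servers and zones."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments and unrelated metrics
        rcode = RCODE.search(match.group(1))
        if rcode:
            totals[rcode.group(1)] += float(match.group(2))
    return dict(totals)


def nxdomain_ratio(metrics_text):
    """Return the NXDOMAIN share of all responses, or 0.0 if there are none."""
    totals = rcode_totals(metrics_text)
    all_responses = sum(totals.values())
    if not all_responses:
        return 0.0
    return totals.get("NXDOMAIN", 0.0) / all_responses


# Made-up sample in Prometheus text format, for illustration only.
SAMPLE = """\
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 950
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 50
"""

if __name__ == "__main__":
    print(f"NXDOMAIN share: {nxdomain_ratio(SAMPLE):.1%}")
```

The counters are cumulative since each CoreDNS pod started, so comparing two scrapes taken before and after the fix gives a more meaningful rate than a single snapshot.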

Artifact Decoder

| Artifact | What It Revealed | What Was Misleading |
|----------|------------------|---------------------|
| CLI Output | FQDN resolves, short name fails; /etc/resolv.conf shows ndots:15 | CoreDNS pods are healthy and nslookup with FQDN works, making DNS look fine |
| Metrics | 5% NXDOMAIN rate and high upstream forward count = excessive DNS queries | Cache hit ratio looks acceptable, hiding the unnecessary query amplification |
| IaC Snippet | ndots: "15" in pod spec is the root cause; short service names in env vars trigger the bug | The service URLs look like standard Kubernetes patterns (service.namespace) |
| Log Lines | CoreDNS log shows the query was expanded to wrong namespace (.default.svc.cluster.local) | payment-worker success log shows FQDN worked; intermittent failures are hard to catch |

Skills Demonstrated

  • Understanding Kubernetes DNS resolution and /etc/resolv.conf semantics
  • Knowing how ndots and search domains interact
  • Interpreting CoreDNS logs and NXDOMAIN responses
  • Recognizing intermittent failures caused by DNS client behavior
  • Understanding the performance impact of DNS query amplification

Prerequisite Topic Packs