- networking
- l1
- runbook
- dns
- networking-troubleshooting

Portal | Level: L1: Foundations | Topics: DNS, Kubernetes Networking, Networking Troubleshooting | Domain: Networking
Runbook: DNS Resolution Failure
| Field | Value |
|---|---|
| Domain | Networking |
| Alert | dns_lookup_failures_total > 0 for >2 min or service name not resolving |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, ability to exec into pods, cluster-admin role |
Quick Assessment (30 seconds)
# Run this first — it tells you the scope of the problem
kubectl get pods -n kube-system -l k8s-app=kube-dns
If output shows: all pods Running → CoreDNS itself is healthy; skip to Step 4 (check ConfigMap or service selectors)
If output shows: pods in CrashLoopBackOff or Pending → this is a CoreDNS crash; continue from Step 2
Step 1: Test DNS from Within a Pod
Why: DNS failures outside the cluster mean nothing — Kubernetes DNS only applies inside the cluster network.
# Exec into any running pod in the affected namespace
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- sh
# Once inside the pod, test lookup of an internal service
nslookup kubernetes.default.svc.cluster.local
# Also test an external name
nslookup google.com
# If nslookup is not available, use dig or curl
dig kubernetes.default.svc.cluster.local @<KUBE_DNS_SERVICE_IP>
Expected output:

Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      kubernetes.default.svc.cluster.local
Address:   10.96.0.1
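If no suitable pod exists in the affected namespace, one option is a throwaway test pod. The pod name and image below are illustrative (note that busybox's nslookup is known to misbehave on some image versions, so a dedicated DNS-tools image may be preferable):

```shell
# Launch a temporary pod just for DNS testing; it is deleted automatically
# when the command exits (--rm). Any image that ships nslookup/dig will do.
kubectl run dns-test -n <NAMESPACE> --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
```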
Step 2: Check CoreDNS Pod Status
Why: If CoreDNS pods are not running, all DNS resolution inside the cluster fails.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# Check events for crash reasons
kubectl describe pods -n kube-system -l k8s-app=kube-dns
Expected output (healthy):

NAME                      READY   STATUS    RESTARTS   AGE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          2d
coredns-xxxxxxxxx-yyyyy   1/1     Running   0          2d

If pods are Pending, inspect the node they failed to schedule on:

kubectl describe node <NODE_NAME>
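When pods are stuck Pending, the usual culprits are node taints or resource pressure. A quick triage, sketched with standard kubectl flags:

```shell
# Show taints on every node; Pending CoreDNS pods often trace back to these
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Check for recent scheduling failures in kube-system
kubectl get events -n kube-system --field-selector reason=FailedScheduling
```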
Step 3: Check CoreDNS Logs for Errors
Why: Logs reveal whether CoreDNS is crashing due to a config error, upstream failure, or resource exhaustion.
# Get logs from the first CoreDNS pod
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
# If there are multiple pods, check a specific one
kubectl logs -n kube-system <COREDNS_POD_NAME> --previous
Healthy startup output looks like:

[INFO] plugin/reload: Running configuration...
.:53
[INFO] plugin/ready: https://localhost:8181/ready
[INFO] Reloading complete

Look for lines marked [ERROR] or FATAL. A config parse error means the ConfigMap is broken: go to Step 4. An upstream timeout means the forwarding target is unreachable.
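To cut through noisy logs, filtering for error-class lines can help. One extra signature worth knowing is the loop plugin, which deliberately crashes CoreDNS when it detects a forwarding loop:

```shell
# Surface only error-class lines across all CoreDNS pods
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=500 \
  | grep -E '\[ERROR\]|\[FATAL\]|plugin/loop|i/o timeout'
# A "plugin/loop ... detected" message means the node's /etc/resolv.conf
# points back at cluster DNS, creating a forwarding loop (a common
# symptom on nodes running systemd-resolved).
```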
Step 4: Check CoreDNS ConfigMap
Why: A misconfigured Corefile will cause CoreDNS to crash or silently drop queries.
kubectl get configmap coredns -n kube-system -o yaml

Expected output:

data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
If the kubernetes block is missing or the forward directive points to a bad IP, edit the ConfigMap:
kubectl edit configmap coredns -n kube-system
# After saving, CoreDNS picks up the change automatically via the reload plugin
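If the reload plugin does not pick up the change (or the reload line is missing from the Corefile), a rollout restart forces CoreDNS to re-read the ConfigMap:

```shell
# Restart CoreDNS pods so they load the edited ConfigMap
kubectl -n kube-system rollout restart deployment coredns

# Wait until the new pods are Ready before re-testing DNS
kubectl -n kube-system rollout status deployment coredns --timeout=120s
```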
Step 5: Check Service Selectors Match Pod Labels
Why: If a service's selector does not match any pod labels, the service has no endpoints and DNS resolves to a name with no addresses.
# Get the service selector for the service that is failing
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}'
# List pods and their labels in the same namespace
kubectl get pods -n <NAMESPACE> --show-labels
# Check that endpoints exist for the service
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
If ENDPOINTS shows <none>, the selector does not match any pods. Fix the service selector or the pod labels.
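To compare the two directly, the selector JSON printed by the jsonpath query above can be converted into the key=value form that `kubectl get pods -l` accepts. `selector_to_labels` is a hypothetical helper written for this runbook; it assumes a flat selector with no spaces in keys or values:

```shell
# Hypothetical helper: turn {"app":"web","tier":"frontend"} (the jsonpath
# output format) into app=web,tier=frontend (the -l label-selector format).
selector_to_labels() {
  printf '%s' "$1" | tr -d '{}"' | sed 's/:/=/g'
}

selector_to_labels '{"app":"web","tier":"frontend"}'
```

Against a live cluster, feeding the helper's output back into kubectl shows exactly which pods (if any) the service selects:

```shell
sel=$(kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}')
kubectl get pods -n <NAMESPACE> -l "$(selector_to_labels "$sel")"
```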
Step 6: Verify kube-dns Service Exists and Has a ClusterIP
Why: Pods resolve DNS by querying the kube-dns ClusterIP. If this service is missing or has the wrong IP, all DNS breaks.
kubectl get svc kube-dns -n kube-system
# Verify the IP matches what pods use for DNS
kubectl get pods -n <NAMESPACE> <POD_NAME> -o jsonpath='{.spec.dnsConfig}'
# Also check the resolv.conf inside a pod
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- cat /etc/resolv.conf
If the nameserver in /etc/resolv.conf does not match the kube-dns ClusterIP, something modified the kube-dns service. Recreate it from the cluster's original manifests or restore from backup.
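A quick way to do that comparison is to parse the pod's resolv.conf and check it against the kube-dns ClusterIP. `first_nameserver` is a small illustrative helper; the resolv.conf content below is a typical sample:

```shell
# Print the first nameserver entry from resolv.conf content on stdin
first_nameserver() {
  awk '/^nameserver/ {print $2; exit}'
}

# Sample pod resolv.conf; in a real check, pipe in:
#   kubectl exec <POD_NAME> -n <NAMESPACE> -- cat /etc/resolv.conf
printf 'search default.svc.cluster.local svc.cluster.local cluster.local\nnameserver 10.96.0.10\noptions ndots:5\n' \
  | first_nameserver
# Compare the printed IP with:
#   kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```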
Verification
# Confirm the issue is resolved — exec into an affected pod and test
kubectl exec -it <POD_NAME> -n <NAMESPACE> -- nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local
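To verify more than one resolution path at once, a small loop over internal and external names (placeholders as in the rest of this runbook) can confirm the fix end to end:

```shell
# Test internal service DNS, cluster API DNS, and external DNS in one pass
for name in \
  kubernetes.default.svc.cluster.local \
  <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local \
  google.com
do
  if kubectl exec <POD_NAME> -n <NAMESPACE> -- nslookup "$name" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
done
```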
Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | Platform/Networking on-call | "Cluster-wide DNS failure, CoreDNS investigation ongoing, all service discovery down" |
| CoreDNS pods cannot be scheduled | Cluster Infrastructure team | "CoreDNS pods unschedulable, possible node or tainting issue, all DNS resolution down" |
| Scope expanding to multiple clusters | SRE lead | "Multi-cluster DNS failure, possible shared infrastructure issue (CNI/node networking)" |
Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
Common Mistakes

- Testing DNS from outside the cluster: `nslookup my-service` run on your laptop will fail because cluster DNS is internal-only. Always exec into a pod first.
- Checking the wrong namespace: a service in `namespace-a` is not visible as `my-service` from a pod in `namespace-b`; the FQDN must include the namespace: `my-service.namespace-a.svc.cluster.local`.
- Confusing the FQDN format: the correct format is `<service>.<namespace>.svc.cluster.local`. Missing the `svc` segment or using the wrong domain suffix causes lookup failures that look like CoreDNS bugs.
Tips and Gotchas

- The full internal DNS name is `<svc>.<namespace>.svc.cluster.local`. Short names work because of `ndots:5` and the search domains in `/etc/resolv.conf`. `ndots:5` means any name with fewer than 5 dots triggers search-domain expansion first; this can cause surprising lookup chains and latency.
- CoreDNS caches responses for 30s by default; after fixing a Service, DNS may still return stale data for up to 30 seconds.
- A NetworkPolicy with egress rules that omits port 53/UDP silently blocks DNS for affected pods.
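The ndots behavior above can be sketched as a toy resolver: under `ndots:5`, a name with fewer than five dots is tried against each search domain in order before being tried as an absolute name. `expand_search` is a hypothetical illustration written for this runbook, not part of any real resolver:

```shell
# Print the order in which a short name is looked up under ndots:5
expand_search() {
  name=$1; shift
  # Count the dots in the name
  dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')
  if [ "$dots" -lt 5 ]; then
    # Fewer than 5 dots: search domains are tried first
    for domain in "$@"; do
      echo "$name.$domain"
    done
  fi
  echo "$name."   # finally tried as an absolute name
}

expand_search my-service default.svc.cluster.local svc.cluster.local cluster.local
```

Even `google.com` has only one dot, so external lookups from a pod are first expanded through every search domain, which is exactly the latency the tip above warns about.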
Cross-References

- Topic Pack: Kubernetes DNS and Service Discovery (deep background)
- Related Runbook: Network Policy Block
- Exercise: training/interactive/exercises/levels/level-23/k8s-dns/
- Incident Scenario: training/interactive/incidents/scenarios/dns-bad-service-name.sh
Wiki Navigation

Related Content
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — DNS, Kubernetes Networking
- Ops Archaeology: The 5% That Can't Resolve (Case Study, L2) — DNS, Kubernetes Networking
- API Gateways & Ingress (Topic Pack, L2) — Kubernetes Networking
- AWS Route 53 (Topic Pack, L2) — DNS
- Case Study: CNI Broken After Restart (Case Study, L2) — Kubernetes Networking
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Networking
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — DNS
- Case Study: DNS Resolution Slow (Case Study, L1) — DNS
- Case Study: DNS Split Horizon Confusion (Case Study, L2) — DNS
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Kubernetes Networking
Pages that link here
- DHCP & IP Address Management
- DNS Operations
- DNS Operations - Street-Level Ops
- DNS Split-Horizon Confusion
- Decision Tree: Latency Has Increased
- Kubernetes Ops Domain
- Level 3: Production Kubernetes
- Operational Runbooks
- Ops Archaeology: The 5% That Can't Resolve
- Runbook: Load Balancer Health Check Failure
- Runbook: MTU Mismatch
- Runbook: Network Partition (Split Brain / Partial Connectivity)
- Runbook: NetworkPolicy Blocking Traffic
- Symptoms