On-Call Survival: Networking¶
Print this. Pin it. Read it at 3 AM.
Alert: DNS Resolution Failure¶
Severity: P1 (service-wide) / P2 (single service)
First command:
kubectl exec -n <ns> <pod> -- nslookup <hostname>
What you're looking for: NXDOMAIN (name does not exist), connection timed out (DNS server unreachable), or SERVFAIL.
Decision tree:
Is it NXDOMAIN (name not found)?
├── Yes → Wrong hostname in config? Check service name and namespace.
│ kubectl get svc -n <ns> | grep <name>
│ Internal: use <svc>.<ns>.svc.cluster.local
└── No → Is DNS server unreachable (timeout)?
├── Yes → kubectl get pods -n kube-system -l k8s-app=kube-dns
│ CoreDNS pods down? → kubectl rollout restart deploy/coredns -n kube-system
└── No → Is it SERVFAIL?
├── Yes → kubectl logs -n kube-system -l k8s-app=kube-dns | tail -30
│ Upstream DNS issue? → Check /etc/resolv.conf on node
└── No → Escalate to platform team: "DNS failing for <hostname> from pod <pod>"
Escalation trigger: CoreDNS pods cannot be restarted; node-level DNS broken; external DNS unreachable across all nodes.
Safe actions: nslookup, dig, check CoreDNS logs and pod status.
Dangerous actions: Restart CoreDNS deployment (brief DNS blip for all pods in cluster).
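The three branches of the tree above can be triaged mechanically from the resolver output. A minimal sketch (the function name and match strings are illustrative; compare them against what nslookup/dig actually prints in your cluster):

```shell
# Classify raw nslookup/dig output into the decision-tree branches.
classify_dns_failure() {
  local output="$1"
  case "$output" in
    *NXDOMAIN*)               echo "name-not-found" ;;      # wrong hostname/namespace
    *"connection timed out"*) echo "server-unreachable" ;;  # check CoreDNS pods
    *SERVFAIL*)               echo "upstream-failure" ;;    # check CoreDNS logs / node resolv.conf
    *)                        echo "unknown" ;;             # escalate with raw output
  esac
}

# Example: classify a lookup run from inside the affected pod
# classify_dns_failure "$(kubectl exec -n <ns> <pod> -- nslookup <hostname> 2>&1)"
```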
Alert: TLS Certificate Error (503/SSL handshake failure)¶
Severity: P1 (user-facing HTTPS broken)
First command:
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates
What you're looking for: the notAfter= date (is the cert expired?), and Verify return code: 0 vs. an error code.
Decision tree:
Is the certificate expired?
├── Yes → Is cert-manager managing it?
│ kubectl get cert,certificaterequest -n <ns>
│ kubectl describe cert <name> -n <ns> | grep -A 10 Status
│ Cert not renewing? → kubectl delete certificaterequest <name> -n <ns> (forces reissue)
└── No → Is the cert for the wrong hostname (CN mismatch)?
├── Yes → Wrong cert attached to ingress? Check ingress TLS spec.
│ kubectl get ingress -n <ns> -o yaml | grep -A 5 tls
└── No → Is the CA not trusted?
├── Yes → Self-signed CA? Add to trust store or use public CA.
└── No → Escalate: "TLS handshake failing on <hostname>, openssl output: <paste>"
Escalation trigger: Cert expired and cert-manager cannot renew (ACME challenge failing, DNS not delegated); load balancer TLS termination broken.
Safe actions: Check cert dates, describe cert-manager resources, read ingress TLS config.
Dangerous actions: Delete CertificateRequest (triggers re-issue, brief gap), modify ingress TLS spec.
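To turn the notAfter= line from openssl into a number you can act on at 3 AM, a small helper works. A sketch, assuming GNU date (on macOS, substitute gdate from coreutils); the function name is illustrative:

```shell
# Given the "notAfter=..." line from `openssl x509 -noout -enddate`,
# print whole days until expiry (negative means already expired).
days_until_expiry() {
  local not_after="${1#notAfter=}"   # strip the key, keep the date string
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$not_after" +%s)
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example:
# echo | openssl s_client -connect <host>:443 2>/dev/null \
#   | openssl x509 -noout -enddate \
#   | { read -r line; days_until_expiry "$line"; }
```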
Alert: Ingress 404 / 502 / 503¶
Severity: P1 (all traffic affected) / P2 (partial)
First command:
kubectl describe ingress <name> -n <ns>
What you're looking for: backend service name and port, misconfigured path rules, and the ingress class.
Decision tree:
Is it 404 (not found)?
├── Yes → Path mismatch in ingress rules? Compare URL to ingress path spec.
│ Service exists? kubectl get svc <backend-svc> -n <ns>
└── No → Is it 502/503 (bad gateway / service unavailable)?
├── Yes → Are backend pods running?
│ kubectl get pods -n <ns> -l <selector from svc>
│ No pods ready? → Fix pods first (see Kubernetes guide)
└── No (502 with pods running) → Is it a timeout?
├── Yes → App too slow? kubectl top pod; check app metrics.
│ Readiness probe failing? kubectl describe pod | grep Readiness
└── No → Check ingress controller logs:
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | tail -50
Escalate if controller errors: "Ingress 502 on <host>, pods healthy, controller logs: <paste>"
Escalation trigger: Ingress controller pods down or crashing; all backends reporting unhealthy; load balancer health checks failing.
Safe actions: Describe ingress, check pod readiness, read ingress controller logs.
Dangerous actions: Edit ingress rules (may break other routes), restart ingress controller.
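The status code alone tells you which branch of the tree to start from. A minimal sketch (function name is illustrative); pair it with `curl -s -o /dev/null -w '%{http_code}' https://<host>/<path>`:

```shell
# Map an HTTP status from the ingress to the decision-tree branch above.
ingress_branch() {
  case "$1" in
    404) echo "check-path-rules" ;;        # path mismatch or wrong backend service
    502) echo "check-backend-pods" ;;      # pods down, crashing, or too slow
    503) echo "check-backend-readiness" ;; # no ready endpoints behind the service
    *)   echo "other" ;;                   # check ingress controller logs
  esac
}
```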
Alert: Load Balancer / Service Unreachable¶
Severity: P1
First command:
kubectl get svc <svc-name> -n <ns>
What you're looking for: EXTERNAL-IP — is it <pending> or assigned? Is the port correct?
Decision tree:
Is EXTERNAL-IP <pending>?
├── Yes → Cloud load balancer not provisioned.
│ kubectl describe svc <svc-name> -n <ns> | grep -A 5 Events
│ Cloud API quota hit? Check cloud console. Escalate to infra.
└── No (IP assigned but unreachable)?
├── Network policy blocking? → kubectl get netpol -n <ns>
│ kubectl describe netpol <name> -n <ns>
└── Security group / firewall rule missing?
→ Check cloud console for LB security group inbound rules.
→ Escalate to infra: "LB IP <ip> assigned but port <port> unreachable"
Escalation trigger: Cloud provider API errors; network policy changes needed; firewall rule changes required.
Safe actions: Get/describe service, check network policies, check endpoints.
Dangerous actions: Delete and recreate service (new IP, breaks DNS/firewall rules), edit network policies.
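The first branch of the tree can be read straight off the `kubectl get svc` output. A sketch (function name is illustrative; field 4 assumes the default column layout NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE):

```shell
# Classify the EXTERNAL-IP column of `kubectl get svc <svc>` output.
lb_ip_state() {
  local external_ip
  external_ip=$(echo "$1" | awk 'NR==2 {print $4}')  # row 2: skip the header
  case "$external_ip" in
    "<pending>") echo "not-provisioned" ;;      # check Events, cloud quota
    "<none>")    echo "not-a-loadbalancer" ;;   # wrong service type?
    *)           echo "assigned:$external_ip" ;; # IP up but unreachable -> netpol/firewall
  esac
}

# Example:
# lb_ip_state "$(kubectl get svc <svc-name> -n <ns>)"
```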
Quick Reference¶
Most Useful Commands¶
# DNS lookup from inside a pod
kubectl exec -n <ns> <pod> -- nslookup <hostname>
kubectl exec -n <ns> <pod> -- dig <hostname>
# Check certificate dates
echo | openssl s_client -connect <host>:443 2>/dev/null | openssl x509 -noout -dates
# Check ingress rules
kubectl get ingress -A
kubectl describe ingress <name> -n <ns>
# Check service endpoints (are pods actually registered?)
kubectl get endpoints <svc> -n <ns>
# Check network policies
kubectl get netpol -n <ns>
# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
# Cert-manager certificate status
kubectl get cert -A
kubectl describe cert <name> -n <ns>
# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Trace a packet (from pod perspective)
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
Escalation Contacts¶
| Situation | Team | Channel |
|---|---|---|
| CoreDNS down | Platform | #infra-oncall |
| Cert-manager broken / ACME failure | Platform / Security | #infra-oncall |
| Cloud LB / firewall | Infra | #infra-oncall |
| Ingress controller crash | Platform | PagerDuty: platform-critical |
Safe vs Dangerous Actions¶
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| Describe ingress / service | Edit ingress routing rules |
| Read cert-manager status | Delete CertificateRequest |
| Check network policies | Edit network policies |
| Read ingress controller logs | Restart ingress controller |
| nslookup / dig from pods | Restart CoreDNS |
| Check endpoints | Delete and recreate LB service |