# kubectl Debugging Decision Flow

A systematic approach to diagnosing Kubernetes issues. Start at the top and follow the branch that matches your symptom.
START: Something is wrong
|
v
[1] CHECK PODS
kubectl get pods -n <ns>
kubectl get pods -n <ns> -o wide
|
+-- CrashLoopBackOff? --> [2a] LOGS
+-- ImagePullBackOff? --> [2b] IMAGE
+-- Pending? --> [2c] SCHEDULING
+-- Running but 0/1? --> [2d] PROBES
+-- Running 1/1? --> [3] SERVICE LAYER
|
v
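The routing in step [1] can be sketched as a small shell case statement. This is only an illustration: the sample value and the jsonpath shown in the comment assume hypothetical pod and namespace names.

```shell
# Sketch of step [1]'s routing. In practice the reason would come from
# something like (pod/namespace names are placeholders):
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
REASON="CrashLoopBackOff"   # sample value for illustration
case "$REASON" in
  CrashLoopBackOff)              NEXT="[2a] LOGS" ;;
  ImagePullBackOff|ErrImagePull) NEXT="[2b] IMAGE" ;;
  "")                            NEXT="[2d] PROBES or [3] SERVICE LAYER" ;;
  *)                             NEXT="kubectl describe pod" ;;
esac
echo "next step: $NEXT"
```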
[2a] LOGS (CrashLoopBackOff)
kubectl logs -n <ns> <pod> --previous
kubectl describe pod -n <ns> <pod>
|
+-- OOMKilled in Last State? --> check resource limits, increase memory
+-- Exit code 1? --> bad command/config, check entrypoint
+-- Liveness probe failed? --> adjust probe timing
See: training/library/runbooks/crashloopbackoff.md
training/library/runbooks/oomkilled.md
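Exit codes from the container's last terminated state follow the usual Unix convention: values above 128 mean the process was killed by a signal. A minimal sketch of decoding one (the sample value is illustrative; the real value comes from the jsonpath in the comment):

```shell
# Decode a container exit code. The real value would come from:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
EXIT_CODE=137   # sample: 128 + 9, i.e. SIGKILL -- typical of OOMKilled
if [ "$EXIT_CODE" -gt 128 ]; then
  SIGNAL=$((EXIT_CODE - 128))
  echo "killed by signal $SIGNAL (9 = SIGKILL, used by the OOM killer)"
else
  echo "application exited on its own with code $EXIT_CODE"
fi
```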
[2b] IMAGE (ImagePullBackOff)
kubectl describe pod -n <ns> <pod>
kubectl get deploy -n <ns> <name> -o jsonpath='{.spec.template.spec.containers[0].image}'
|
+-- Wrong tag? --> fix image tag in values/spec
+-- Local image? --> docker save <image> | k3s ctr images import -
+-- Private registry? --> create imagePullSecret
See: training/library/runbooks/kubernetes/imagepullbackoff.md
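A quick way to spot a typo in the tag is to split the image reference the deployment actually uses. A sketch with a hypothetical image name (note: this simple split breaks on registries with a port and on digest references):

```shell
# Split an image reference into repository and tag to spot typos.
IMAGE="registry.example.com/team/app:v1.2.3"   # hypothetical image
TAG="${IMAGE##*:}"    # everything after the last ':'
REPO="${IMAGE%:*}"    # everything before the last ':'
echo "repository: $REPO"
echo "tag:        $TAG"
```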
[2c] SCHEDULING (Pending)
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns>
|
+-- Insufficient CPU/memory? --> check resource requests, node capacity
+-- No matching node? --> check nodeSelector, tolerations
+-- PVC not bound? --> check PV/StorageClass
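For the "Insufficient CPU/memory" branch, the scheduler's check is simple arithmetic: the pod's request must fit in what remains of the node's allocatable capacity. A sketch with sample numbers (real values come from the Allocatable and Allocated sections of kubectl describe node):

```shell
# Why a pod stays Pending: its CPU request exceeds what's free on the node.
# Sample millicpu values for illustration; real ones come from
# `kubectl describe node <node>`.
NODE_ALLOCATABLE_MCPU=4000
NODE_ALLOCATED_MCPU=3600
POD_REQUEST_MCPU=500
FREE=$((NODE_ALLOCATABLE_MCPU - NODE_ALLOCATED_MCPU))
if [ "$POD_REQUEST_MCPU" -gt "$FREE" ]; then
  echo "unschedulable here: needs ${POD_REQUEST_MCPU}m, node has ${FREE}m free"
fi
```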
[2d] PROBES (Running 0/1)
kubectl describe pod -n <ns> <pod> | grep -A10 Readiness
kubectl exec -n <ns> <pod> -- wget -qO- http://localhost:<port><path>
kubectl get endpoints -n <ns>
|
+-- Wrong path/port? --> fix probe spec
+-- App not started yet? --> increase initialDelaySeconds
+-- Dependency missing? --> check dependent services
See: training/library/runbooks/kubernetes/readiness_probe_failed.md
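When tuning probe timing, it helps to know the budget you are granting: a container has roughly initialDelaySeconds plus periodSeconds times failureThreshold before the kubelet acts on a failing liveness probe. A sketch with sample settings:

```shell
# Approximate worst-case time before a failing liveness probe
# triggers a restart. Sample probe settings for illustration.
INITIAL_DELAY=10      # initialDelaySeconds
PERIOD=5              # periodSeconds
FAILURE_THRESHOLD=3   # failureThreshold
WORST_CASE=$((INITIAL_DELAY + PERIOD * FAILURE_THRESHOLD))
echo "roughly ${WORST_CASE}s before the kubelet restarts the container"
```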
[3] SERVICE LAYER
kubectl get svc -n <ns>
kubectl get endpoints -n <ns>
|
+-- 0 endpoints? --> [3a] SELECTOR MISMATCH
+-- Has endpoints? --> [4] DNS / NETWORK
|
v
[3a] SELECTOR MISMATCH
kubectl get svc <svc> -n <ns> -o yaml | grep -A5 selector
kubectl get pods -n <ns> --show-labels
|
+-- Labels don't match? --> fix service selector or pod labels
See: training/interactive/incidents/scenarios/service-selector-mismatch.sh
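The check behind this step is literal string matching: every key=value pair in the service's selector must appear in the pod's labels, or no endpoints are created. A simplified single-label sketch with hypothetical values:

```shell
# Does the service selector appear in the pod's labels? Simplified to one
# label; real values come from:
#   kubectl get svc <svc> -n <ns> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <ns> --show-labels
SELECTOR="app=web"                       # hypothetical service selector
POD_LABELS="app=web-v2,tier=frontend"    # hypothetical pod labels
case ",$POD_LABELS," in
  *",$SELECTOR,"*) RESULT="match: pod backs the service" ;;
  *)               RESULT="mismatch: no endpoints will be created" ;;
esac
echo "$RESULT"
```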
[4] DNS / NETWORK
kubectl run dns-test -n <ns> --rm -i --restart=Never --image=busybox:1.36 -- nslookup <svc>
kubectl exec -n <ns> <pod> -- wget -qO- --timeout=3 http://<svc>/health
|
+-- DNS fails? --> [4a] DNS
+-- Connection fails? --> [4b] NETWORK POLICY
+-- Works from pod? --> [5] HPA / SCALING
|
v
[4a] DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl exec -n <ns> <pod> -- cat /etc/resolv.conf
See: training/library/runbooks/dns_resolution.md
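When nslookup fails on a short name, try the fully qualified form: cluster DNS resolves services as <svc>.<ns>.svc.<cluster-domain>, and the search path in /etc/resolv.conf is what makes the short name work. A sketch of the name construction (values are illustrative; cluster.local is the common default domain):

```shell
# Build the FQDN cluster DNS serves for a service.
SVC="web"                      # hypothetical service name
NS="prod"                      # hypothetical namespace
CLUSTER_DOMAIN="cluster.local" # common default; check your cluster config
FQDN="$SVC.$NS.svc.$CLUSTER_DOMAIN"
echo "$FQDN"
```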
[4b] NETWORK POLICY
kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy -n <ns>
See: training/library/runbooks/kubernetes/networkpolicy_block.md
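The key semantics to remember: as soon as any NetworkPolicy selects a pod, that pod is isolated, and only traffic explicitly allowed by some policy gets through. A sketch of that decision with sample values:

```shell
# NetworkPolicy isolation model in miniature. Sample values; the real
# counts come from `kubectl get networkpolicy -n <ns>` plus reading
# each policy's podSelector and rules.
POLICIES_SELECTING_POD=1   # at least one policy selects the pod
RULES_ALLOWING_FLOW=0      # no rule permits this particular flow
if [ "$POLICIES_SELECTING_POD" -gt 0 ] && [ "$RULES_ALLOWING_FLOW" -eq 0 ]; then
  VERDICT="denied: pod is isolated and no rule allows this flow"
else
  VERDICT="allowed"
fi
echo "$VERDICT"
```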
[5] HPA / SCALING
kubectl get hpa -n <ns>
kubectl describe hpa <name> -n <ns>
kubectl top pods -n <ns>
|
+-- <unknown> metrics? --> check metrics-server, resource requests
+-- Not scaling up? --> check target utilization, stabilization window
See: training/library/runbooks/kubernetes/hpa_not_scaling.md
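For the "not scaling up" branch, it helps to run the HPA's core formula by hand: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch with sample utilization numbers:

```shell
# HPA scaling formula: desired = ceil(current * metric / target).
# Sample values for illustration; real ones come from `kubectl describe hpa`.
CURRENT_REPLICAS=3
CURRENT_UTIL=90    # observed average utilization (%)
TARGET_UTIL=60     # target utilization (%)
# Integer ceiling division: (a + b - 1) / b
DESIRED=$(( (CURRENT_REPLICAS * CURRENT_UTIL + TARGET_UTIL - 1) / TARGET_UTIL ))
echo "desired replicas: $DESIRED"
```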
[6] OBSERVABILITY
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
|
+-- Target down? --> [6a] PROMETHEUS
+-- No logs? --> [6b] LOKI
+-- No traces? --> [6c] TEMPO
|
v
[6a] PROMETHEUS TARGET
kubectl get servicemonitor -n <ns> -o yaml
kubectl get svc -n <ns> --show-labels
See: training/library/runbooks/prometheus_target_down.md
[6b] LOKI LOGS
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
kubectl logs -n monitoring <promtail-pod>
See: training/library/runbooks/observability/loki_no_logs.md
[6c] TEMPO TRACES
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
See: training/library/runbooks/observability/tempo_no_traces.md
## Quick Reference
| Symptom | First Command | Then |
|---|---|---|
| Pod not starting | `kubectl describe pod` | Check events section |
| App crashing | `kubectl logs --previous` | Check exit code |
| No traffic reaching app | `kubectl get endpoints` | Check selector match |
| DNS failures | `kubectl run dns-test ... -- nslookup <svc>` | Check CoreDNS |
| Metrics missing | `kubectl get servicemonitor -o yaml` | Compare labels |
| Logs missing | `kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail` | Check DaemonSet |
| HPA stuck | `kubectl describe hpa` | Check metrics-server |
| Helm broke things | `helm history`, then `helm rollback` | Fix values |