# kubectl Debugging Decision Flow

A systematic approach to diagnosing Kubernetes issues. Start at the top and follow the branch that matches your symptom.
START: Something is wrong
|
v
[1] CHECK PODS
kubectl get pods -n <ns>
kubectl get pods -n <ns> -o wide
|
+-- CrashLoopBackOff? --> [2a] LOGS
+-- ImagePullBackOff? --> [2b] IMAGE
+-- Pending? --> [2c] SCHEDULING
+-- Running but 0/1? --> [2d] PROBES
+-- Running 1/1? --> [3] SERVICE LAYER
|
v
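The routing in step [1] can be sketched as a small shell case statement. This is only an illustration: the sample value and the jsonpath shown in the comment assume hypothetical pod and namespace names.

```shell
# Sketch of step [1]'s routing. In practice the reason would come from
# something like (pod/namespace names are placeholders):
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
REASON="CrashLoopBackOff"   # sample value for illustration
case "$REASON" in
  CrashLoopBackOff)              NEXT="[2a] LOGS" ;;
  ImagePullBackOff|ErrImagePull) NEXT="[2b] IMAGE" ;;
  "")                            NEXT="[2d] PROBES or [3] SERVICE LAYER" ;;
  *)                             NEXT="kubectl describe pod" ;;
esac
echo "next step: $NEXT"
```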
[2a] LOGS (CrashLoopBackOff)
kubectl logs -n <ns> <pod> --previous
kubectl describe pod -n <ns> <pod>
|
+-- OOMKilled in Last State? --> check resource limits, increase memory
+-- Exit code 1? --> bad command/config, check entrypoint
+-- Liveness probe failed? --> adjust probe timing
See: training/library/runbooks/crashloopbackoff.md
training/library/runbooks/oomkilled.md
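Exit codes from the container's last terminated state follow the usual Unix convention: values above 128 mean the process was killed by a signal. A minimal sketch of decoding one (the sample value is illustrative; the real value comes from the jsonpath in the comment):

```shell
# Decode a container exit code. The real value would come from:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
EXIT_CODE=137   # sample: 128 + 9, i.e. SIGKILL -- typical of OOMKilled
if [ "$EXIT_CODE" -gt 128 ]; then
  SIGNAL=$((EXIT_CODE - 128))
  echo "killed by signal $SIGNAL (9 = SIGKILL, used by the OOM killer)"
else
  echo "application exited on its own with code $EXIT_CODE"
fi
```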
[2b] IMAGE (ImagePullBackOff)
kubectl describe pod -n <ns> <pod>
kubectl get deploy -n <ns> <name> -o jsonpath='{.spec.template.spec.containers[0].image}'
|
+-- Wrong tag? --> fix image tag in values/spec
+-- Local image? --> docker save <image> | k3s ctr images import -
+-- Private registry? --> create imagePullSecret
See: training/library/runbooks/kubernetes/imagepullbackoff.md
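A quick way to spot a typo in the tag is to split the image reference the deployment actually uses. A sketch with a hypothetical image name (note: this simple split breaks on registries with a port and on digest references):

```shell
# Split an image reference into repository and tag to spot typos.
IMAGE="registry.example.com/team/app:v1.2.3"   # hypothetical image
TAG="${IMAGE##*:}"    # everything after the last ':'
REPO="${IMAGE%:*}"    # everything before the last ':'
echo "repository: $REPO"
echo "tag:        $TAG"
```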
[2c] SCHEDULING (Pending)
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns>
|
+-- Insufficient CPU/memory? --> check resource requests, node capacity
+-- No matching node? --> check nodeSelector, tolerations
+-- PVC not bound? --> check PV/StorageClass
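For the "Insufficient CPU/memory" branch, the scheduler's check is simple arithmetic: the pod's request must fit in what remains of the node's allocatable capacity. A sketch with sample numbers (real values come from the Allocatable and Allocated sections of kubectl describe node):

```shell
# Why a pod stays Pending: its CPU request exceeds what's free on the node.
# Sample millicpu values for illustration; real ones come from
# `kubectl describe node <node>`.
NODE_ALLOCATABLE_MCPU=4000
NODE_ALLOCATED_MCPU=3600
POD_REQUEST_MCPU=500
FREE=$((NODE_ALLOCATABLE_MCPU - NODE_ALLOCATED_MCPU))
if [ "$POD_REQUEST_MCPU" -gt "$FREE" ]; then
  echo "unschedulable here: needs ${POD_REQUEST_MCPU}m, node has ${FREE}m free"
fi
```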
[2d] PROBES (Running 0/1)
kubectl describe pod -n <ns> <pod> | grep -A10 Readiness
kubectl exec -n <ns> <pod> -- wget -qO- http://localhost:<port><path>
kubectl get endpoints -n <ns>
|
+-- Wrong path/port? --> fix probe spec
+-- App not started yet? --> increase initialDelaySeconds
+-- Dependency missing? --> check dependent services
See: training/library/runbooks/kubernetes/readiness_probe_failed.md
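When tuning probe timing, it helps to know the budget you are granting: a container has roughly initialDelaySeconds plus periodSeconds times failureThreshold before the kubelet acts on a failing liveness probe. A sketch with sample settings:

```shell
# Approximate worst-case time before a failing liveness probe
# triggers a restart. Sample probe settings for illustration.
INITIAL_DELAY=10      # initialDelaySeconds
PERIOD=5              # periodSeconds
FAILURE_THRESHOLD=3   # failureThreshold
WORST_CASE=$((INITIAL_DELAY + PERIOD * FAILURE_THRESHOLD))
echo "roughly ${WORST_CASE}s before the kubelet restarts the container"
```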
[3] SERVICE LAYER
kubectl get svc -n <ns>
kubectl get endpoints -n <ns>
|
+-- 0 endpoints? --> [3a] SELECTOR MISMATCH
+-- Has endpoints? --> [4] DNS / NETWORK
|
v
[3a] SELECTOR MISMATCH
kubectl get svc <svc> -n <ns> -o yaml | grep -A5 selector
kubectl get pods -n <ns> --show-labels
|
+-- Labels don't match? --> fix service selector or pod labels
See: training/interactive/incidents/scenarios/service-selector-mismatch.sh
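The check behind this step is literal string matching: every key=value pair in the service's selector must appear in the pod's labels, or no endpoints are created. A simplified single-label sketch with hypothetical values:

```shell
# Does the service selector appear in the pod's labels? Simplified to one
# label; real values come from:
#   kubectl get svc <svc> -n <ns> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <ns> --show-labels
SELECTOR="app=web"                       # hypothetical service selector
POD_LABELS="app=web-v2,tier=frontend"    # hypothetical pod labels
case ",$POD_LABELS," in
  *",$SELECTOR,"*) RESULT="match: pod backs the service" ;;
  *)               RESULT="mismatch: no endpoints will be created" ;;
esac
echo "$RESULT"
```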
[4] DNS / NETWORK
kubectl run dns-test -n <ns> --rm -i --restart=Never --image=busybox:1.36 -- nslookup <svc>
kubectl exec -n <ns> <pod> -- wget -qO- --timeout=3 http://<svc>/health
|
+-- DNS fails? --> [4a] DNS
+-- Connection fails? --> [4b] NETWORK POLICY
+-- Works from pod? --> [5] HPA / SCALING
|
v
[4a] DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl exec -n <ns> <pod> -- cat /etc/resolv.conf
See: training/library/runbooks/dns_resolution.md
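When nslookup fails on a short name, try the fully qualified form: cluster DNS resolves services as <svc>.<ns>.svc.<cluster-domain>, and the search path in /etc/resolv.conf is what makes the short name work. A sketch of the name construction (values are illustrative; cluster.local is the common default domain):

```shell
# Build the FQDN cluster DNS serves for a service.
SVC="web"                      # hypothetical service name
NS="prod"                      # hypothetical namespace
CLUSTER_DOMAIN="cluster.local" # common default; check your cluster config
FQDN="$SVC.$NS.svc.$CLUSTER_DOMAIN"
echo "$FQDN"
```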
[4b] NETWORK POLICY
kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy -n <ns>
See: training/library/runbooks/kubernetes/networkpolicy_block.md
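The key semantics to remember: as soon as any NetworkPolicy selects a pod, that pod is isolated, and only traffic explicitly allowed by some policy gets through. A sketch of that decision with sample values:

```shell
# NetworkPolicy isolation model in miniature. Sample values; the real
# counts come from `kubectl get networkpolicy -n <ns>` plus reading
# each policy's podSelector and rules.
POLICIES_SELECTING_POD=1   # at least one policy selects the pod
RULES_ALLOWING_FLOW=0      # no rule permits this particular flow
if [ "$POLICIES_SELECTING_POD" -gt 0 ] && [ "$RULES_ALLOWING_FLOW" -eq 0 ]; then
  VERDICT="denied: pod is isolated and no rule allows this flow"
else
  VERDICT="allowed"
fi
echo "$VERDICT"
```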
[5] HPA / SCALING
kubectl get hpa -n <ns>
kubectl describe hpa <name> -n <ns>
kubectl top pods -n <ns>
|
+-- <unknown> metrics? --> check metrics-server, resource requests
+-- Not scaling up? --> check target utilization, stabilization window
See: training/library/runbooks/kubernetes/hpa_not_scaling.md
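For the "not scaling up" branch, it helps to run the HPA's core formula by hand: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch with sample utilization numbers:

```shell
# HPA scaling formula: desired = ceil(current * metric / target).
# Sample values for illustration; real ones come from `kubectl describe hpa`.
CURRENT_REPLICAS=3
CURRENT_UTIL=90    # observed average utilization (%)
TARGET_UTIL=60     # target utilization (%)
# Integer ceiling division: (a + b - 1) / b
DESIRED=$(( (CURRENT_REPLICAS * CURRENT_UTIL + TARGET_UTIL - 1) / TARGET_UTIL ))
echo "desired replicas: $DESIRED"
```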
[6] OBSERVABILITY
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
|
+-- Target down? --> [6a] PROMETHEUS
+-- No logs? --> [6b] LOKI
+-- No traces? --> [6c] TEMPO
|
v
[6a] PROMETHEUS TARGET
kubectl get servicemonitor -n <ns> -o yaml
kubectl get svc -n <ns> --show-labels
See: training/library/runbooks/prometheus_target_down.md
[6b] LOKI LOGS
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
kubectl logs -n monitoring <promtail-pod>
See: training/library/runbooks/observability/loki_no_logs.md
[6c] TEMPO TRACES
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl http://localhost:3200/ready
See: training/library/runbooks/observability/tempo_no_traces.md
## Quick Reference
| Symptom | First Command | Then |
|---|---|---|
| Pod not starting | `kubectl describe pod` | Check events section |
| App crashing | `kubectl logs --previous` | Check exit code |
| No traffic reaching app | `kubectl get endpoints` | Check selector match |
| DNS failures | `kubectl run dns-test ... -- nslookup <svc>` | Check CoreDNS |
| Metrics missing | `kubectl get servicemonitor -o yaml` | Compare labels |
| Logs missing | `kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail` | Check DaemonSet |
| HPA stuck | `kubectl describe hpa` | Check metrics-server |
| Helm broke things | `helm history`, then `helm rollback` | Fix values |