Portal | Level: L2: Operations | Topics: Kubernetes Debugging, Kubernetes Core | Domain: Kubernetes
Kubernetes Debugging Playbook - Primer¶
Why This Matters¶
Kubernetes does not tell you what is wrong. It tells you what it tried and where it stopped. Your job is to read the signals and trace back to root cause. Most K8s debugging follows predictable patterns once you know where to look.
Remember: The K8s debugging order: "SIRCN" — Scheduling, Image pull, Runtime, Config, Network. Each stage produces distinct error statuses. If the pod is Pending, it is stuck at Scheduling. If it is ImagePullBackOff, it is stuck at Image. If it is CrashLoopBackOff, the container starts but the Runtime/Config is wrong. If the pod is Running but not working, it is a Network or application issue.
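The SIRCN triage above can be sketched as a status-to-stage lookup (a simplified model of the mnemonic only; real triage also reads the Events output, and the function name is illustrative):

```shell
# Map a pod status to the SIRCN stage it is stuck at.
# Assumption: statuses and stages as given in the mnemonic above.
stage_for() {
  case "$1" in
    Pending)                        echo "Scheduling" ;;
    ErrImagePull|ImagePullBackOff)  echo "Image pull" ;;
    CrashLoopBackOff)               echo "Runtime/Config" ;;
    Running)                        echo "Network or application" ;;
    *)                              echo "Unknown" ;;
  esac
}

stage_for Pending            # prints: Scheduling
stage_for ImagePullBackOff   # prints: Image pull
```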
Core Concepts¶
1. The Failure Cascade¶
Kubernetes failures flow through layers: Scheduling → Image pull → Runtime → Config → Network (the SIRCN order above).
Each layer produces distinct signals. Start at the beginning and work forward.
2. Pod Not Starting: ImagePullBackOff¶
Under the hood: Kubernetes uses exponential backoff when retrying image pulls: 10s, 20s, 40s, up to 5 minutes. The status cycles between `ErrImagePull` (active failure) and `ImagePullBackOff` (waiting before retry). If you fix the issue (add imagePullSecrets, fix the tag), the pod recovers automatically on the next retry; you do not need to delete it.
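The retry cadence can be sketched as a doubling delay with a 5-minute cap (a simplified model of the kubelet's backoff, not its exact implementation, which also adds jitter):

```shell
# Simplified model of image-pull backoff: delay doubles from 10s,
# capped at 300s (5 minutes).
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "attempt ${attempt}: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```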
The kubelet cannot pull the container image.
Common causes: wrong image name or tag, private registry without imagePullSecrets, registry down or rate-limited, network policy blocking egress.
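Typical triage for the private-registry case, sketched with placeholders (the secret name `regcred` and registry fields are illustrative):

```shell
# Confirm the pull failure and its reason in the Events section.
kubectl describe pod <pod> -n <ns>

# Create a pull secret for a private registry, then reference it
# in the pod spec under imagePullSecrets.
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <ns>
```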
3. Pod Not Starting: CrashLoopBackOff¶
The container starts, exits, and Kubernetes keeps restarting it with exponential backoff.
Common causes and exit codes:
- 137 = OOMKilled (memory limit exceeded)
- 1 = application error (read the logs)
- 126 = permission denied on binary
- 127 = binary not found
Remember: Exit code mnemonic: "137 = OOM, 143 = TERM, 1 = app, 127 = not found." 137 = 128 + signal 9 (SIGKILL from OOM killer). 143 = 128 + signal 15 (SIGTERM from graceful shutdown). Any exit code above 128 means the process was killed by a signal — subtract 128 to get the signal number.
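The subtract-128 rule can be verified locally, no cluster needed (assumes a POSIX shell):

```shell
# Start a process, kill it with SIGKILL (signal 9), and read the
# exit status: 128 + 9 = 137, the same code an OOMKilled container reports.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid" && status=$? || status=$?
echo "exit=${status} signal=$((status - 128))"   # exit=137 signal=9
```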
If the container exits too fast for logs:
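A common workaround (a sketch with placeholder names; assumes you can run a copy of the image with its entrypoint overridden):

```shell
# Logs from the previous (crashed) container instance:
kubectl logs <pod> -n <ns> --previous

# If even --previous is empty, run a copy of the image with the
# entrypoint replaced so it stays up long enough to inspect:
kubectl run debug-copy -n <ns> --image=<same-image> --restart=Never \
  -- sleep 3600
```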
4. Pod Not Starting: Pending¶
The scheduler cannot place the pod on any node.
Common causes: insufficient CPU/memory for requests, node selector or affinity matching no nodes, taints without tolerations, unbound PVC, resource quotas exceeded.
5. The kubectl Debug Workflow¶
The universal first three commands:
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp
These answer 80% of questions. Read the Events section of describe output first.
Deeper investigation:
kubectl logs <pod> -c <container> -n <ns>
kubectl logs <pod> --previous --all-containers
kubectl logs -f <pod> -n <ns>
6. Exec and Ephemeral Debug Containers¶
Fun fact: Ephemeral debug containers (the `kubectl debug` command) reached beta in Kubernetes v1.23 and GA in v1.25. Before this feature, debugging distroless containers required building a new image with shell tools baked in, defeating the purpose of minimal images. Ephemeral containers share the PID and network namespaces of the target pod but have their own filesystem, so they can carry debugging tools without polluting the production image.
Get a shell in a running container:
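A minimal sketch with placeholder names (sh rather than bash, because many images lack bash):

```shell
# Open an interactive shell in the named container.
kubectl exec -it <pod> -n <ns> -c <container> -- sh
```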
Inside, check: `ps aux`, `env | sort`, `curl localhost:<port>/health`, `nslookup <service-name>`.
For distroless images without a shell:
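A sketch using busybox as an illustrative tool image (any image with a shell works):

```shell
# Attach an ephemeral debug container that shares the target
# container's process namespace.
kubectl debug -it <pod> -n <ns> --image=busybox:1.36 --target=<container>
```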
For node-level debugging:
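Again with busybox as an illustrative image:

```shell
# Spawn a debug pod on the node; the host filesystem is mounted
# at /host inside the pod.
kubectl debug node/<node> -it --image=busybox:1.36
```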
7. Resource Limits and OOMKill¶
When a container exceeds its memory limit, the kernel OOM killer terminates it (exit code 137).
Fix: increase memory limits, fix memory leaks, or check that requests match actual needs. Requests affect scheduling, limits affect runtime enforcement.
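A sketch of checking and adjusting memory settings (values and names are illustrative; `kubectl top` requires metrics-server):

```shell
# Compare actual usage against requests/limits.
kubectl top pod <pod> -n <ns>

# Raise the memory request/limit on a deployment.
kubectl set resources deployment/<name> -n <ns> \
  --requests=memory=256Mi --limits=memory=512Mi
```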
8. Network Debugging¶
Gotcha: Kubernetes DNS failures are one of the sneakiest debugging targets. If CoreDNS pods are not running (or are crashing), every service-to-service call fails with DNS resolution errors. Always check `kubectl get pods -n kube-system -l k8s-app=kube-dns` early in your debugging flow. If CoreDNS is unhealthy, fix that first; everything else is a symptom.
DNS resolution:
nslookup <service>.<namespace>.svc.cluster.local
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
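The nslookup above must run inside the cluster; one way is a throwaway pod (busybox is an illustrative image):

```shell
# Resolve the service from inside the cluster, then clean up.
kubectl run dnstest --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup <service>.<namespace>.svc.cluster.local
```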
Service connectivity:
kubectl get endpoints <service> -n <ns>
# Empty = no pods match the selector
kubectl get svc <svc> -n <ns> -o yaml
kubectl get pods -n <ns> --show-labels
NetworkPolicy:
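```shell
# List policies in the namespace, then inspect the ones whose
# podSelector matches the affected pod.
kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy <policy> -n <ns>
```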
If any policy selects a pod, all non-matching traffic is denied by default.
9. Node-Level Debugging¶
kubectl describe node <node>
# Check: MemoryPressure, DiskPressure, PIDPressure
journalctl -u kubelet --since "10 minutes ago"
kubectl top node
DiskPressure triggers pod eviction. NotReady means kubelet lost contact with the API server.
What Experienced People Know¶
- The Events section of `kubectl describe` is the single most useful output. Read it before anything else.
- `--previous` shows logs from the last crash. Without it you see current (possibly empty) logs.
- Exit code 137 = OOMKill. 143 = SIGTERM. 1 = app error.
- Empty Endpoints = no pods match the Service selector. Check labels carefully.
- DNS problems usually trace to CoreDNS, not app code.
- `kubectl debug` with ephemeral containers is the proper way to debug distroless images.
- Requests affect scheduling. Limits affect runtime. Set both. Requests without limits let pods consume unbounded node resources.
- `kubectl get events --sort-by=.lastTimestamp` shows the failure timeline.
- Pod stuck in Terminating = finalizer blocking or container ignoring SIGTERM.
- Increase verbosity with `kubectl get pods -v=6` to see the API calls being made.
Wiki Navigation¶
Prerequisites¶
- Kubernetes Exercises (Quest Ladder) (CLI) (Exercise Set, L1)
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core
Pages that link here¶
- Anti-Primer: Kubernetes Debugging Playbook
- Certification Prep: CKA — Certified Kubernetes Administrator
- Certification Prep: CKAD — Certified Kubernetes Application Developer
- Chaos Engineering & Fault Injection
- Kubernetes Debugging Playbook
- Master Curriculum: 40 Weeks
- Production Readiness Review: Study Plans
- Symptoms