Kubernetes Debugging -- Street Ops¶
The Pod Failure Triage Flowchart¶
Pod not working?
|
+-- What phase is it in? (kubectl get pod)
|
+-- Pending
| +-- Check events: scheduling failures?
| +-- Insufficient CPU/memory? -> Check resource requests vs node capacity
| +-- No matching nodes? -> Check nodeSelector, affinity, taints/tolerations
| +-- PVC not bound? -> Check StorageClass, PV availability
|
+-- ImagePullBackOff / ErrImagePull
| +-- Wrong image name/tag? -> Check spelling, registry URL
| +-- Auth failure? -> Check imagePullSecrets, registry credentials
| +-- Image doesn't exist? -> Verify in registry directly
|
+-- CrashLoopBackOff
| +-- Check logs: kubectl logs <pod> --previous
| +-- App crashing on startup? -> Config error, missing env var, bad mount
| +-- OOMKilled? -> Check last state in describe output, raise memory limit
| +-- Liveness probe failing? -> Check probe config, app health endpoint
|
+-- Running but not Ready
| +-- Readiness probe failing? -> Check probe endpoint, port, timing
| +-- App slow to start? -> Adjust initialDelaySeconds or use startupProbe
|
+-- Running and Ready but not working
+-- Service selector mismatch? -> Compare labels
+-- Wrong port in service? -> Check targetPort vs containerPort
+-- NetworkPolicy blocking traffic? -> Check policies in namespace
+-- DNS issues? -> Test with nslookup from inside pod
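The last branch (Running and Ready but not working) usually comes down to wiring between a Service and its pods. A minimal sketch, with hypothetical names (`web`, port `8080`), showing the three fields that must line up: the Service `selector` must match the pod's labels, and the Service `targetPort` must match a `containerPort` on the pod:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web             # hypothetical name
spec:
  selector:
    app: web            # must match the pod's labels
  ports:
    - port: 80          # port clients hit on the Service
      targetPort: 8080  # must match a containerPort on the pod
---
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web            # matched by the Service selector above
spec:
  containers:
    - name: app
      image: example.com/web:v1   # placeholder image
      ports:
        - containerPort: 8080     # matched by targetPort above
```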
What to Check First -- Decision Tree¶
Step 1: Get the status.
The STATUS and RESTARTS columns tell you the category of problem.
Step 2: Read events.
Events are chronological. Read from the bottom (newest) up.
Step 3: Branch based on status.
Pending Pods¶
Pending means the scheduler cannot place the pod.
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check PVC status
kubectl get pvc -n <namespace>
Common causes:
- Resource requests exceed available node capacity
- All nodes tainted, pod has no matching toleration
- PVC references a nonexistent StorageClass
- nodeSelector matches no nodes
- Pod affinity/anti-affinity rules are unsatisfiable
Heuristic: If a pod has been Pending for more than 5 minutes, the scheduler has given up retrying quickly. Check events for the FailedScheduling reason.
Debug clue: When `describe pod` shows "0/3 nodes are available: 3 Insufficient cpu," the bottleneck is resource requests, not actual usage. Nodes might be at 20% real CPU but 100% allocated. Check the gap between allocated and actual with `kubectl top nodes` vs `kubectl describe node | grep -A5 "Allocated"`. If there is a big gap, your resource requests are over-provisioned.
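Two of the Pending causes above are fixed in the pod spec itself. A hedged sketch (the taint key `dedicated=batch` and the request sizes are made up for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker        # hypothetical
spec:
  tolerations:              # lets the pod land on nodes tainted dedicated=batch:NoSchedule
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: worker
      image: example.com/worker:v1   # placeholder
      resources:
        requests:           # what the scheduler reserves; must fit on some node
          cpu: 250m
          memory: 256Mi
```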
CrashLoopBackOff¶
The container starts and exits repeatedly. Backoff timer increases: 10s, 20s, 40s, up to 5 minutes.
# Current logs (if container is momentarily running)
kubectl logs <pod> -c <container>
# Previous crash logs (the money command)
kubectl logs <pod> -c <container> --previous
# Check exit code
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Exit code cheat sheet:
- 0: App exited cleanly but K8s expected it to keep running
- 1: Generic app error
- 2: Shell misuse (bad command in entrypoint)
- 126: Command not executable
- 127: Command not found (wrong entrypoint/cmd)
- 137: SIGKILL (OOMKilled or external kill)
- 139: SIGSEGV (segfault)
- 143: SIGTERM (graceful shutdown triggered)
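The signal-based codes follow the shell convention 128 + signal number, and several entries can be reproduced locally with no cluster at all — run a command, then read `$?`:

```shell
# Reproduce cheat-sheet exit codes locally: run a command, then read $?.
sh -c 'exit 0'          ; echo "clean exit:        $?"   # 0
sh -c 'exit 1'          ; echo "generic error:     $?"   # 1
sh -c 'no-such-command' ; echo "command not found: $?"   # 127
sh -c 'kill -TERM $$'   ; echo "SIGTERM:           $?"   # 143 (128+15)
sh -c 'kill -KILL $$'   ; echo "SIGKILL:           $?"   # 137 (128+9)
```

137 is the one to memorize: it is what a container reports after the kernel OOM killer (or any external SIGKILL) takes it down.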
ImagePullBackOff¶
# Check the exact image reference
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].image}'
# Check pull secrets
kubectl get pod <pod> -o jsonpath='{.spec.imagePullSecrets}'
# Verify secret exists and has correct data
kubectl get secret <secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Common causes:
- Typo in image name or tag
- Tag does not exist (someone pushed to latest, you reference v1.2.3)
- Private registry, no imagePullSecret configured
- Secret references wrong registry URL
- Image was deleted from registry
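For the private-registry case, the secret has to exist in the pod's namespace and be referenced by name. A sketch with hypothetical names (`regcred`, `registry.example.com`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app           # hypothetical
spec:
  imagePullSecrets:
    - name: regcred   # must exist as a kubernetes.io/dockerconfigjson secret in this namespace
  containers:
    - name: app
      image: registry.example.com/team/app:v1.2.3   # placeholder private image
```

The referenced secret is typically created with `kubectl create secret docker-registry regcred --docker-server=... --docker-username=... --docker-password=...`.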
OOMKilled¶
# Confirm OOM
kubectl describe pod <pod> | grep -A 3 "Last State"
# Look for: Reason: OOMKilled
# Check limits
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
Fix approach:
1. Check if app has a memory leak (monitor over time)
2. If legitimate usage, raise memory limit
3. Check if JVM/runtime has its own memory limit conflicting with container limit
4. For JVMs: -XX:MaxRAMPercentage=75.0 to respect cgroup limits
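Steps 2 and 4 land in the pod spec. A sketch with made-up sizes and a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app      # hypothetical
spec:
  containers:
    - name: app
      image: example.com/java-app:v1   # placeholder
      resources:
        requests:
          memory: 512Mi   # what the scheduler reserves
        limits:
          memory: 1Gi     # exceeding this gets the container OOMKilled (exit 137)
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:MaxRAMPercentage=75.0"   # JVM heap tracks the cgroup limit, not host RAM
```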
Eviction¶
# Check for eviction events
kubectl get events --field-selector reason=Evicted -n <namespace>
# Check node pressure
kubectl describe node <node> | grep -A 5 Conditions
Eviction happens when node runs low on disk, memory, or PIDs. Fix the resource hog or add node capacity.
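On the node side, eviction thresholds live in the kubelet configuration. A hedged fragment (the values shown are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                  # kubelet evicts pods immediately past these
  memory.available: "200Mi"
  nodefs.available: "10%"
  pid.available: "5%"
evictionSoft:
  memory.available: "500Mi"    # soft threshold tolerated for a grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"
```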
Reading Events Effectively¶
Events decay after 1 hour by default. If you need history, use a monitoring stack.
# All events, newest first
kubectl get events -n <ns> --sort-by='.lastTimestamp'
# Filter by pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
# Watch live
kubectl get events -n <ns> --watch
Key event reasons to watch for:
- FailedScheduling -- scheduling problem
- FailedMount / FailedAttachVolume -- storage issue
- Pulling / Pulled / Failed -- image pull lifecycle
- Unhealthy -- probe failures
- BackOff -- crash loop or image pull backoff
- Evicted -- resource pressure
- OOMKilling -- memory limit breach
Using Ephemeral Containers¶
When a container has no shell (distroless, scratch), you cannot exec into it. Ephemeral containers solve this.
# Attach a debug container to a running pod
kubectl debug -it <pod> --image=busybox --target=<container>
# Debug with a full toolkit
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>
# Copy the pod with a different image (for debugging crashed pods)
kubectl debug <pod> -it --copy-to=debug-pod --container=debug --image=busybox
# Debug a node
kubectl debug node/<node-name> -it --image=busybox
The --target flag shares the process namespace with the specified container, so you can see its processes and filesystem at /proc/1/root/.
Debugging Without Exec Access¶
When RBAC blocks exec or the container has no shell:
- Logs are your primary tool. Increase app log verbosity if you can (env var, configmap change).
- Ephemeral containers (GA since Kubernetes 1.25; behind a feature gate in earlier versions).
- Port-forward to test connectivity: `kubectl port-forward <pod> 8080:8080`, then `curl localhost:8080`.
- Temporary debug pod in the same namespace: `kubectl run tmp-debug -n <ns> --rm -it --image=busybox -- sh`
- Check configmaps and secrets the pod uses: `kubectl get configmap <name> -o yaml` and `kubectl get secret <name> -o yaml`
Node-Level Debugging with crictl¶
When kubectl is not giving you enough info, SSH to the node.
# List containers on this node
crictl ps -a
# Inspect a container
crictl inspect <container-id>
# Get container logs
crictl logs <container-id>
# List pods
crictl pods
# Check images
crictl images
# Check runtime status
crictl info
Use crictl when:
- Kubelet is unhealthy and kubectl cannot reach the node
- You need container-level details not exposed by the API
- Debugging runtime-level issues (containerd/CRI-O)
Common Pitfalls¶
- Not checking `--previous` logs. Current logs are empty because the container just restarted. The crash info is in the previous logs.
- Confusing resource requests and limits. Requests affect scheduling. Limits affect runtime killing. A pod can be scheduled (requests fit) but OOMKilled (exceeds limit at runtime).
- Forgetting the namespace. `kubectl get pods` shows the default namespace only. Always use `-n` or `-A`.
- Ignoring init containers. If an init container fails, the main container never starts. Check init container status and logs separately.
- Readiness vs liveness confusion. Readiness gates traffic. Liveness restarts the pod. A bad liveness probe causes restart loops. A bad readiness probe causes the pod to be excluded from service endpoints.
- Not checking service selectors. Pod is Running and Ready but the service returns 503. The service selector labels do not match the pod labels.
- DNS caching. If you changed a service, old DNS records may be cached in pods. The default TTL is 30s but some apps cache longer.
- Assuming the problem is in the pod. Sometimes the issue is the node (disk pressure, network, kubelet crash), the control plane (API server overloaded), or a network policy.
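The readiness-vs-liveness pitfall is easiest to see side by side in a container spec. A sketch with hypothetical endpoints and timings:

```yaml
containers:
  - name: app
    image: example.com/app:v1   # placeholder
    readinessProbe:             # failing => pod removed from Service endpoints (no restart)
      httpGet:
        path: /readyz
        port: 8080
      periodSeconds: 5
    livenessProbe:              # failing => kubelet restarts the container
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    startupProbe:               # holds off the other probes while the app boots
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30      # 30 * 10s = up to 5 minutes to start
      periodSeconds: 10
```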
One-liner: The two most frequently missed debugging steps: (1) checking `--previous` logs on a CrashLoopBackOff pod, and (2) checking whether a service's label selector actually matches the pod's labels. Together these account for roughly half of all "I can't figure out why it's broken" escalations.
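The selector rule itself is simple set logic: every key=value pair in the Service selector must appear in the pod's labels, and extra pod labels do no harm. A local sketch of that check, with made-up labels:

```shell
# A Service selects a pod iff its selector is a subset of the pod's labels.
selector="app=web,tier=frontend"
pod_labels="app=web,tier=frontend,version=v2"   # extra labels are fine

match=yes
for kv in $(echo "$selector" | tr ',' ' '); do
  case ",$pod_labels," in
    *",$kv,"*) ;;      # this selector pair is present on the pod
    *) match=no ;;     # missing pair: the Service will not route here
  esac
done
echo "match: $match"
```

Drop `tier=frontend` from `pod_labels` and this prints `match: no` — the 503-from-Service symptom in miniature.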
Decision Matrix: Which Tool When¶
| Symptom | First Tool | Second Tool |
|---|---|---|
| Pod not starting | `describe pod` | `get events` |
| Pod crashing | `logs --previous` | `describe pod` |
| Pod running, not working | `port-forward` + `curl` | `exec` or `debug` |
| Service unreachable | `get endpoints` | run debug pod |
| Node issues | `describe node` | `crictl` (SSH) |
| Slow pod | `exec` + top/strace | `kubectl top pod` |
| Intermittent failures | `get events --watch` | logs with timestamps |
The 5-Minute Incident Checklist¶
When paged about a K8s issue:
1. `kubectl get pods -n <ns> -o wide` -- scope the blast radius
2. `kubectl describe pod <broken-pod>` -- read the Events section
3. `kubectl logs <pod> --previous` -- if CrashLoopBackOff
4. `kubectl get events -n <ns> --sort-by='.lastTimestamp'` -- broader context
5. `kubectl top pods -n <ns>` -- resource usage (if metrics-server installed)
6. `kubectl get nodes` -- any nodes NotReady?
If none of that explains it, escalate to network/storage/control-plane investigation.