Kubernetes Debugging: When Pods Won't Behave
Tags: lesson, kubernetes-pod-lifecycle, container-debugging, probes, scheduling, resource-management, dns, storage, rbac, network-policies
Topics: Kubernetes pod lifecycle, container debugging, probes, scheduling, resource management, DNS, storage, RBAC, network policies
Level: L1-L2 (Foundations to Operations)
Time: 75-90 minutes
Prerequisites: None (everything is explained inline)
The Mission¶
You just joined the on-call rotation for a Kubernetes cluster running five microservices. It's
Monday morning. You open your terminal and run kubectl get pods -n production:
NAME READY STATUS RESTARTS AGE
cart-service-7f8b9c4d5-x2k9m 1/1 Running 0 3d
inventory-api-5d6e7f8a9-q4r7p 0/1 Pending 0 47m
order-processor-8a9b0c1d2-m3n4 0/1 CrashLoopBackOff 7 (3m ago) 22m
payment-gateway-3e4f5a6b7-h8j9 0/1 ImagePullBackOff 0 15m
shipping-tracker-9c0d1e2f3-k5l6 1/1 Running 0 6h
Five pods. Three of them are in trouble. One is stuck and never scheduled. One keeps crashing.
One can't even pull its image. And somewhere in this cluster, two more problems are hiding
that kubectl get pods won't tell you about.
Your job: triage all five failures. Systematically.
By the end of this lesson, you'll have a mental framework for looking at any sick pod and knowing exactly what to check, in what order, and why.
Part 1: The Debugging Ladder¶
Before we touch a single pod, let's talk about sequence. Kubernetes debugging has a natural order, and fighting the order wastes time.
Step 1: kubectl get pods -o wide → What's broken, where is it?
Step 2: kubectl describe pod <name> → Why is it broken? (Events section)
Step 3: kubectl logs <pod> --previous → What did the app say before it died?
Step 4: kubectl exec / kubectl debug → Get inside and look around
Step 5: kubectl get events --sort-by=... → What else happened in this namespace?
Step 6: kubectl describe node <name> → Is the node sick?
Remember: The debugging ladder mnemonic is GDLEEN: Get, Describe, Logs, Exec, Events, Node. Start at the top. Most problems resolve by step 2 or 3. If you're at step 6, something unusual is happening.
The instinct is to jump straight to logs. Resist it. kubectl describe pod gives you the
infrastructure story: scheduling decisions, image pulls, probe failures, mount errors.
Logs give you the application story. You need the infrastructure story first because if the
container never started, there are no logs.
Let's climb the ladder for each broken pod.
Part 2: The Pending Pod — inventory-api¶
Run kubectl describe pod inventory-api-5d6e7f8a9-q4r7p -n production and go straight to the Events section at the bottom:
Warning FailedScheduling 47m default-scheduler 0/4 nodes are available:
1 node(s) had untolerated taint {gpu=true: NoSchedule},
2 node(s) had insufficient cpu, 1 node(s) had insufficient memory.
There it is. The scheduler tried every node and none worked. Let's decode that message:
| Count | Reason | What it means |
|---|---|---|
| 1 node | untolerated taint gpu=true:NoSchedule | This node is reserved for GPU workloads; your pod doesn't have a matching toleration |
| 2 nodes | insufficient cpu | Your pod's CPU request exceeds the unallocated CPU on these nodes |
| 1 node | insufficient memory | Your pod's memory request exceeds the unallocated memory |
Notice it says "insufficient cpu" — not "the node is out of CPU." These are requests, not actual usage. The nodes might be at 15% real CPU but 100% allocated.
# See the gap between requested and actual
kubectl top nodes
kubectl describe node worker-1 | grep -A 5 "Allocated resources"
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 3800m (95%) 8000m (200%)
memory 6Gi (78%) 12Gi (150%)
95% of CPU requested but kubectl top shows the node at 22% actual CPU. This is the
classic over-provisioning trap: teams set CPU requests too high, the scheduler thinks the
cluster is full, but the actual workload is light.
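The scheduler never looks at live CPU; it only does bookkeeping on requests. A sketch of that arithmetic, with illustrative numbers matching the describe output above:

```shell
# Scheduler-style bookkeeping (sketch, illustrative numbers):
# requests are subtracted from allocatable; live usage never enters the math.
allocatable_m=4000   # 4-core worker node
requested_m=3800     # sum of existing pods' CPU requests (95%)
pod_request_m=2000   # inventory-api asks for 2 full cores
available_m=$((allocatable_m - requested_m))
echo "available: ${available_m}m, pod needs: ${pod_request_m}m"
if [ "$pod_request_m" -gt "$available_m" ]; then
  echo "FailedScheduling: insufficient cpu"
fi
```

The node could be idle in reality — the filter still rejects it, because only the 200m of unrequested CPU counts.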
Gotcha: Requests affect scheduling. Limits affect runtime enforcement. A pod can be
Pending (requests don't fit) even when the cluster has plenty of actual capacity. The fix is either to right-size requests (use VPA recommendations) or add node capacity.
What about that taint?¶
# List each node's taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME TAINTS
worker-1 <none>
worker-2 <none>
worker-3 [map[effect:NoSchedule key:gpu value:true]]
worker-4 <none>
The GPU node is tainted. Your pod doesn't have a toleration for it, so the scheduler skips it. If that node should run general workloads too, either remove the taint or add a toleration to the pod spec.
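If the pod genuinely belongs on that node, the toleration is a small pod-spec addition (a sketch matching the taint shown above):

```yaml
# Sketch: toleration matching the gpu=true:NoSchedule taint
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Note a toleration only permits scheduling onto the tainted node — it doesn't attract the pod there. Pair it with a nodeSelector or nodeAffinity if the pod must land on the GPU node.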
Fix: reduce CPU request or add capacity¶
# Before: over-provisioned
resources:
  requests:
    cpu: 2000m      # 2 full cores
    memory: 4Gi

# After: right-sized
resources:
  requests:
    cpu: 500m       # matches actual usage
    memory: 1Gi
  limits:
    cpu: 2000m      # can burst
    memory: 4Gi
Under the Hood: The scheduler runs two phases: filtering (eliminate nodes that can't fit the pod) and scoring (rank the remaining ones). Taints, affinity, and resource requests are all filters. If every node gets filtered out, the pod stays Pending forever. The scheduler retries but never relaxes constraints.
Flashcard check¶
| Question | Answer |
|---|---|
| A pod is Pending. What do you check first? | kubectl describe pod — look for FailedScheduling in Events. |
| What causes "0/N nodes are available: N insufficient cpu"? | The pod's CPU request exceeds available (unrequested) CPU on every node. |
| A tainted node shows in the FailedScheduling message. What's missing? | The pod spec needs a matching toleration. |
Part 3: The CrashLoopBackOff — order-processor¶
This is the one everyone dreads. The container starts, dies, and Kubernetes retries — slower each time. Seven restarts in 22 minutes. Let's understand the timing.
The exponential backoff¶
When a container crashes, the kubelet waits before restarting it:
Crash 1 → wait 10s → restart
Crash 2 → wait 20s → restart
Crash 3 → wait 40s → restart
Crash 4 → wait 80s → restart
Crash 5 → wait 160s → restart
Crash 6 → wait 300s → restart (capped at 5 minutes)
Crash 7 → wait 300s → restart (stays at 5 minutes)
...
The backoff doubles each time: 10s, 20s, 40s, 80s, 160s, then caps at 300s (5 minutes). After
10 minutes of running successfully, the counter resets. This is why you see (3m ago) in the
RESTARTS column — the pod is currently in the 5-minute waiting period before its next attempt.
Trivia: The exponential backoff with a 5-minute cap has been in Kubernetes since the very early versions. The choice of 5 minutes (not 10, not 2) is a pragmatic balance: long enough to avoid hammering a failing dependency, short enough that a transient issue resolves within a reasonable time once the fix is in place. The backoff timer lives in the kubelet's in-memory state — restarting the kubelet resets it.
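The doubling-with-cap schedule is easy to reproduce — a sketch of the kubelet's timing, not its actual code:

```shell
# Sketch: CrashLoopBackOff delay schedule — doubles from 10s, capped at 300s
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "Crash $crash -> wait ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

Crashes 6 and onward all print 300s — once you hit the cap, every retry cycle is five minutes.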
Now let's find out why it's crashing. Step 3 on the ladder — kubectl logs order-processor-8a9b0c1d2-m3n4 -n production --previous:
2026-03-23T08:14:22Z INFO Starting order-processor v3.2.1
2026-03-23T08:14:22Z INFO Connecting to RabbitMQ at amqp://rabbitmq.production.svc:5672
2026-03-23T08:14:23Z FATAL Cannot connect to RabbitMQ: connection refused
2026-03-23T08:14:23Z FATAL Exiting with code 1
The application can't reach RabbitMQ and exits. But is RabbitMQ actually down?
kubectl get endpoints rabbitmq -n production # → 10.244.2.15:5672 — RabbitMQ is healthy
kubectl run debug-net --rm -it --image=busybox -n production -- nc -zv 10.244.2.15 5672
# → open — other pods can connect fine
Something specific to order-processor is blocked. Time to check NetworkPolicy:
kubectl get networkpolicy -n production
# → restrict-egress app=order-processor 2h
kubectl describe networkpolicy restrict-egress -n production
The policy allows egress only on port 443/TCP. RabbitMQ is on port 5672. Someone created this policy two hours ago — right when the crashes started. The pod can't reach its broker.
Mental Model: NetworkPolicy works like a firewall — but with an important twist. If any NetworkPolicy selects a pod, all traffic not explicitly allowed is denied by default. This is the "default deny" behavior. A pod with zero NetworkPolicies is wide open. A pod with one NetworkPolicy that only allows port 443 is blocked on every other port.
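The fix is to extend the egress policy to cover the broker port. A sketch — it assumes the RabbitMQ pods carry the label app=rabbitmq, which you should confirm with kubectl get pods -n production --show-labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: order-processor
  policyTypes: ["Egress"]
  egress:
  - ports:
    - protocol: TCP
      port: 443
  - to:
    - podSelector:
        matchLabels:
          app: rabbitmq        # assumption: RabbitMQ pods carry this label
    ports:
    - protocol: TCP
      port: 5672
  - ports:                     # DNS — without this, the pod can't even
    - protocol: UDP            # resolve rabbitmq.production.svc
      port: 53
    - protocol: TCP
      port: 53
```

The DNS rule matters: an egress policy that only allows 443 and 5672 would let the pod connect by IP but fail every hostname lookup.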
Exit codes: your first clue¶
Before you even look at logs, the exit code narrows the search:
| Exit code | Signal | Meaning |
|---|---|---|
| 0 | — | Clean exit. If restartPolicy is Always, this still CrashLoops. |
| 1 | — | Generic application error. Read the logs. |
| 2 | — | Shell misuse (bad command in entrypoint). |
| 126 | — | Command exists but is not executable (permission issue). |
| 127 | — | Command not found (wrong entrypoint/CMD in Dockerfile). |
| 137 | SIGKILL (9) | OOMKilled or external kill. Check describe for reason. |
| 139 | SIGSEGV (11) | Segfault. Native code bug. |
| 143 | SIGTERM (15) | Graceful shutdown. Usually normal during rollouts. |
Remember: Any exit code above 128 = killed by a signal. Subtract 128 to get the signal number. 137 = 128 + 9 (SIGKILL). 143 = 128 + 15 (SIGTERM). This convention comes from the original Unix shell, not from Kubernetes.
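You can verify the 128-plus-signal convention in any shell:

```shell
# Exit code -> signal number: anything above 128 was signal-killed
exit_code=137
signal=$((exit_code - 128))
echo "signal $signal"          # 9
kill -l "$signal"              # prints the signal name: KILL

# And the reverse: a SIGKILLed process reports exit code 137 to its parent
sh -c 'kill -9 $$' || echo "exit code: $?"
```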
# Get the exit code without parsing describe output
kubectl get pod order-processor-8a9b0c1d2-m3n4 -n production \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Flashcard check¶
| Question | Answer |
|---|---|
| What are the CrashLoopBackOff backoff intervals? | 10s, 20s, 40s, 80s, 160s, 300s (capped at 5 minutes). Resets after 10 minutes of running. |
| Exit code 137 means what? | SIGKILL (128 + 9). Usually OOMKilled. |
| A pod CrashLoops with exit code 0. Why? | restartPolicy: Always (the default) restarts even on clean exit. Use OnFailure or Never for batch work. |
Part 4: The ImagePullBackOff — payment-gateway¶
Events show:
Warning Failed 15m kubelet Failed to pull image "registry.internal.corp/payments/gateway:v4.1.0":
rpc error: code = Unknown desc = failed to pull and unpack image: 401 Unauthorized
The key phrase: 401 Unauthorized. The image exists, but the node can't authenticate.
# What imagePullSecrets does the pod reference?
kubectl get pod payment-gateway-3e4f5a6b7-h8j9 -n production \
-o jsonpath='{.spec.imagePullSecrets}'
# → [{"name":"registry-creds"}]
# Does that secret exist?
kubectl get secret registry-creds -n production
# → Error from server (NotFound): secrets "registry-creds" not found
The secret is referenced but doesn't exist in this namespace. Someone created it in default
and forgot to copy it to production.
Gotcha: ImagePullSecrets are namespace-scoped. A secret in default is invisible to pods in production. This bites teams hard after namespace migrations. And the failure mode is ambiguous: a secret that exists but holds stale or wrong credentials produces the same 401 as a missing one — so check both that the secret exists and that its credentials are current.
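The fix is kubectl create secret docker-registry registry-creds -n production with the registry server, username, and password. What that command actually stores is a base64-wrapped .dockerconfigjson — a sketch with made-up credentials:

```shell
# What a docker-registry pull secret contains (sketch, fake credentials)
auth=$(printf 'ci-user:s3cret' | base64)
cat <<EOF
{"auths":{"registry.internal.corp":{"auth":"$auth"}}}
EOF
```

The kubelet base64-decodes the auth field back into user:password when it pulls. This is also why a stale password fails silently until the next pull: nothing validates the secret at creation time.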
ImagePullBackOff also uses exponential backoff (10s to 5 minutes). The status alternates
between ErrImagePull (active failure) and ImagePullBackOff (waiting). Once you fix the
issue, recovery is automatic on the next retry.
Under the Hood: When no imagePullSecrets are specified, the kubelet uses the node's container runtime credentials (~/.docker/config.json). This silently breaks when the autoscaler provisions new nodes — fresh nodes have no cached credentials. Always use explicit imagePullSecrets or a mutating webhook to inject them.
Part 5: The Silent Killer — OOMKilled¶
Cart-service looked healthy in this morning's kubectl get pods — Running, zero restarts. But users are reporting intermittent 502 errors. Let's look closer with kubectl describe pod:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 23 Mar 2026 02:14:00 +0000
Finished: Mon, 23 Mar 2026 08:01:12 +0000
Ready: True
Restart Count: 1
The container was OOM-killed at 08:01 — minutes after that first kubectl get pods — and the kubelet restarted it in place. One restart on a three-day-old pod is easy to overlook, and kubectl get pods still shows Running. The Last State section of describe is where the evidence lives.
# What are the resource limits?
kubectl get pod cart-service-7f8b9c4d5-x2k9m -n production \
-o jsonpath='{.spec.containers[0].resources}'
256Mi memory limit. Let's check actual usage:
kubectl top pod cart-service-7f8b9c4d5-x2k9m -n production
# → 231Mi
231Mi out of a 256Mi limit. That's 90% utilization. Any traffic spike or garbage collection pause will push it over.
Under the Hood: When a container exceeds its memory limit, it's not Kubernetes that kills it — it's the Linux kernel's OOM killer. The container's memory limit is enforced by a cgroup. The kernel tracks the cgroup's memory charge, and when unreclaimable memory (roughly the working set) exceeds the cgroup ceiling, the kernel sends SIGKILL (signal 9). That's why the exit code is 137 (128 + 9). There's no warning, no SIGTERM, no graceful shutdown. The process is dead.
Gotcha: kubectl top pod shows MEMORY(bytes), which corresponds to container_memory_working_set_bytes in Prometheus — this is what the OOM killer effectively evaluates. Do not use container_memory_usage_bytes for capacity planning — it includes reclaimable page cache and overstates true pressure.
The JVM trap¶
A JVM with -Xmx512m in a container limited to 256Mi tries to allocate more heap than the
cgroup allows and gets killed before it finishes starting. Fix: java -XX:MaxRAMPercentage=75.0 -jar app.jar — this sizes the heap relative to the container's cgroup limit, leaving 25% for non-heap memory (class metadata, thread stacks, NIO buffers).
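One way to wire that flag in without rebuilding the image is the JAVA_TOOL_OPTIONS environment variable, which the JVM reads automatically at startup. A sketch — the image name and resource numbers are illustrative:

```yaml
# Sketch: cgroup-aware JVM heap sizing via env var (illustrative values)
containers:
- name: cart-service
  image: registry.internal.corp/cart/service:v2   # hypothetical image
  env:
  - name: JAVA_TOOL_OPTIONS                        # picked up by the JVM at startup
    value: "-XX:MaxRAMPercentage=75.0"
  resources:
    requests:
      memory: 512Mi
    limits:
      memory: 512Mi    # heap caps at ~384Mi (75%), rest for metaspace/stacks/NIO
```

If you raise the memory limit later, the heap scales with it — no image change, no forgotten -Xmx.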
Part 6: The Probe Problem — shipping-tracker¶
Shipping-tracker is Running, zero restarts. But the monitoring dashboard shows it dropped off for 45 seconds this morning and came back. What happened?
kubectl get events -n production --field-selector involvedObject.name=shipping-tracker-9c0d1e2f3-k5l6
Events show Liveness probe failed: HTTP probe failed with statuscode: 503, followed by
Container shipping-tracker failed liveness probe, will be restarted. The probe config:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
# no startupProbe configured
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
The three probes and when each goes wrong¶
| Probe | Question it answers | Failure consequence | Common misconfiguration |
|---|---|---|---|
| Startup | "Has the app finished initializing?" | Blocks liveness/readiness until success | Not using one for slow-starting apps |
| Liveness | "Is the process alive and functional?" | Container is killed and restarted | Timeout too short, checking dependencies |
| Readiness | "Can this instance serve traffic right now?" | Removed from Service endpoints (no traffic) | Not distinguishing from liveness |
War Story: A team deployed a Java service that loaded ML models at startup (90 seconds). Liveness probe: initialDelaySeconds: 10, failureThreshold: 3, periodSeconds: 10. First check at 10s, fails. Second at 20s, fails. Third at 30s — killed. New container starts, takes 90s, killed at 30s. Infinite loop. Fix: a startup probe with failureThreshold: 30 and periodSeconds: 10 (300 seconds to start). Three lines of YAML.
Back to shipping-tracker. The liveness probe checks /healthz with a 1-second timeout.
A GC pause or heavy query during the check window times out the probe. Three timeouts
(failureThreshold: 3 at periodSeconds: 10) = 30 seconds, and the kubelet kills it.
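Time-to-kill is simple arithmetic, worth computing before you tune any probe:

```shell
# Sketch: how long consecutive liveness failures take to kill the container
periodSeconds=10
failureThreshold=3
echo "killed after $((periodSeconds * failureThreshold))s of consecutive failures"
```

With the tuned values discussed below (period 10, threshold 6), the same math gives a 60-second grace window — enough to ride out a GC pause without masking a truly hung process.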
# Better: startup probe for initialization, generous liveness timeout
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30   # Up to 300 seconds to start
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 5      # 5 seconds, not 1
  failureThreshold: 6    # 60 seconds before killing
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
Gotcha: Never make a liveness probe check external dependencies (database, message queue, downstream API). If the database is down, your liveness probe fails, Kubernetes restarts your pod, the new pod also can't reach the database, it gets restarted too — now you have a fleet of CrashLooping pods and the database recovers to find zero healthy backends. Liveness should check "is this process healthy?" — not "is the world healthy?" Put dependency checks in readiness probes instead.
Flashcard check¶
| Question | Answer |
|---|---|
| Liveness probe fails. What happens? | Container is killed and restarted. |
| Readiness probe fails. What happens? | Pod is removed from Service endpoints — no traffic routed to it, but it keeps running. |
| What is a startup probe for? | Slow-starting apps. It blocks liveness and readiness probes until the app is initialized. |
| Should a liveness probe check database connectivity? | No. Liveness should only check the process itself. Put dependency checks in readiness probes. |
Part 7: The Hidden Problems¶
We've fixed three obvious failures. Now let's find the two hidden ones.
Init container failure¶
There's a pod we missed:
NAME READY STATUS RESTARTS AGE
notification-svc-4b5c6d7e8-f9a0 0/1 Init:Error 3 52m
The STATUS says Init:Error. The main container never started because an init container
failed. Init containers run sequentially — if any fails, the pod never reaches Running.
Init Containers:
  wait-for-db:
    State:     Terminated
    Reason:    Completed
    Exit Code: 0
  run-migrations:
    State:     Terminated
    Reason:    Error
    Exit Code: 1
wait-for-db passed, but run-migrations failed.
# Get the migration logs
kubectl logs notification-svc-4b5c6d7e8-f9a0 -n production -c run-migrations
Running database migrations...
ERROR: permission denied for table notifications
DETAIL: User "app_user" does not have INSERT privilege on "schema_migrations"
An RBAC issue — not Kubernetes RBAC, but database-level permissions. The migration user needs INSERT privileges on the schema_migrations table.
Gotcha: Init container failures are invisible if you only watch the main containers. kubectl get pods shows Init:Error or Init:CrashLoopBackOff in the STATUS column, but it's easy to miss. Always check init containers when a pod won't start and the main container has no logs.
PVC mount failure¶
One more hiding in the events:
5m Warning FailedAttachVolume pod/analytics-worker-1a2b3c4d5-e6f7 Multi-Attach error
for volume "pvc-a1b2c3d4" Volume is already exclusively attached to one node
Multi-Attach error: the PV is ReadWriteOnce (RWO), mountable on only one node. During
a rolling update, the new pod landed on a different node and both pods tried to claim the
volume.
Fixes: use strategy: Recreate instead of RollingUpdate for RWO workloads, switch to
ReadWriteMany (RWX) if storage supports it, or use a StatefulSet.
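The Deployment-level fix is a one-line strategy change — a sketch:

```yaml
# Sketch: avoid Multi-Attach during rollouts of RWO workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
spec:
  strategy:
    type: Recreate   # delete the old pod (releasing the volume) before creating the new one
```

The trade-off is a brief downtime window during each rollout — the old pod must fully terminate and detach its volume before the replacement starts. For a single-writer worker, that's usually acceptable; for serving traffic, prefer RWX storage or a StatefulSet.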
Part 8: DNS — The Problem That Doesn't Look Like DNS¶
Cart-service is Running and Ready, but logging DNS failures:
2026-03-23T09:22:01Z ERROR Failed to reach api.stripe.com: dial tcp: lookup api.stripe.com: i/o timeout
The debugging ladder for DNS problems:
# 1. Confirm DNS is broken from inside the pod
kubectl exec cart-service-7f8b9c4d5-x2k9m -n production -- nslookup api.stripe.com
# → "connection timed out; no servers could be reached"
# 2. Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# → Running, healthy
# 3. Does internal DNS work?
kubectl exec cart-service-7f8b9c4d5-x2k9m -n production -- nslookup rabbitmq.production.svc.cluster.local
# → Resolves to 10.96.45.12
Internal DNS works, external doesn't. This narrows it to CoreDNS upstream config or a
NetworkPolicy blocking DNS egress. Check kubectl get configmap coredns -n kube-system -o yaml
for the forward directive, and check NetworkPolicies — the egress policy from Part 3 that
allows only port 443 would also block DNS (UDP 53) if applied broadly.
Remember: DNS in Kubernetes uses UDP port 53 (and TCP 53 for large responses). Any egress NetworkPolicy that doesn't explicitly allow port 53 to the kube-dns service will break all DNS resolution for affected pods. This is the most common NetworkPolicy footgun.
Trivia: The default ndots setting in a Kubernetes pod's resolv.conf is 5. This means any hostname with fewer than 5 dots gets the search domains appended first. A lookup for api.stripe.com (2 dots) actually tries api.stripe.com.production.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, then finally api.stripe.com. — that's 4 DNS queries for one lookup. For external-heavy workloads, setting dnsConfig.options: [{name: ndots, value: "2"}] in the pod spec dramatically reduces DNS query volume.
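The expansion is mechanical — a sketch of what the resolver tries with ndots: 5 and a typical pod search list:

```shell
# Sketch: resolver search-list expansion for ndots:5
name="api.stripe.com"
search="production.svc.cluster.local svc.cluster.local cluster.local"
dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
if [ "$dots" -lt 5 ]; then
  for domain in $search; do
    echo "$name.$domain"   # tried first; each returns NXDOMAIN
  done
fi
echo "$name."              # the absolute query that finally succeeds
```

Four queries (and three guaranteed failures) for one external hostname. Appending a trailing dot in application config (api.stripe.com.) is another way to skip the search list entirely.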
Part 9: Ephemeral Debug Containers¶
When a container is built from a distroless or scratch image, there's no shell to exec into.
No sh, no bash, no curl, no nslookup. This is where ephemeral debug containers
shine.
# Attach a debug container to a running pod
kubectl debug -it cart-service-7f8b9c4d5-x2k9m -n production \
--image=nicolaka/netshoot --target=cart-service
The --target flag shares the process namespace with the specified container. You can see
its processes, its environment, and its filesystem at /proc/1/root/.
# Inside the debug container:
ps aux # See all processes in the target container
cat /proc/1/root/etc/resolv.conf # Read the target's DNS config
nslookup api.stripe.com # Test DNS from the pod's network namespace
curl -v http://localhost:8080/healthz # Hit the app's health endpoint
For crashing pods (too fast to exec into), copy the pod with the entrypoint replaced:
kubectl debug order-processor-8a9b0c1d2-m3n4 -n production \
-it --copy-to=debug-order --container=order-processor \
--image=order-processor:v3.2.1 -- /bin/sh
For node-level debugging without SSH:
kubectl debug node/worker-2 -it --image=ubuntu
chroot /host # Host filesystem is at /host
journalctl -u kubelet --since "10 minutes ago"
Trivia: Ephemeral debug containers were a long time coming. The feature was introduced as alpha in Kubernetes v1.16 (2019), reached beta in v1.23 (2021), and finally went GA in v1.25 (2022). Before that, debugging distroless containers meant rebuilding the image with debugging tools baked in — which defeats the entire purpose of minimal images.
Part 10: RBAC Errors From Inside Pods¶
One more class of failure you'll hit: the pod runs fine, but fails when it tries to talk to the Kubernetes API.
ERROR: forbidden: User "system:serviceaccount:production:default" cannot list
resource "configmaps" in API group "" in the namespace "production"
The pod is using the default service account, which has no permissions. The application
needs to read ConfigMaps but no one created a Role and RoleBinding.
# Check what service account the pod uses
kubectl get pod notification-svc-4b5c6d7e8-f9a0 -n production \
-o jsonpath='{.spec.serviceAccountName}'
Fix: create a ServiceAccount, a Role granting only the needed permissions, and a
RoleBinding connecting them. Then set serviceAccountName: notification-svc in the pod spec. Verify before redeploying: kubectl auth can-i list configmaps -n production --as=system:serviceaccount:production:notification-svc should print yes.
# ServiceAccount + Role + RoleBinding (all in namespace: production)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: notification-svc
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: notification-svc-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: notification-svc-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: notification-svc
  namespace: production   # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: notification-svc-role
  apiGroup: rbac.authorization.k8s.io
Gotcha: If a pod doesn't need to talk to the Kubernetes API at all, set automountServiceAccountToken: false in the pod spec. This prevents the service account token from being mounted, which is both a security best practice and eliminates a whole class of RBAC debugging.
The Decision Tree¶
When you encounter a broken pod, start here:
kubectl get pods -o wide
|
v
What is the STATUS?
|
+-- Pending
| → kubectl describe pod → Events → FailedScheduling
| → Check: resource requests, taints, node affinity, PVC binding
|
+-- Init:Error / Init:CrashLoopBackOff
| → kubectl logs <pod> -c <init-container-name>
| → Init containers run sequentially; find which one failed
|
+-- ImagePullBackOff
| → kubectl describe pod → Events → look for 401, 404, or network error
| → Check: image name/tag, imagePullSecrets, registry access
|
+-- CrashLoopBackOff
| → kubectl logs <pod> --previous
| → Check exit code: 137=OOM, 1=app error, 127=cmd not found
| → If no logs: kubectl debug or override entrypoint
|
+-- Running but not Ready
| → Readiness probe failing
| → kubectl describe pod → look for "Readiness probe failed" in Events
|
+-- Running and Ready but not working
| → Test from inside: kubectl exec <pod> -- curl localhost:<port>
| → Check Service endpoints: kubectl get endpoints <svc>
| → Check NetworkPolicy, DNS, upstream dependencies
|
+-- Terminating (stuck)
→ Finalizer blocking deletion
→ kubectl get pod <name> -o jsonpath='{.metadata.finalizers}'
→ Or container ignoring SIGTERM (PID 1 problem)
Cheat Sheet¶
| Symptom | First command | What to look for |
|---|---|---|
| Pod Pending | kubectl describe pod <name> |
FailedScheduling in Events: resources, taints, PVC |
| CrashLoopBackOff | kubectl logs <pod> --previous |
Exit code + error message |
| ImagePullBackOff | kubectl describe pod <name> |
401 (auth), 404 (not found), timeout (network) |
| OOMKilled | kubectl describe pod <name> |
Reason: OOMKilled in Last State, then check limits |
| Probe failures | kubectl describe pod <name> |
Unhealthy events, check probe config |
| No endpoints | kubectl get endpoints <svc> |
Empty = label selector mismatch |
| DNS failure | kubectl exec <pod> -- nslookup kubernetes.default |
If fails: check CoreDNS pods and NetworkPolicy |
| PVC mount fail | kubectl describe pod <name> |
FailedMount or Multi-Attach error |
| RBAC error | kubectl logs <pod> |
forbidden: User "system:serviceaccount:..." |
| Init failure | kubectl logs <pod> -c <init-container> |
Init containers run in order; find the failed one |
The flags you'll type most often: --previous (crash logs), -c <name> (multi-container),
-o wide (node placement), -A (all namespaces), --sort-by=.lastTimestamp (events).
Exercises¶
Exercise 1: Read the room (2 minutes)¶
Given this kubectl get pods output, rank the pods by urgency and explain your reasoning:
NAME READY STATUS RESTARTS AGE
api-v2-abc 0/1 CrashLoopBackOff 12 (4m ago) 1h
worker-def 0/1 Pending 0 3h
cache-ghi 1/1 Running 0 5d
frontend-jkl 0/1 ImagePullBackOff 0 10m
Answer
1. **api-v2**: CrashLoopBackOff, 12 restarts — actively hurting users. `kubectl logs --previous` now.
2. **frontend-jkl**: ImagePullBackOff for 10 minutes — likely a bad deploy. Quick fix (secret/typo).
3. **worker-def**: Pending 3 hours — nobody noticed, probably a background worker. Check scheduling.
4. **cache-ghi**: Running, 5 days, no restarts — fine, but "Running" != "working correctly."

Exercise 2: Debug the probe (5 minutes)¶
A pod keeps getting killed every 5 minutes. kubectl describe pod shows repeated Liveness
probe failed events. The probe config is:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 1
  failureThreshold: 3
The app takes 2-3 seconds to respond to /healthz under load. Write a better probe
configuration.
Answer
The 1-second timeout is too aggressive. Increase `timeoutSeconds` to 5, add a `startupProbe` with `failureThreshold: 30` for slow starts, and bump `failureThreshold` on liveness to 6. Also: is `/healthz` doing too much work? Liveness endpoints should be lightweight — move dependency checks to a readiness probe.

Exercise 3: The resource puzzle (5 minutes)¶
A namespace has a ResourceQuota capping requests.cpu at 8 cores. Current usage: 7 CPUs
requested. You deploy a pod requesting 2 CPUs. The Deployment looks fine, the ReplicaSet
exists, but no pod appears. Where do you look?
Answer
`kubectl describe replicaset <name>` — the ReplicaSet controller tried to create the pod and the quota admission check rejected it. The ReplicaSet's Events show a `forbidden: exceeded quota: requests.cpu` error. Quota rejections happen before a pod object exists, so there's nothing Pending — the pod is never created at all.

Takeaways¶
- The debugging ladder is a sequence, not a menu. Get, Describe, Logs, Exec, Events, Node. Start at the top.
- Describe pod, not logs, is your first stop. The Events section tells the infrastructure story — scheduling, image pulls, probes, mounts. Logs tell the application story. Infrastructure first.
- CrashLoopBackOff exponential backoff caps at 5 minutes. The sequence: 10s, 20s, 40s, 80s, 160s, 300s. Once you fix the issue, the pod recovers on the next cycle.
- Exit codes above 128 mean "killed by signal." Subtract 128 to get the signal number. 137 = SIGKILL (OOMKilled). 143 = SIGTERM.
- NetworkPolicy is default-deny once any policy selects a pod. If you add a policy that allows only port 443, you've just blocked every other port — including DNS (53).
- Requests affect scheduling. Limits affect runtime. A pod can be Pending (requests too high) even with an empty cluster, or OOMKilled (limits too low) even with plenty of node memory.
Related Lessons¶
- What Happens When You kubectl apply — the full lifecycle from YAML to running pod
- What Happens When Kubernetes Evicts Your Pod — node pressure, QoS classes, eviction thresholds
- Kubernetes Services: How Traffic Finds Your Pod — endpoints, kube-proxy, and the network path
- Why DNS Is Always the Problem — CoreDNS, ndots, search domains
- Connection Refused — differential diagnosis across the stack
- Out of Memory — OOM killer, cgroups, and memory management