
Thinking Out Loud: Kubernetes Debugging

A senior SRE's internal monologue while working through a real Kubernetes debugging task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

PagerDuty fires at 2:47 AM: "payment-service 5xx rate > 5% for 3 minutes." The on-call dashboard shows the error rate spiking from 0.1% to 12% in the last four minutes. I'm on call this week for the payments team.

The Monologue

Okay, 2:47 AM. Let me not panic. First thing — is this actually real or is the monitoring flapping? Let me check the alert history... no, this is the first fire tonight. The 5xx rate is climbing. This is real.

Let me see what the pods look like right now.

kubectl get pods -n payments -l app=payment-service -o wide

Four pods, two are in CrashLoopBackOff. Okay, that explains the 5xx spike — half the fleet is down, so the remaining pods are getting hammered, and any time a crashing pod briefly passes its readiness probe before dying again, the Service routes real traffic back into it.
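A throwaway helper I keep around for exactly this moment: count pod states at a glance. The sample output is inlined so the parsing is visible without a cluster (the m1c2v pod name is made up for illustration; x2k9m and q4r7p are from this incident):

```shell
#!/usr/bin/env bash
# Count pods per STATUS from `kubectl get pods` output.
# STATUS is the 3rd whitespace-separated column in the default view.
count_states() {
  awk 'NR > 1 { states[$3]++ } END { for (s in states) print s, states[s] }'
}

# Inlined sample standing in for: kubectl get pods -n payments -l app=payment-service
count_states <<'EOF'
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-7f8b9c4d5-a8j3n   1/1     Running            0          3d
payment-service-7f8b9c4d5-m1c2v   1/1     Running            0          3d
payment-service-7f8b9c4d5-x2k9m   0/1     CrashLoopBackOff   5          3d
payment-service-7f8b9c4d5-q4r7p   0/1     CrashLoopBackOff   4          3d
EOF
```

In a real incident you'd pipe the live output: kubectl get pods -n payments -l app=payment-service | count_states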

Let me check when this started.

kubectl get events -n payments --sort-by=.lastTimestamp | tail -20

Events show the first crash was at 2:43 AM. That's four minutes ago, lines up with the alert. Now — was there a deploy? That's the first question I always ask when pods start crashing.

kubectl rollout history deployment/payment-service -n payments

Revision 47 was deployed at... 2:41 AM. There it is. Someone pushed a deploy two minutes before the crashes started. Classic.

Mental Model: The "What Changed?" Heuristic

When something breaks, the most likely cause is the most recent change. Before deep-diving into application internals, always check: was there a deploy, a config change, a node event, or a scaling event in the last 10 minutes?
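The 10-minute window in that heuristic is easy to mechanize. A sketch — recent_change is a hypothetical helper, and it assumes GNU date -d for timestamp parsing (Linux, not stock macOS):

```shell
#!/usr/bin/env bash
# Was this timestamp within the last N minutes? Exit 0 if yes.
# Feed it a deploy time, or the lastTimestamp column from
# `kubectl get events`.
recent_change() {
  local ts="$1" window_min="${2:-10}"
  local now ts_epoch
  now=$(date +%s)
  ts_epoch=$(date -d "$ts" +%s) || return 2   # GNU date assumed
  (( now - ts_epoch <= window_min * 60 ))
}

# recent_change "2025-06-01T02:41:00" && echo "suspect the deploy first"
```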

Let me look at the logs from one of the crashing pods.

kubectl logs payment-service-7f8b9c4d5-x2k9m -n payments --previous --tail=50

Ah — FATAL: database connection refused: ECONNREFUSED 10.96.45.12:5432. The app is crashing because it can't reach Postgres. But wait, if the DB was down, ALL pods would be crashing, not just two. Let me check the other pods.

kubectl logs payment-service-7f8b9c4d5-a8j3n -n payments --tail=10

The healthy pods are connecting to the DB just fine. So the DB is up. Then why are these two pods failing?

Wait. Let me look at what node those crashing pods are on.

kubectl get pods -n payments -l app=payment-service -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,RESTARTS:.status.containerStatuses[0].restartCount'

Both crashing pods are on node-worker-3. The healthy ones are on node-worker-1 and node-worker-2. This isn't an app problem — it's a node problem. The node can't reach the DB.
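One awk pass over that custom-columns output makes the pattern pop. Sample rows reconstructed from this incident — with one caveat: .status.phase often still reads Running while a container crash-loops, so I key off the restart count instead:

```shell
#!/usr/bin/env bash
# Which nodes host pods that have been restarting? Column layout
# matches the custom-columns query above: NAME STATUS NODE RESTARTS.
crashing_by_node() {
  awk 'NR > 1 && $4 > 0 { n[$3]++ } END { for (node in n) print node, n[node] }'
}

crashing_by_node <<'EOF'
NAME                              STATUS    NODE            RESTARTS
payment-service-7f8b9c4d5-a8j3n   Running   node-worker-1   0
payment-service-7f8b9c4d5-m1c2v   Running   node-worker-2   0
payment-service-7f8b9c4d5-x2k9m   Running   node-worker-3   5
payment-service-7f8b9c4d5-q4r7p   Running   node-worker-3   4
EOF
# -> node-worker-3 2
```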

Mental Model: Blast Radius Mapping

When some pods crash but not all, the differentiator is almost never the app code. It's the environment: which node, which AZ, which network path. Map the failure to the infrastructure topology before debugging the application.

Let me check the node.

kubectl describe node node-worker-3 | grep -A 5 Conditions

Node shows Ready, but let me check if there's a network issue. I've been burned before where the node shows Ready but has a degraded network interface — the kubelet heartbeat works because it uses a different path than the pod network.

kubectl exec -it payment-service-7f8b9c4d5-a8j3n -n payments -- nc -zv 10.96.45.12 5432

That works from a healthy pod. Now let me test from a debug pod on the bad node.

kubectl run debug-net --image=nicolaka/netshoot --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-worker-3"}}' --rm -it -- nc -zv 10.96.45.12 5432

Connection timed out. So the problem is confirmed — node-worker-3 can't reach the ClusterIP for the Postgres service. This is a CNI or kube-proxy issue on that specific node.

Actually, wait. I almost went down a rabbit hole debugging CNI. Let me first just fix the user impact and THEN debug. I'm going to cordon the node and let the pods reschedule.

kubectl cordon node-worker-3
kubectl delete pod payment-service-7f8b9c4d5-x2k9m payment-service-7f8b9c4d5-q4r7p -n payments

Mental Model: Mitigate First, Diagnose Second

In an incident, your first job is to stop the bleeding. Cordon the bad node, roll back the bad deploy, scale up healthy capacity. Root cause analysis happens after users stop getting errors. I've seen too many incidents where an engineer spent 20 minutes debugging while the error rate climbed because they wanted to understand before acting.
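The mitigation above, wrapped up the way I'd keep it in a runbook. A sketch, not battle-tested; KUBECTL is a variable only so the control flow can be exercised without a cluster, and the names are this incident's:

```shell
#!/usr/bin/env bash
# Cordon a bad node, then delete this app's pods there so the
# ReplicaSet reschedules them onto healthy nodes.
KUBECTL="${KUBECTL:-kubectl}"

evacuate_node() {
  local node="$1" ns="$2" selector="$3"
  # Stop the scheduler from placing anything new on the node.
  $KUBECTL cordon "$node" || return 1
  # Delete only this app's pods on that node; the field selector
  # keeps us from touching pods on healthy nodes.
  local pod
  for pod in $($KUBECTL get pods -n "$ns" -l "$selector" \
      --field-selector "spec.nodeName=$node" -o name); do
    $KUBECTL delete -n "$ns" "$pod"
  done
}

# evacuate_node node-worker-3 payments app=payment-service
```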

Good, new pods are scheduling on worker-1 and worker-2. Let me watch the rollout.

kubectl get pods -n payments -l app=payment-service -w

All four pods Running and Ready now. Let me check the error rate in Grafana... dropping. Down to 0.3% and falling. Good. Incident mitigated.

Now, back to worker-3. What's actually wrong? Let me SSH in and check kube-proxy.

ssh node-worker-3 'sudo iptables -t nat -L KUBE-SERVICES | head -20'

Hmm, the iptables rules look sparse. Let me check if kube-proxy is running.

kubectl get pods -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=node-worker-3

The kube-proxy pod is Running but... the restart count just ticked up, last restart 6 minutes ago. And kubectl describe on it shows the last container state: terminated, reason OOMKilled. When kube-proxy restarts there's a window where the iptables rules are being reprogrammed, and if it gets OOM-killed again mid-sync the rules never fully converge. On a cluster with this many services, that leaves ClusterIP routing on the node broken.

Actually, I just remembered — we bumped the service count last week when the platform team onboarded their 40 new microservices. More services means more iptables rules means more memory for kube-proxy. I bet the memory limit for kube-proxy is still set to the old value.

kubectl get ds kube-proxy -n kube-system -o jsonpath='{.spec.template.spec.containers[0].resources}'

Yep. Memory limit is 128Mi. With 400+ services, kube-proxy needs more. I'll file a ticket to bump this, but for now I need to uncordon the node once kube-proxy has fully synced its rules.
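Why 128Mi stopped being enough is roughly arithmetic. The per-service and per-endpoint rule counts below are rough assumptions (the real numbers vary by Kubernetes version, proxy mode, and service type), but the shape of the scaling is the point:

```shell
#!/usr/bin/env bash
# Back-of-envelope: iptables nat rules grow linearly with services
# and endpoints. Assumed constants: ~3 rules per service
# (KUBE-SERVICES + KUBE-SVC chains), ~2 per endpoint (KUBE-SEP).
estimate_nat_rules() {
  local services="$1" endpoints_per_service="$2"
  echo $(( services * 3 + services * endpoints_per_service * 2 ))
}

estimate_nat_rules 400 4   # -> 4400 rules, vs 1100 at 100 services
```

kube-proxy keeps its own copy of all that state in memory, so quadrupling the service count under a fixed 128Mi limit is exactly how you end up with OOM kills.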

Let me verify the rules are complete, then uncordon.

ssh node-worker-3 'sudo iptables -t nat -L KUBE-SERVICES | wc -l'
kubectl uncordon node-worker-3

Rules look complete now. Node uncordoned. Let me write up the incident notes while this is fresh in my head. The root cause was kube-proxy OOM on worker-3 due to the increased service count, breaking ClusterIP routing for pods on that node. And the 2:41 deploy? Trigger, not cause: the rollout rescheduled pods, and two of the replacements landed on the broken node.

One more thing — I always set a reminder to check this tomorrow. Incidents at 3 AM get forgotten by the standup if you don't write them down.

What Made This Senior-Level

  • When pods crash, a junior looks at application code first; a senior checks deploy history and infrastructure topology first. Most crashes after a deploy are caused by the deploy, and most partial failures are environmental, not code.
  • A junior debugs the crashing pods for 20 minutes while users get errors; a senior cordons the bad node and lets pods reschedule immediately. Mitigate first, diagnose second — user impact is the clock you're racing.
  • A junior assumes CrashLoopBackOff means the app has a bug; a senior notices that only pods on one node are crashing and pivots to node-level debugging. Blast radius mapping eliminates whole categories of root causes.
  • A junior wouldn't connect the kube-proxy OOM to the recent service-count increase; a senior correlates the resource limits with the recent infrastructure change. Experienced engineers maintain a mental timeline of recent changes across teams.

Key Heuristics Used

  1. What Changed?: When something breaks, the most recent change is the most likely cause — check deploys, config changes, and scaling events within the last 10 minutes before deep-diving.
  2. Blast Radius Mapping: When only some replicas fail, the differentiator is the environment (node, AZ, network path), not the application code.
  3. Mitigate First, Diagnose Second: Stop user impact before root-causing — cordon nodes, roll back deploys, scale up healthy capacity, then investigate.

Cross-References

  • Primer — The foundational debugging framework and failure cascade model
  • Street Ops — The pod failure triage flowchart and decision trees used here
  • Footguns — Common debugging traps including the "debug before mitigate" anti-pattern