Troubleshooting Flows Cheat Sheet

Quick decision trees for common DevOps problems. Follow the arrows.

Remember: Exit codes tell you how a container died. The critical ones: 0 = clean exit, 1 = application error, 137 = OOMKilled or SIGKILL (128+9), 139 = segfault (128+11), 143 = SIGTERM (128+15). Any exit code > 128 means the process was killed by a signal: subtract 128 to get the signal number.
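The subtraction rule can be wrapped in a tiny helper for use during an incident (a sketch; `decode_exit` is a made-up name):

```shell
# decode_exit: map a container exit code to its likely cause.
# Codes > 128 mean "killed by signal (code - 128)".
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    local sig=$((code - 128))
    echo "killed by signal $sig ($(kill -l "$sig" 2>/dev/null || echo unknown))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit 137   # → killed by signal 9 (KILL)
decode_exit 143   # → killed by signal 15 (TERM)
```

Feed it the `Exit Code` value from `kubectl describe pod` under `Last State`.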

Pod Not Starting

Pod status?
├── Pending
│   ├── "Insufficient cpu/memory" → Scale nodes or reduce requests
│   ├── "No nodes match" → Check nodeSelector, tolerations, affinity
│   ├── "PVC not bound" → Check StorageClass, PV availability
│   └── "Unschedulable" → Check taints, cordoned nodes
├── ContainerCreating
│   ├── Stuck mounting → Check PVC, CSI driver, node storage
│   └── Stuck pulling → Check image name, registry auth, network
├── CrashLoopBackOff
│   ├── Exit code 1 → App error: kubectl logs --previous
│   ├── Exit code 137 → OOMKilled: increase memory limit
│   ├── Exit code 139 → Segfault: debug binary/dependencies
│   └── Immediate crash → Wrong CMD, missing config/env
├── ImagePullBackOff
│   ├── 401 Unauthorized → Check imagePullSecrets
│   ├── Not found → Check image:tag spelling
│   └── Timeout → Check network/registry availability
└── Running but not Ready
    └── Readiness probe failing → Check probe path, port, app health
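Two of the branches above (exit 137 and "not Ready") end in container spec changes. A minimal sketch of the relevant fields; the image, probe path, port, and sizes are placeholders, not recommendations:

```yaml
# Fragment of a Deployment pod template (names are hypothetical).
containers:
  - name: app
    image: registry.example.com/app:1.2.3
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"    # raise this if the container exits with 137 (OOMKilled)
    readinessProbe:        # pod stays Running-but-not-Ready until this succeeds
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
```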

Service Not Reachable

kubectl get endpoints <svc>
├── Endpoints empty?
│   ├── Yes → Labels don't match: compare svc selector with pod labels
│   └── No → Endpoints exist, continue...
kubectl exec debug-pod -- curl <svc>:<port>
├── Connection refused?
│   └── App not listening on that port: kubectl exec <pod> -- ss -tlnp
├── Connection timed out?
│   ├── NetworkPolicy blocking? → kubectl get netpol -n <ns>
│   └── Wrong port in Service spec? → Check targetPort
├── DNS not resolving?
│   ├── CoreDNS running? → kubectl get pods -n kube-system -l k8s-app=kube-dns
│   └── Check resolv.conf → kubectl exec <pod> -- cat /etc/resolv.conf
└── Works internally but not externally?
    ├── Ingress misconfigured → Check rules, host, path
    ├── LoadBalancer pending → Check cloud provider, annotations
    └── NodePort → Check security groups allow the port
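Empty Endpoints almost always mean the Service selector and the pod labels disagree. A minimal matching pair, with hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # must match the pod template labels exactly
  ports:
    - port: 80          # the port clients hit via <svc>:<port>
      targetPort: 8080  # the port the container actually listens on
---
# Inside the Deployment, the pod template must carry the same label:
#   template:
#     metadata:
#       labels:
#         app: web      # ← matches the selector above
```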

High Error Rate

Where are errors?
├── Application level (5xx)
│   ├── Check app logs: kubectl logs <pod> --tail=100
│   ├── Recent deployment? → kubectl rollout history deploy/<name>
│   ├── Resource pressure? → kubectl top pods
│   └── Dependency down? → Check downstream services
├── Infrastructure level
│   ├── Node not ready → kubectl describe node <name>
│   ├── Disk pressure → df -h on node
│   ├── Memory pressure → free -h on node
│   └── Network issues → Check CNI, kube-proxy
└── Ingress/Load Balancer
    ├── 502/503 → Backend pods not ready or restarting
    ├── 504 → Timeout: upstream too slow, increase timeout annotation
    └── 429 → Rate limiting: check ingress rate-limit annotations

Debug clue: When facing 502/503 errors at an Ingress/Load Balancer, always check kubectl get pods -w — if pods are rapidly restarting (CrashLoopBackOff), the LB sees healthy → unhealthy → healthy cycling, which produces intermittent 502s. The fix is in the application, not the load balancer.
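For the "check app logs" step, a one-liner can turn raw access logs into an error rate. This assumes Common Log Format with the status code in field 9; the sample lines below are stand-ins for real traffic:

```shell
# Quick 5xx rate from an access log in Common Log Format
# (status code is awk field $9).
cat > /tmp/access.log <<'EOF'
10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2025:00:00:02 +0000] "GET /api HTTP/1.1" 502 64
10.0.0.3 - - [01/Jan/2025:00:00:03 +0000] "GET /api HTTP/1.1" 503 64
10.0.0.4 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 512
EOF
awk '$9 ~ /^5/ {err++} END {printf "5xx: %d/%d (%.0f%%)\n", err, NR, 100*err/NR}' /tmp/access.log
# → 5xx: 2/4 (50%)
```

In a live incident, pipe `kubectl logs <pod> --tail=1000` into the same awk instead of a temp file.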

High Latency

Where is latency?
├── At the app?
│   ├── CPU throttled? → Check limits vs usage (kubectl top)
│   ├── GC pauses? → Check app-specific metrics
│   ├── DB queries slow? → Check connection pool, query patterns
│   └── External API slow? → Check dependency latency metrics
├── At the network?
│   ├── Cross-AZ traffic? → Check pod/service topology
│   ├── DNS slow? → Time DNS lookups, check ndots setting
│   ├── Service mesh overhead? → Check Envoy sidecar metrics
│   └── kube-proxy / iptables? → Check rule count, IPVS mode
└── At the infrastructure?
    ├── Disk I/O saturated? → iostat, check PV performance
    ├── Node overcommitted? → kubectl describe node (allocated %)
    └── etcd slow? → Check etcd latency metrics
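When eyeballing dependency latency from raw samples rather than a metrics backend, a nearest-rank p95 needs only sort and awk. The sample values here are stand-ins; note how a single outlier dominates the tail:

```shell
# Rough nearest-rank p95 over raw latency samples (ms), one value per line.
printf '%s\n' 12 14 15 18 20 22 25 30 45 250 > /tmp/latencies.txt
sort -n /tmp/latencies.txt | awk '
  { v[NR] = $1 }
  END {
    # nearest-rank percentile: ceil(0.95 * N)
    idx = int(NR * 0.95); if (idx < NR * 0.95) idx++
    printf "p95=%sms over %d samples\n", v[idx], NR
  }'
# → p95=250ms over 10 samples
```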

Deployment Rollback Decision

Deployment failing?
├── New pods not starting?
│   ├── Image issue → Fix image, redeploy
│   └── Config issue → Fix configmap/secret, redeploy
├── New pods starting but unhealthy?
│   ├── Canary/progressive? → Route back to stable
│   └── Rolling update? → kubectl rollout undo deploy/<name>
├── Partial rollout stuck?
│   ├── maxUnavailable reached → Old pods can't terminate
│   └── PDB blocking → Check PodDisruptionBudget
└── Helm release failed?
    ├── helm rollback <release> <revision>
    ├── Check: helm history <release>
    └── Stuck pending? → kubectl delete secret -l status=pending-install
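If a rollout (or the node drain happening underneath it) is stuck on a PodDisruptionBudget, this is the shape to look for. With `minAvailable` equal to the replica count, every voluntary eviction is blocked indefinitely (names hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # if the Deployment has replicas: 2, no pod can ever be evicted
  selector:
    matchLabels:
      app: web
```

`kubectl get pdb` shows `ALLOWED DISRUPTIONS: 0` when a budget is the blocker.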

Certificate Issues

TLS error?
├── Certificate expired
│   ├── Check expiry: openssl s_client -connect host:443 -servername host </dev/null 2>/dev/null | openssl x509 -noout -dates
│   ├── cert-manager? → Check Certificate and CertificateRequest resources
│   └── Manual? → Renew and redeploy secret
├── Chain incomplete
│   ├── Missing intermediate cert → Add to tls.crt in correct order
│   └── Verify: openssl verify -CAfile ca.pem -untrusted intermediate.pem cert.pem
├── Name mismatch
│   ├── CN/SAN doesn't match hostname → Reissue with correct names
│   └── Check: openssl x509 -in cert.pem -noout -text | grep -A1 "Subject Alternative Name"
└── Connection reset/timeout
    └── TLS not enabled on backend → Check service port name (must include "https")
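The openssl checks above can be rehearsed against a throwaway self-signed certificate before pointing them at production (requires OpenSSL 1.1.1+ for `-addext`; all names are hypothetical):

```shell
# Create a short-lived self-signed cert, then run the same checks
# the tree above applies to a real one.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/tls.key -out /tmp/tls.crt -days 30 \
  -subj "/CN=api.example.com" \
  -addext "subjectAltName=DNS:api.example.com,DNS:www.example.com" \
  2>/dev/null

openssl x509 -in /tmp/tls.crt -noout -dates        # prints notBefore= / notAfter=
openssl x509 -in /tmp/tls.crt -noout -text \
  | grep -A1 "Subject Alternative Name"            # lists the DNS SANs
```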

Cost Spike Investigation

Bill increased?
├── Which service?
│   ├── Compute → Check instance count, types, idle resources
│   ├── Data transfer → NAT Gateway, cross-AZ, egress
│   ├── Storage → EBS snapshots, S3 growth, orphaned volumes
│   └── Database → Oversized RDS, read replicas, backup retention
├── When did it start?
│   ├── Correlate with deployments/scaling events
│   └── Check auto-scaling groups — did max increase?
└── Quick wins
    ├── Delete orphaned resources (unattached EBS, unused EIPs)
    ├── Right-size underutilized instances
    ├── Schedule dev/test shutdowns
    └── Enable S3 lifecycle policies

Gotcha: Kubernetes DNS uses ndots: 5 by default, which means any name with fewer than 5 dots gets the search domains appended first. A lookup for api.example.com (2 dots) generates 4-5 DNS queries before the final FQDN query succeeds. For latency-sensitive services, add dnsConfig.options: [{name: ndots, value: "2"}] to the pod spec, or use FQDNs with a trailing dot (api.example.com.).
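The gotcha's fix, as a pod spec fragment:

```yaml
# Pod spec fragment: cut search-domain expansion for external lookups.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"    # names with 2+ dots are tried as absolute first
```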

Node Issues

Node NotReady?
├── Check conditions: kubectl describe node <name>
│   ├── MemoryPressure → Evict pods, add memory
│   ├── DiskPressure → Clean up images/logs
│   ├── PIDPressure → Find process leak
│   └── NetworkUnavailable → Check CNI plugin
├── kubelet not running?
│   ├── systemctl status kubelet
│   ├── journalctl -u kubelet --since "5 min ago"
│   └── Check certificates: ls -la /var/lib/kubelet/pki/
├── Node unreachable?
│   ├── Instance running? (cloud console)
│   ├── Network: security groups, NACLs
│   └── BMC/IPMI: check hardware health
└── Recovery
    ├── Drain: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
    ├── Fix issue on node
    └── Uncordon: kubectl uncordon <node>