Troubleshooting Flows Cheat Sheet¶
Quick decision trees for common DevOps problems. Follow the arrows.
Remember: Exit codes tell you how a container died. The critical ones: 0 = clean exit, 1 = application error, 137 = OOMKilled or SIGKILL (128+9), 139 = segfault (128+11), 143 = SIGTERM (128+15). Any exit code > 128 means the process was killed by a signal: subtract 128 to get the signal number.
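The subtract-128 rule can be sketched as a small shell helper (`decode_exit` is a hypothetical name, not a real tool):

```shell
# Decode a container exit code into a human-readable cause.
# Codes > 128 mean "killed by signal (code - 128)".
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by SIG$(kill -l $((code - 128)))"
  else
    echo "exited with code $code"
  fi
}

decode_exit 137   # killed by SIGKILL (OOMKilled)
decode_exit 139   # killed by SIGSEGV (segfault)
decode_exit 143   # killed by SIGTERM (graceful shutdown)
```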
Pod Not Starting¶
Pod status?
├── Pending
│ ├── "Insufficient cpu/memory" → Scale nodes or reduce requests
│ ├── "No nodes match" → Check nodeSelector, tolerations, affinity
│ ├── "PVC not bound" → Check StorageClass, PV availability
│ └── "Unschedulable" → Check taints, cordoned nodes
├── ContainerCreating
│ ├── Stuck mounting → Check PVC, CSI driver, node storage
│ └── Stuck pulling → Check image name, registry auth, network
├── CrashLoopBackOff
│ ├── Exit code 1 → App error: kubectl logs --previous
│ ├── Exit code 137 → OOMKilled: increase memory limit
│ ├── Exit code 139 → Segfault: debug binary/dependencies
│ └── Immediate crash → Wrong CMD, missing config/env
├── ImagePullBackOff
│ ├── 401 Unauthorized → Check imagePullSecrets
│ ├── Not found → Check image:tag spelling
│ └── Timeout → Check network/registry availability
└── Running but not Ready
└── Readiness probe failing → Check probe path, port, app health
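For the last branch: a readiness probe only passes when its path and port match what the app actually serves. A minimal sketch (path, port, and timings here are assumptions — adjust to your app):

```yaml
readinessProbe:
  httpGet:
    path: /healthz      # must be an endpoint the app really serves
    port: 8080          # must match the port the container listens on
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3   # pod leaves the Service endpoints after 3 consecutive failures
```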
Service Not Reachable¶
kubectl get endpoints <svc>
├── Endpoints empty?
│ ├── Yes → Labels don't match: compare svc selector with pod labels
│ └── No → Endpoints exist, continue...
│
kubectl exec debug-pod -- curl <svc>:<port>
├── Connection refused?
│ └── App not listening on that port: kubectl exec <pod> -- ss -tlnp
├── Connection timed out?
│ ├── NetworkPolicy blocking? → kubectl get netpol -n <ns>
│ └── Wrong port in Service spec? → Check targetPort
├── DNS not resolving?
│ ├── CoreDNS running? → kubectl get pods -n kube-system -l k8s-app=kube-dns
│ └── Check resolv.conf → kubectl exec <pod> -- cat /etc/resolv.conf
└── Works internally but not externally?
├── Ingress misconfigured → Check rules, host, path
├── LoadBalancer pending → Check cloud provider, annotations
└── NodePort → Check security groups allow the port
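Two of the branches above (empty endpoints, wrong targetPort) come down to the Service spec itself. A minimal sketch with hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc          # hypothetical
spec:
  selector:
    app: my-app         # must match the pods' labels exactly, or endpoints stay empty
  ports:
  - port: 80            # port clients dial via the Service
    targetPort: 8080    # port the container actually listens on (verify with ss -tlnp)
```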
High Error Rate¶
Where are errors?
├── Application level (5xx)
│ ├── Check app logs: kubectl logs <pod> --tail=100
│ ├── Recent deployment? → kubectl rollout history deploy/<name>
│ ├── Resource pressure? → kubectl top pods
│ └── Dependency down? → Check downstream services
├── Infrastructure level
│ ├── Node not ready → kubectl describe node <name>
│ ├── Disk pressure → df -h on node
│ ├── Memory pressure → free -h on node
│ └── Network issues → Check CNI, kube-proxy
└── Ingress/Load Balancer
├── 502/503 → Backend pods not ready or restarting
├── 504 → Timeout: upstream too slow, increase timeout annotation
└── 429 → Rate limiting: check ingress rate-limit annotations
Debug clue: When facing 502/503 errors at an Ingress/Load Balancer, always check kubectl get pods -w. If pods are rapidly restarting (CrashLoopBackOff), the LB sees healthy → unhealthy → healthy cycling, which produces intermittent 502s. The fix is in the application, not the load balancer.
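The Ingress/LB branch of the tree can be condensed into a tiny lookup (`lb_hint` is a hypothetical helper, not a kubectl command):

```shell
# Map a load-balancer status code to the most likely cause from the tree above.
lb_hint() {
  case "$1" in
    502|503) echo "backend pods not ready or restarting" ;;
    504)     echo "timeout: upstream too slow, raise the timeout annotation" ;;
    429)     echo "rate limiting: check ingress rate-limit annotations" ;;
    *)       echo "check application logs" ;;
  esac
}

lb_hint 503   # backend pods not ready or restarting
```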
High Latency¶
Where is latency?
├── At the app?
│ ├── CPU throttled? → Check limits vs usage (kubectl top)
│ ├── GC pauses? → Check app-specific metrics
│ ├── DB queries slow? → Check connection pool, query patterns
│ └── External API slow? → Check dependency latency metrics
├── At the network?
│ ├── Cross-AZ traffic? → Check pod/service topology
│ ├── DNS slow? → Time DNS lookups, check ndots setting
│ ├── Service mesh overhead? → Check Envoy sidecar metrics
│ └── kube-proxy / iptables? → Check rule count, IPVS mode
└── At the infrastructure?
├── Disk I/O saturated? → iostat, check PV performance
├── Node overcommitted? → kubectl describe node (allocated %)
└── etcd slow? → Check etcd latency metrics
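For the CPU-throttling branch: the kernel reports nr_periods and nr_throttled in the container's cpu.stat (cgroup v2: cpu.stat in the pod's cgroup directory). The throttled fraction can be computed like this — a sketch with the two counters supplied by hand:

```shell
# Percentage of scheduler periods in which the cgroup was CPU-throttled,
# given nr_periods and nr_throttled read from the container's cpu.stat.
throttle_pct() {
  periods=$1
  throttled=$2
  echo $(( 100 * throttled / periods ))
}

throttle_pct 1000 250   # 25 (%); a sustained high value suggests the CPU limit is too low
```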
Deployment Rollback Decision¶
Deployment failing?
├── New pods not starting?
│ ├── Image issue → Fix image, redeploy
│ └── Config issue → Fix configmap/secret, redeploy
├── New pods starting but unhealthy?
│ ├── Canary/progressive? → Route back to stable
│ └── Rolling update? → kubectl rollout undo deploy/<name>
├── Partial rollout stuck?
│ ├── maxUnavailable reached → Old pods can't terminate
│ └── PDB blocking → Check PodDisruptionBudget
└── Helm release failed?
├── helm rollback <release> <revision>
├── Check: helm history <release>
└── Stuck pending? → kubectl delete secret -l status=pending-install
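On the "PDB blocking" branch: a PodDisruptionBudget whose minAvailable equals the replica count leaves zero eviction budget, which stalls rolling updates and node drains. A hypothetical example of the failure mode:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb      # hypothetical
spec:
  minAvailable: 1       # with replicas: 1 this blocks every voluntary eviction
  selector:
    matchLabels:
      app: my-app
```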
Certificate Issues¶
TLS error?
├── Certificate expired
│ ├── Check expiry: openssl s_client -connect host:443 -servername host </dev/null | openssl x509 -noout -dates
│ ├── cert-manager? → Check Certificate and CertificateRequest resources
│ └── Manual? → Renew and redeploy secret
├── Chain incomplete
│ ├── Missing intermediate cert → Add to tls.crt in correct order
│ └── openssl verify -CAfile ca.pem -untrusted intermediate.pem cert.pem
├── Name mismatch
│ ├── CN/SAN doesn't match hostname → Reissue with correct names
│ └── Check: openssl x509 -in cert.pem -noout -text | grep -A1 "Subject Alternative Name"
└── Connection reset/timeout
└── TLS not enabled on backend → Check service port name (must include "https")
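The openssl checks above can be rehearsed end-to-end against a throwaway self-signed cert (filenames here are arbitrary):

```shell
# Generate a short-lived self-signed cert to practice the checks on.
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
  -days 30 -subj "/CN=example.local" 2>/dev/null

# Validity window (the same check works on a cert fetched with s_client).
openssl x509 -in cert.pem -noout -dates

# Exit 0 only if the cert is still valid 7 days (604800 s) from now.
openssl x509 -in cert.pem -noout -checkend 604800 && echo "cert ok for 7 days"
```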
Cost Spike Investigation¶
Bill increased?
├── Which service?
│ ├── Compute → Check instance count, types, idle resources
│ ├── Data transfer → NAT Gateway, cross-AZ, egress
│ ├── Storage → EBS snapshots, S3 growth, orphaned volumes
│ └── Database → Oversized RDS, read replicas, backup retention
├── When did it start?
│ ├── Correlate with deployments/scaling events
│ └── Check auto-scaling groups — did max increase?
└── Quick wins
├── Delete orphaned resources (unattached EBS, unused EIPs)
├── Right-size underutilized instances
├── Schedule dev/test shutdowns
└── Enable S3 lifecycle policies
Gotcha: Kubernetes DNS uses ndots: 5 by default, which means any name with fewer than 5 dots gets the search domains appended first. A lookup for api.example.com (2 dots) generates 4-5 DNS queries before the final FQDN query succeeds. For latency-sensitive services, add dnsConfig.options: [{name: ndots, value: "2"}] to the pod spec, or use FQDNs with a trailing dot (api.example.com.).
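The ndots fix, as a pod-spec fragment:

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # names with 2+ dots (e.g. api.example.com) now resolve as absolute first
```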
Node Issues¶
Node NotReady?
├── Check conditions: kubectl describe node <name>
│ ├── MemoryPressure → Evict pods, add memory
│ ├── DiskPressure → Clean up images/logs
│ ├── PIDPressure → Find process leak
│ └── NetworkUnavailable → Check CNI plugin
├── kubelet not running?
│ ├── systemctl status kubelet
│ ├── journalctl -u kubelet --since "5 min ago"
│ └── Check certificates: ls -la /var/lib/kubelet/pki/
├── Node unreachable?
│ ├── Instance running? (cloud console)
│ ├── Network: security groups, NACLs
│ └── BMC/IPMI: check hardware health
└── Recovery
├── Drain: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
├── Fix issue on node
└── Uncordon: kubectl uncordon <node>