Decision Tree: Service Returning 5xx Errors
Category: Incident Triage
Starting Question: "My service is returning 5xx errors — where do I start?"
Estimated traversal: 2-5 minutes
Domains: kubernetes, networking, observability, linux-performance
The Tree
My service is returning 5xx errors — where do I start?
│
├── What is the error code?
│ │
│ ├── 502 / 503 / 504 (Gateway errors — proxy/ingress/LB is failing)
│ │ │
│ │ ├── Is it ALL traffic or only specific paths?
│ │ │ │
│ │ │ ├── All traffic
│ │ │ │ │
│ │ │ │ ├── Check pod count: `kubectl get pods -l app=<svc>`
│ │ │ │ │ │
│ │ │ │ │ ├── 0 running pods → ✅ ACTION: Fix Deployment (scale up / fix CrashLoop)
│ │ │ │ │ │
│ │ │ │ │ └── Pods exist but not Ready
│ │ │ │ │ │
│ │ │ │ │ ├── Check readiness probe: `kubectl describe pod <pod>`
│ │ │ │ │ │ └── Probe failing → ✅ ACTION: Fix Readiness Probe or App Startup
│ │ │ │ │ │
│ │ │ │ │ └── Service selector mismatch?
│ │ │ │ │ `kubectl get endpoints <svc>`
│ │ │ │ │ └── Empty ENDPOINTS → ✅ ACTION: Fix Service Selector
│ │ │ │ │
│ │ │ │ └── Is this a 504 specifically? (timeout)
│ │ │ │ │
│ │ │ │ ├── Check upstream latency vs configured timeout
│ │ │ │ │ `kubectl get ingress <name> -o yaml | grep timeout`
│ │ │ │ │ └── Latency > timeout → ✅ ACTION: Increase Timeout or Fix Upstream Latency
│ │ │ │ │
│ │ │ │ └── Check downstream dependency (DB / cache)
│ │ │ │ `kubectl exec -it <pod> -- nc -zv db-service 5432` (curl speaks HTTP, so it gives confusing output against a database port; nc reports plain connect success/failure)
│ │ │ │ └── Dependency unreachable → go to dependency branch
│ │ │ │
│ │ │ └── Specific paths only
│ │ │ │
│ │ │ ├── Check ingress routing rules
│ │ │ │ `kubectl get ingress <name> -o yaml`
│ │ │ │ └── Path not defined or wrong serviceName → ✅ ACTION: Fix Ingress Rule
│ │ │ │
│ │ │ └── Check if path-specific service exists and has endpoints
│ │ │ `kubectl get svc,endpoints -n <namespace>`
│ │ │ └── Missing service → ✅ ACTION: Deploy Missing Service
│ │ │
│ │ └── Check ingress controller logs
│ │ `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100`
│ │ └── Upstream connection refused / reset → upstream pods are crashing
│ │ → See 500 branch below
│ │
│ └── 500 (Application error — the app itself is failing)
│ │
│ ├── Check application logs for exceptions
│ │ `kubectl logs <pod> --tail=200 | grep -i "error\|exception\|panic\|fatal"`
│ │ │
│ │ ├── Exception found with stack trace → ✅ ACTION: Fix Application Bug / Roll Back
│ │ │
│ │ ├── No logs at all → check if pod is restarting
│ │ │ `kubectl get pod <pod> -w`
│ │ │ └── Restart count rising → see pod-wont-start.md
│ │ │
│ │ └── "Connection refused" / timeout to dependency
│ │ │
│ │ ├── Is it DB? → Check DB health
│ │ │ `kubectl exec -it <pod> -- psql -h $DB_HOST -U $DB_USER -c '\l'`
│ │ │ └── Cannot connect → ✅ ACTION: Fix DB Connectivity / Connection Pool
│ │ │
│ │ └── Is it Redis/cache? → Check cache health
│ │ `kubectl exec -it <pod> -- redis-cli -h $REDIS_HOST ping`
│ │ └── No PONG → ✅ ACTION: Fix Cache Connectivity
│ │
│ ├── Was there a recent deployment?
│ │ `kubectl rollout history deployment/<name>`
│ │ │
│ │ ├── Yes, deployed in last 60 min → ✅ ACTION: Roll Back Deployment
│ │ │
│ │ └── No recent deploy → check config / secret changes
│ │ `kubectl describe deployment <name> | grep -A5 "Environment"`
│ │ └── Wrong env var / missing secret → ✅ ACTION: Fix ConfigMap / Secret
│ │
│ └── Are errors correlated with high load?
│ Check CPU/memory: `kubectl top pod -l app=<svc>`
│ │
│ ├── CPU throttling or memory near limit
│ │ └── ✅ ACTION: Scale Horizontally or Increase Limits
│ │
│ └── Resources look fine
│ └── ⚠️ ESCALATION: Engage App Owner for deeper profiling
Node Details
Check 1: Identify the error code
Command: kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200 | grep -E " 5[0-9]{2} " or check your APM/metrics dashboard for status code breakdown.
What you're looking for: Whether errors are 502/503/504 (proxy layer, upstream unreachable) vs 500/5xx (application layer, upstream reachable but returning error).
Common pitfall: A 503 from nginx-ingress means nginx cannot reach any backend pod — not that the app returned 503. Look at the ingress controller logs, not just the app logs.
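The ingress-log grep above can be extended into a quick tally of which 5xx codes dominate. A minimal sketch, assuming the default nginx-ingress access-log format where the status code is the 9th whitespace-separated field; the `printf` lines are stand-ins for real `kubectl logs` output:

```shell
# Tally 5xx responses per status code. In practice, pipe
# `kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=2000`
# into tally_5xx instead of the sample lines below.
tally_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ {count[$9]++} END {for (c in count) print c, count[c]}' | sort
}

printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 502 0 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 502 0 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:02 +0000] "GET /api HTTP/1.1" 504 1 "-" "curl"' \
  '10.0.0.1 - - [01/Jan/2025:00:00:03 +0000] "GET /ok HTTP/1.1" 200 5 "-" "curl"' \
  | tally_5xx
# 502 2
# 504 1
```

A tally dominated by 504 points at the timeout branch; mostly 502/503 points at missing or unready endpoints.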
Check 2: Pod count and readiness
Command: kubectl get pods -l app=<service-name> -n <namespace> then kubectl get endpoints <service-name> -n <namespace>
What you're looking for: ENDPOINTS should list at least one ip:port. An empty endpoints list means the service has no ready pods to route to.
Common pitfall: Pods may show Running but not Ready (1/1 vs 0/1 in the READY column). Check kubectl describe pod for the readiness probe failure message.
Check 3: Service selector mismatch
Command: kubectl get svc <name> -o jsonpath='{.spec.selector}' then kubectl get pods -l <key>=<value>
What you're looking for: The selector labels on the Service must match labels on the pods. A single typo means zero endpoints.
Common pitfall: A deployment rollout can change pod labels (e.g., version label) without updating the service selector.
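To compare the two sides quickly, the selector map can be flattened into the `-l` syntax. A minimal sketch: recent kubectl prints the selector as JSON (older versions print `map[app:web]`, which needs a different transform), the sed handles only flat selectors with simple values, and `web`/`frontend` are hypothetical label values:

```shell
# Turn a Service selector map, as printed by
#   kubectl get svc <name> -o jsonpath='{.spec.selector}'
# into the label-selector string that `kubectl get pods -l` expects.
to_label_selector() {
  sed -e 's/[{}" ]//g' -e 's/:/=/g'
}

echo '{"app":"web","tier":"frontend"}' | to_label_selector
# app=web,tier=frontend
```

Then `kubectl get pods -l "$(kubectl get svc <name> -o jsonpath='{.spec.selector}' | to_label_selector)"` should return exactly the pods the Service routes to; an empty result reproduces the empty-endpoints symptom.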
Check 4: Application logs
Command: kubectl logs <pod-name> --previous --tail=300 (use --previous if the current container just started after a crash)
What you're looking for: Stack traces, "connection refused", "timeout", "out of memory", "nil pointer dereference", or authentication errors to dependencies.
Common pitfall: Multi-container pods — without -c <container>, kubectl either errors or follows the kubectl.kubernetes.io/default-container annotation, which may point at a sidecar rather than your app.
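A plain grep drops the stack-trace lines that follow the matching line; a trailing-context flag keeps them. A minimal sketch on hypothetical log lines (pipe real `kubectl logs` output into the function instead):

```shell
# Keep 5 lines of context after each match so stack traces survive the filter.
error_with_trace() {
  grep -i -A 5 -E 'error|exception|panic|fatal'
}

printf '%s\n' \
  'INFO  request served in 12ms' \
  'ERROR db query failed: timeout' \
  '  at store.Query (store.go:42)' \
  '  at api.Handler (handler.go:17)' \
  | error_with_trace
# ERROR db query failed: timeout
#   at store.Query (store.go:42)
#   at api.Handler (handler.go:17)
```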
Check 5: Recent deployment
Command: kubectl rollout history deployment/<name> and kubectl get rs -l app=<name> (the ReplicaSet AGE column shows when each revision was created; rollout history output has no timestamps)
What you're looking for: Timestamp of most recent rollout vs onset of 5xx errors.
Common pitfall: Canary or partial rollouts — some pods may run old code and some new. Compare logs from old vs new pods by checking their creationTimestamp.
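Correlating rollout time with error onset is easier with a quick timestamp diff. A minimal sketch assuming GNU date (BSD/macOS date uses different flags); the timestamps are hypothetical:

```shell
# Minutes between two RFC3339 timestamps, e.g. a pod's
# .metadata.creationTimestamp vs the 5xx onset time from your dashboard.
minutes_between() {
  echo $(( ($(date -d "$2" +%s) - $(date -d "$1" +%s)) / 60 ))
}

minutes_between '2025-01-01T10:00:00Z' '2025-01-01T10:45:00Z'
# 45
```

List pod ages for the comparison with kubectl get pods -l app=<svc> --sort-by=.metadata.creationTimestamp.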
Check 6: DB/cache connectivity
Command: kubectl exec -it <app-pod> -- sh -c 'nc -zv $DB_HOST 5432; echo exit=$?'
What you're looking for: exit=0 means network connectivity is fine; non-zero means the DB is unreachable from this pod.
Common pitfall: NetworkPolicy rules may be blocking the new pod's IP if it was recently rescheduled to a different node. See networkpolicy_block.md.
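When the app image ships neither nc nor curl, bash's built-in /dev/tcp redirection gives the same reachability signal. A minimal sketch; db-service and redis-service are hypothetical hostnames, and the checks are meant to run inside the pod via kubectl exec:

```shell
# TCP reachability probe without nc, using bash's /dev/tcp redirection.
# Requires bash and coreutils `timeout` in the container image.
check_tcp() {  # usage: check_tcp <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

check_tcp db-service 5432
check_tcp redis-service 6379
```

Run it in-pod with `kubectl exec -it <pod> -- bash`, then paste the function and the checks.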
Terminal Actions
Action: Fix Service Selector
Do:
1. kubectl get svc <name> -o yaml > /tmp/svc-backup.yaml (save backup)
2. Identify correct pod labels: kubectl get pods -l app=<name> --show-labels
3. kubectl patch svc <name> -p '{"spec":{"selector":{"app":"<correct-value>"}}}' (merge patch only updates the keys you list; remove a stale selector key by setting it to null, or use kubectl edit)
4. Confirm: kubectl get endpoints <name>
Verify: Endpoints list shows pod IPs. Run a test request: curl -v http://<service-ip>/healthz
Runbook: ingress_404.md
Action: Roll Back Deployment
Do:
1. kubectl rollout undo deployment/<name>
2. Monitor: kubectl rollout status deployment/<name>
3. Confirm error rate drops in metrics dashboard
Verify: kubectl get pods -l app=<name> shows new pods all Ready; 5xx rate returns to baseline.
Runbook: helm_upgrade_failed.md
Action: Fix Readiness Probe or App Startup
Do:
1. kubectl describe pod <pod> — find the readiness probe definition and last failure message
2. Test the probe manually: kubectl exec -it <pod> -- curl -v http://localhost:<port><path>
3. If probe path is wrong, edit deployment: kubectl edit deployment <name>
4. If app is slow to start, increase initialDelaySeconds in the probe spec
Verify: Pod transitions to Ready (1/1). kubectl get pods -l app=<name>
Runbook: readiness_probe_failed.md
Action: Increase Timeout or Fix Upstream Latency
Do:
1. Identify current timeout: kubectl get ingress <name> -o yaml | grep -i timeout
2. Annotate ingress: kubectl annotate ingress <name> nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite (value is in seconds; --overwrite is required if the annotation already exists)
3. Investigate why upstream is slow — check latency metrics and slow query logs
Verify: 504 rate drops. Upstream p99 latency is within timeout.
Runbook: hpa_not_scaling.md
Action: Fix DB Connectivity / Connection Pool
Do:
1. Verify DB pod/service is up: kubectl get pods -n <db-namespace>
2. Check connection pool exhaustion in app logs: look for "too many connections" or "pool timeout"
3. If pool exhausted: restart app pods to drain idle connections, then tune max_pool_size
4. If DB is down: see PostgreSQL runbook
Verify: App logs show successful DB connections. 500 error rate drops.
Action: Scale Horizontally or Increase Limits
Do:
1. Check current HPA: kubectl get hpa
2. If no HPA, manual scale: kubectl scale deployment <name> --replicas=<N>
3. If CPU-throttled, increase limits: kubectl set resources deployment <name> --limits=cpu=500m,memory=512Mi
Verify: kubectl top pods -l app=<name> shows CPU/memory below 80% of limits.
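If step 2 keeps recurring, a declarative autoscaler avoids repeated manual scaling. A minimal sketch assuming the autoscaling/v2 API and a hypothetical deployment named web; the replica counts and CPU threshold are illustrative, not recommendations:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Apply with kubectl apply -f hpa.yaml; the CPU metric requires metrics-server in the cluster.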
Escalation: Engage App Owner for Deeper Profiling
When: 500 errors persist after rollback, resources are healthy, and logs show no clear root cause.
Who: Application development team / service owner
Include in page: Error rate (req/s), onset time, pod names currently affected, last 50 lines of app logs, recent deployment history
Edge Cases
- Intermittent 5xx on only 1 of N pods: One pod may have a corrupted local state or a different code version. Cordon and delete the specific pod rather than rolling back the whole deployment.
- 5xx only during peak traffic: Connection pool or file descriptor exhaustion. Check `ulimit -n` inside the pod and the DB's `max_connections`.
- 5xx after cert rotation: If your app validates TLS to a dependency, a new cert may not be trusted. Check cert_renewal_failed.md.
- 5xx from a Helm-deployed service after upgrade: Helm may have partially applied — check `helm status <release>` for failed hooks. See helm_upgrade_failed.md.
- 503 from service mesh (Istio): Envoy sidecar returning 503 is distinct from the app returning 503. Check `istioctl proxy-status` and `kubectl logs <pod> -c istio-proxy`.
Cross-References
- Topic Packs: k8s-services-and-ingress, k8s-ops (Probes), observability-deep-dive, networking
- Runbooks: ingress_404.md, helm_upgrade_failed.md, crashloopbackoff.md, networkpolicy_block.md, cert_renewal_failed.md