Runbook: Ingress 502 Bad Gateway¶
| Field | Value |
|---|---|
| Domain | Kubernetes |
| Alert | nginx_ingress_controller_requests{status="502"} > threshold or user reports |
| Severity | P1 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 20 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | kubectl access, cluster-admin or namespace-admin, kubeconfig configured |
Quick Assessment (30 seconds)¶
If output shows multiple ingresses affected: the ingress controller itself may be down — jump to Step 2 immediately.
If output shows a single ingress returning 502: continue with the steps below to narrow down to a backend or ingress config problem.
Step 1: Confirm 502 Scope — All Paths or Specific¶
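The scope check can be scripted against recent controller logs. A minimal sketch, assuming the standard NGINX error-log format shown in Step 2 — the `hosts_with_502` helper name and the sample lines are illustrative, not part of any tool; more than one distinct host suggests a controller-wide problem:

```shell
# Sketch: list the distinct hosts appearing in controller error lines.
# Pipe in real logs with:
#   kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=200 | hosts_with_502
hosts_with_502() {
  grep -o 'server: [^,]*' | sort -u
}

# Sample lines mirroring the controller's error-log format (illustrative only):
hosts_with_502 <<'EOF'
2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.0.1, server: api.example.com, request: "GET /orders HTTP/1.1", upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:01 [error] 29#29: *1235 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.0.2, server: shop.example.com, request: "GET /cart HTTP/1.1", upstream: "http://10.0.2.4:8080/cart"
EOF
```

The sample prints one `server: …` line per affected host (two here), which maps directly onto the decision above: two or more hosts → Step 2, one host → Step 1.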
Why: A 502 on one path means one backend service is broken. A 502 on all paths of an ingress usually means the ingress controller cannot reach any backend, which is a different problem.
# Test from inside the cluster to isolate whether it is the ingress or backend
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>
# Check the ingress rules to see which services back each path
kubectl describe ingress <INGRESS_NAME> -n <NAMESPACE>
Rules:
Host Path Backends
---- ---- --------
api.example.com
/users user-service:8080 (10.0.1.5:8080)
/orders order-service:8080 (No Endpoints!) <-- problem here
No Endpoints: The backend service has no ready pods — go to Step 4.
If all paths are affected: Go to Step 2 to check the ingress controller itself.
Step 2: Check Ingress Controller Logs¶
Why: The NGINX ingress controller logs the exact reason for each 502, including upstream connection refusals, timeouts, and connect errors. This is the fastest path to root cause.
# Find the ingress controller pod
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Tail the logs and filter for 502s and errors
kubectl logs -n ingress-nginx <INGRESS_CONTROLLER_POD> --tail=100 | grep -E "502|upstream|error|connect"
# If there are multiple controller pods, check all of them
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50 | grep -E "502|upstream|error"
2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream,
client: 10.0.0.1, server: api.example.com, request: "GET /orders HTTP/1.1", upstream: "http://10.0.1.8:8080/orders"
connect() failed (111: Connection refused): The backend pod is up but the port is wrong, or the pod is not listening on that port — continue to Step 3.
If logs show upstream timed out: The backend is reachable but slow — check application performance and see Step 6 for timeout tuning.
If the ingress controller pod is not running: Restart it with kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx.
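When the logs contain many error lines, it helps to aggregate them by failing upstream. A small sketch, again assuming the error-log format shown above — the `summarize_upstreams` helper name and the sample lines are ours:

```shell
# Sketch: count connection failures per upstream address in controller error logs.
# Pipe in real logs with:
#   kubectl logs -n ingress-nginx <INGRESS_CONTROLLER_POD> --tail=500 | summarize_upstreams
summarize_upstreams() {
  grep -o 'upstream: "[^"]*"' | sort | uniq -c | sort -rn
}

summarize_upstreams <<'EOF'
2026/03/19 10:00:00 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:01 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.8:8080/orders"
2026/03/19 10:00:02 [error] connect() failed while connecting to upstream, upstream: "http://10.0.1.9:8080/users"
EOF
```

The most frequent upstream address at the top of the output is usually the pod/port to investigate first in Step 3.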
Step 3: Check Backend Service Endpoints¶
Why: A Kubernetes Service with no Endpoints means no ready pods are backing it. The ingress controller returns 502 when it cannot forward to any endpoint.
# Check if the Service has endpoints
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
# More detail on the service
kubectl describe service <SERVICE_NAME> -n <NAMESPACE>
<none>: The Service's selector does not match any running pods — continue to Step 4.
If endpoints exist but 502 still occurs: The pods are registered but not accepting connections — check pod readiness (Step 4) and port mappings.
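For scripting this check (e.g., in a pre-escalation triage script), the `<none>` case can be detected from `--no-headers` output. A minimal sketch — the `has_endpoints` helper is illustrative, not a kubectl feature:

```shell
# Sketch: succeed (exit 0) if the Service has at least one endpoint, fail otherwise.
# Usage: kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE> --no-headers | has_endpoints
has_endpoints() {
  awk '{ exit ($2 == "<none>") ? 1 : 0 }'
}

# Sample `kubectl get endpoints --no-headers` lines (illustrative):
echo 'order-service   <none>            5m' | has_endpoints || echo "no endpoints"
echo 'user-service    10.0.1.5:8080     5m' | has_endpoints && echo "endpoints present"
```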
Step 4: Check Pod Readiness¶
Why: Kubernetes removes pods from Service endpoints when their readiness probe fails. If all pods fail readiness, the Service has no endpoints and all traffic returns 502. This is distinct from pods being down — they may be running but not ready.
# Check pod readiness status
kubectl get pods -n <NAMESPACE> -l <SERVICE_SELECTOR_LABEL>=<SERVICE_SELECTOR_VALUE> -o wide
# Check readiness probe configuration
kubectl describe deployment <DEPLOYMENT_NAME> -n <NAMESPACE> | grep -A 10 "Readiness"
# Check recent events for readiness failures
kubectl get events -n <NAMESPACE> --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20
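With many pods it is easier to filter for the not-ready ones directly. A sketch that parses the standard `kubectl get pods` columns — the `not_ready` helper name and sample output are illustrative:

```shell
# Sketch: print pods whose READY column shows fewer ready containers than expected
# (e.g. 0/1). Usage: kubectl get pods -n <NAMESPACE> -l <SELECTOR> | not_ready
not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1, $2 }'
}

not_ready <<'EOF'
NAME                    READY   STATUS    RESTARTS   AGE
order-service-7d9f-x2   0/1     Running   0          5m
user-service-5c8b-k9    1/1     Running   0          2d
EOF
```

The sample prints only `order-service-7d9f-x2 0/1` — exactly the pods that have been excluded from the Service's endpoints.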
Step 5: Check Upstream Service Health¶
Why: The backend pods may be healthy, but if they depend on a database, cache, or third-party API that is down, they will return errors that manifest as 502s.
# Test the backend service directly, bypassing the ingress
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>/<PATH>
# Check if the service's own dependencies are up
kubectl get pods -n <DEPENDENCY_NAMESPACE> | grep <DEPENDENCY_APP>
# Check logs for dependency connection errors
kubectl logs <POD_NAME> -n <NAMESPACE> --tail=100 | grep -iE "connect|refused|timeout|database|redis|unavailable"
Step 6: Check Proxy Timeouts¶
Why: If the backend service is slow (e.g., due to a slow database query or a large payload), the default NGINX ingress proxy timeouts (5s connect, 60s read, 60s send) may be too short, causing NGINX to abort the request and return an error (often 504, sometimes surfaced as 502) before the backend responds.
# Check current ingress annotations for timeout settings
kubectl get ingress <INGRESS_NAME> -n <NAMESPACE> -o yaml | grep -i timeout
# Common timeout annotations for nginx ingress:
# nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
# Patch if timeouts are too short (example: extend read timeout to 120s)
kubectl annotate ingress <INGRESS_NAME> -n <NAMESPACE> \
nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite
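To confirm a timeout theory before patching, compare the backend's measured response time against the configured proxy-read-timeout. A minimal sketch — the `exceeds_timeout` helper and the example numbers are ours, not part of kubectl or NGINX:

```shell
# Sketch: compare an observed response time (seconds) against a timeout value.
# Measure the real time with curl's write-out variable, e.g.:
#   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
#     curl -s -o /dev/null -w '%{time_total}' \
#     http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>/<PATH>
exceeds_timeout() {
  awk -v t="$1" -v limit="$2" 'BEGIN { print (t + 0 > limit + 0) ? "yes" : "no" }'
}

exceeds_timeout 73.2 60   # prints "yes": backend is slower than a 60s read timeout
exceeds_timeout 4.1 60    # prints "no": the timeout is not the problem
```

If the answer is "yes", extending the timeout (as above) buys time, but the underlying slowness should still go through Step 5.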
Verification¶
# Confirm the issue is resolved
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
Escalation¶
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 20 min | SRE on-call | "Kubernetes Ingress 502 in <NAMESPACE>, steps tried so far, current hypothesis" |
| Data loss suspected | Platform Lead | "Data loss risk: in-flight requests to <SERVICE_NAME> are failing with 502" |
| Scope expanding beyond namespace | Platform team | "Multi-namespace impact: ingress controller returning 502 for all hosts, controller may be down" |
Post-Incident¶
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add readiness probe health check for the failing dependency if that was the root cause
- Review timeout settings to ensure they match actual backend SLAs
Common Mistakes¶
- Checking ingress config when the problem is the backend pods: The ingress configuration (annotations, rules) almost never changes on its own. When a previously working ingress starts returning 502, the ingress config is rarely the problem. Go straight to checking Service endpoints (Step 3) and pod readiness (Step 4) before spending time reviewing ingress YAML.
- Ignoring readiness probe failures: A pod that is `Running` but not `Ready` has been removed from Service endpoints by design. Engineers see `Running` in `kubectl get pods` and conclude the pod is fine, then waste time debugging the ingress or network. Always check the `READY` column — `0/1` means the pod is explicitly excluded from the load balancer.
- Not testing the backend service directly: Before concluding the ingress is the problem, always test the backend service directly from inside the cluster (using `kubectl run`). This confirms whether the 502 is at the ingress layer or the backend layer, cutting diagnosis time in half.
Cross-References¶
- Survival Guide: On-Call Survival Guide (pocket card version)
- Topic Pack: Kubernetes Topics (deep background)
- Related Runbook: pod-crashloop.md — if backend pods are crashing
- Related Runbook: deploy-stuck.md — if a recent deployment broke the backend
- Related Runbook: oom-kill.md — if backend pods are being OOMKilled under load
Wiki Navigation¶
Related Content¶
- Kubernetes Services & Ingress (Topic Pack, L1) — Kubernetes Networking, Kubernetes Services & Ingress
- API Gateways & Ingress (Topic Pack, L2) — Kubernetes Networking
- Case Study: CNI Broken After Restart (Case Study, L2) — Kubernetes Networking
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Networking
- Case Study: CoreDNS Timeout Pod DNS (Case Study, L2) — Kubernetes Networking
- Case Study: Grafana Dashboard Empty — Prometheus Blocked by NetworkPolicy (Case Study, L2) — Kubernetes Networking
- Case Study: Service Mesh 503s — Envoy Misconfigured, RBAC Policy (Case Study, L2) — Kubernetes Networking
- Case Study: Service No Endpoints (Case Study, L1) — Kubernetes Networking
- Cilium & eBPF Networking (Topic Pack, L2) — Kubernetes Networking
- Deep Dive: Kubernetes Networking (deep_dive, L2) — Kubernetes Networking