
Runbook: Ingress 502 Bad Gateway

Domain: Kubernetes
Alert: nginx_ingress_controller_requests{status="502"} > threshold or user reports
Severity: P1
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 20 minutes (page if not resolved)
Last Tested: 2026-03-19
Prerequisites: kubectl access, cluster-admin or namespace-admin, kubeconfig configured

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
kubectl get ingress -A -o wide
If output shows multiple ingresses affected: the ingress controller itself may be down. Jump to Step 2 immediately.
If output shows a single ingress returning 502: continue with the steps below to narrow the problem down to a backend or ingress config issue.

Step 1: Confirm 502 Scope — All Paths or Specific

Why: A 502 on one path means one backend service is broken. A 502 on all paths of an ingress usually means the ingress controller cannot reach any backend, which is a different problem.

# Test from inside the cluster to isolate whether it is the ingress or backend
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>

# Check the ingress rules to see which services back each path
kubectl describe ingress <INGRESS_NAME> -n <NAMESPACE>
Expected output (502 response):
< HTTP/1.1 502 Bad Gateway
< Content-Type: text/html
Expected output (ingress rules):
Rules:
  Host           Path  Backends
  ----           ----  --------
  api.example.com
                 /users   user-service:8080 (10.0.1.5:8080)
                 /orders  order-service:8080 (No Endpoints!)   <-- problem here
If specific paths show No Endpoints: the backend service has no ready pods. Go to Step 4.
If all paths are affected: go to Step 2 to check the ingress controller itself.

Step 2: Check Ingress Controller Logs

Why: The NGINX ingress controller logs the exact reason for each 502, including upstream connection refusals, timeouts, and connect errors. This is the fastest path to root cause.

# Find the ingress controller pod
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Tail the logs and filter for 502s and errors
kubectl logs -n ingress-nginx <INGRESS_CONTROLLER_POD> --tail=100 | grep -E "502|upstream|error|connect"

# If there are multiple controller pods, check all of them
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50 | grep -E "502|upstream|error"
Expected output (upstream connection refused):
2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream,
  client: 10.0.0.1, server: api.example.com, request: "GET /orders HTTP/1.1", upstream: "http://10.0.1.8:8080/orders"
If logs show connect() failed (111: Connection refused): the backend pod is up but the port is wrong, or the pod is not listening on that port. Continue to Step 3.
If logs show upstream timed out: the backend is reachable but slow. Check application performance and see Step 6 for timeout tuning.
If the ingress controller pod is not running: restart it with kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx.
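The grep filter above can be sanity-checked offline. The sketch below feeds it two illustrative log lines (not taken from a live cluster) to show that only the upstream error survives the filter:

```shell
# Offline illustration of the log filter used above.
# Both lines are sample data modeled on the expected-output section.
printf '%s\n' \
  '2026/03/19 10:00:00 [error] 29#29: *1234 connect() failed (111: Connection refused) while connecting to upstream' \
  '2026/03/19 10:00:05 [notice] 29#29: worker process started' \
  | grep -E "502|upstream|error|connect"
# Only the first line matches: it contains "error", "connect", and "upstream".
```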

Step 3: Check Backend Service Endpoints

Why: A Kubernetes Service with no Endpoints means no ready pods are backing it. The ingress controller returns 502 when it cannot forward to any endpoint.

# Check if the Service has endpoints
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>

# More detail on the service
kubectl describe service <SERVICE_NAME> -n <NAMESPACE>
Expected output (healthy service with endpoints):
NAME           ENDPOINTS                         AGE
order-service  10.0.1.5:8080,10.0.1.6:8080      2d
Expected output (broken — no endpoints):
NAME           ENDPOINTS   AGE
order-service  <none>      2d
If endpoints is <none>: the Service's selector does not match any ready pods. Continue to Step 4.
If endpoints exist but 502 still occurs: the pods are registered but not accepting connections. Check pod readiness (Step 4) and port mappings.
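A frequent cause of an empty endpoints list is a selector/label mismatch. As a hypothetical illustration (the names order-service and app: order are assumptions, not taken from any real cluster), the Service selector must match the pod template labels exactly:

```yaml
# Service: selects pods by label
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order          # must match the pod template labels below
  ports:
    - port: 8080
      targetPort: 8080
---
# Deployment: pod template labels must include the Service's selector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  selector:
    matchLabels:
      app: order
  template:
    metadata:
      labels:
        app: order      # if this were e.g. "app: orders", endpoints would be <none>
    spec:
      containers:
        - name: order-service
          image: example/order-service:latest   # placeholder image
          ports:
            - containerPort: 8080
```

To spot a mismatch quickly, compare kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o jsonpath='{.spec.selector}' with the labels shown by kubectl get pods -n <NAMESPACE> --show-labels.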

Step 4: Check Pod Readiness

Why: Kubernetes removes pods from Service endpoints when their readiness probe fails. If all pods fail readiness, the Service has no endpoints and all traffic returns 502. This is distinct from pods being Down — they may be running but not ready.

# Check pod readiness status
kubectl get pods -n <NAMESPACE> -l <SERVICE_SELECTOR_LABEL>=<SERVICE_SELECTOR_VALUE> -o wide

# Check readiness probe configuration
kubectl describe deployment <DEPLOYMENT_NAME> -n <NAMESPACE> | grep -A 10 "Readiness"

# Check recent events for readiness failures
kubectl get events -n <NAMESPACE> --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20
Expected output (readiness failure events):
Warning   Unhealthy   45s   kubelet
  Readiness probe failed: HTTP probe failed with statuscode: 503
If readiness probes are failing: The application is running but not yet healthy (perhaps waiting for a database connection or a cache warmup). Check application logs:
kubectl logs <POD_NAME> -n <NAMESPACE> --tail=50
If the application is waiting on a dependency (DB, cache, upstream API): Check that dependency first — the 502 is a symptom of a deeper problem.
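For reference, a typical HTTP readiness probe looks like the sketch below. The path, port, and timings are assumptions; adjust them to the actual application:

```yaml
# Hypothetical readiness probe on the backend container (all values illustrative)
readinessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 10   # give the app time to connect to its dependencies
  periodSeconds: 5
  failureThreshold: 3       # pod leaves Service endpoints after 3 consecutive failures
```

Keep the readiness endpoint cheap: a probe that performs a full dependency check on every poll can cause pods to flap in and out of the endpoints list under load.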

Step 5: Check Upstream Service Health

Why: The backend pods may be healthy, but if they depend on a database, cache, or third-party API that is down, they will return errors that manifest as 502s.

# Test the backend service directly, bypassing the ingress
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://<SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>/<PATH>

# Check if the service's own dependencies are up
kubectl get pods -n <DEPENDENCY_NAMESPACE> | grep <DEPENDENCY_APP>

# Check logs for dependency connection errors
kubectl logs <POD_NAME> -n <NAMESPACE> --tail=100 | grep -iE "connect|refused|timeout|database|redis|unavailable"
Expected output (dependency down — logs):
Error: ECONNREFUSED - connect ECONNREFUSED 10.0.2.5:5432
If a dependency is down: resolve the dependency first. The 502s will clear automatically once the backend pods pass their readiness probes.
If the direct test fails even though dependencies are healthy: the problem may be in the application code itself. Escalate to the development team.

Step 6: Check Proxy Timeouts

Why: If the backend service is slow (e.g., due to a slow database query or large payload), the NGINX proxy timeouts (ingress-nginx defaults are typically 60s for read and send, 5s for connect) may be too short, causing NGINX to abort the connection before the backend responds. Note that a read timeout usually surfaces as 504, while a failed or aborted upstream connection surfaces as 502.

# Check current ingress annotations for timeout settings
kubectl get ingress <INGRESS_NAME> -n <NAMESPACE> -o yaml | grep -i timeout

# Common timeout annotations for nginx ingress:
# nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
# nginx.ingress.kubernetes.io/proxy-read-timeout: "60"

# Patch if timeouts are too short (example: extend read timeout to 120s)
kubectl annotate ingress <INGRESS_NAME> -n <NAMESPACE> \
  nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite
Expected output (annotation applied):
ingress.networking.k8s.io/<INGRESS_NAME> annotated
If 502s stop after extending timeouts: the backend is too slow. This is a performance issue that needs a longer-term fix; notify the development team.
If 502s continue after extending timeouts: the backend is not responding at all. Return to Steps 4 and 5.
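If the longer timeout should persist, set the annotations declaratively in the Ingress manifest rather than relying on an imperative kubectl annotate. A minimal sketch, in which the ingress name, host, path, and timeout values are assumptions to adapt:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress                    # assumed name
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
```

Committing the values to the manifest keeps them from being silently reverted the next time the Ingress is reapplied from source control.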

Verification

# Confirm the issue is resolved
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://<INGRESS_HOST>/<PROBLEM_PATH>
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
Success looks like: HTTP 200 response from the test pod, endpoints showing healthy pod IPs, and no new 502 errors in the ingress controller logs.
If still broken: escalate (see below).

Escalation

Not resolved in 20 min: page SRE on-call. Say: "Kubernetes Ingress 502 in <NAMESPACE>, ingress <INGRESS_NAME>, backend unresponsive, runbook exhausted."
Data loss suspected: page Platform Lead. Say: "Data loss risk: in-flight requests to <SERVICE_NAME> returning 502, requests not being processed."
Scope expanding beyond namespace: page Platform team. Say: "Multi-namespace impact: ingress controller returning 502 for all hosts, controller may be down."

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Add readiness probe health check for the failing dependency if that was the root cause
  • Review timeout settings to ensure they match actual backend SLAs

Common Mistakes

  1. Checking ingress config when the problem is the backend pods: The ingress configuration (annotations, rules) almost never changes on its own. When a previously working ingress starts returning 502, the ingress config is rarely the problem. Go straight to checking Service endpoints (Step 3) and pod readiness (Step 4) before spending time reviewing ingress YAML.
  2. Ignoring readiness probe failures: A pod that is Running but not Ready has been removed from Service endpoints by design. Engineers see Running in kubectl get pods and conclude the pod is fine, then waste time debugging the ingress or network. Always check the READY column — 0/1 means the pod is explicitly excluded from the load balancer.
  3. Not testing the backend service directly: Before concluding the ingress is the problem, always test the backend service directly from inside the cluster (using kubectl run). This confirms whether the 502 is at the ingress layer or the backend layer, cutting diagnosis time in half.
