Ops Archaeology: The Gateway That Returns 502
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L1 | Estimated time: 15 min | Domains: Kubernetes, Ingress, Helm, Networking
Artifact 1: CLI Output
$ kubectl get ingress -n api-platform
NAME          CLASS   HOSTS             ADDRESS      PORTS     AGE
api-gateway   nginx   api.megacorp.io   10.0.50.12   80, 443   92d
$ kubectl get pods -n api-platform -l app=api-gateway
NAME                           READY   STATUS    RESTARTS   AGE
api-gateway-6b8f9d7c45-h2k9p   1/1     Running   0          3h
api-gateway-6b8f9d7c45-m4n7q   1/1     Running   0          3h
$ kubectl get svc -n api-platform
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
api-gateway-svc   ClusterIP   10.96.117.44   <none>        8080/TCP   92d
$ kubectl get endpoints api-gateway-svc -n api-platform
NAME              ENDPOINTS                           AGE
api-gateway-svc   10.244.3.18:8080,10.244.5.22:8080   3h
Artifact 2: Metrics
# Nginx ingress controller metrics (last 5 minutes)
nginx_ingress_controller_requests{status="502",host="api.megacorp.io",path="/"} 1423
nginx_ingress_controller_requests{status="200",host="api.megacorp.io",path="/"} 0
# Backend response time (no data — backend never responds)
nginx_ingress_controller_response_duration_seconds_bucket{host="api.megacorp.io",le="+Inf"} 0
# Upstream connection errors
nginx_ingress_controller_nginx_process_connections{state="active"} 84
Artifact 3: Infrastructure Code
# From: helm/values-prod.yaml
apiGateway:
  service:
    port: 8080
    targetPort: 3000
  ingress:
    enabled: true
    className: nginx
    hosts:
      - host: api.megacorp.io
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway-svc
                port:
                  number: 8080
Artifact 4: Log Lines
[2024-12-03T16:22:14Z] nginx-ingress | 10.0.50.1 - - "GET / HTTP/1.1" 502 150 "-" "curl/8.4.0" 462 0.001 [api-platform-api-gateway-svc-8080] [] 10.244.3.18:8080 0 0.001 502
[2024-12-03T16:22:14Z] nginx-ingress | upstream connect error: connect() failed (111: Connection refused) while connecting to upstream 10.244.3.18:8080
[2024-12-03T16:22:08Z] api-gateway | {"level":"info","time":"2024-12-03T16:22:08Z","msg":"HTTP server listening","port":3000}
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?
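One way to approach the Diagnose step is to line up the port that each layer reports and see whether they agree. The sketch below is a minimal illustration, not a prescribed method: it works only on values copied by hand from the artifacts above rather than querying a cluster, and the helper names (`ip_ports`, `app_listen_port`, `collect_ports`, `endpoints_match_app`) are hypothetical.

```python
import json
import re

# Values copied by hand from the artifacts above (not read from a cluster).
HELM_SERVICE_PORT = 8080  # apiGateway.service.port in helm/values-prod.yaml
HELM_TARGET_PORT = 3000   # apiGateway.service.targetPort in the same file
ENDPOINTS_LINE = "api-gateway-svc 10.244.3.18:8080,10.244.5.22:8080 3h"
NGINX_ERROR_LINE = (
    "upstream connect error: connect() failed (111: Connection refused) "
    "while connecting to upstream 10.244.3.18:8080"
)
APP_LOG_LINE = (
    '[2024-12-03T16:22:08Z] api-gateway | {"level":"info",'
    '"time":"2024-12-03T16:22:08Z","msg":"HTTP server listening","port":3000}'
)


def ip_ports(text):
    """Return the set of ports that appear in IPv4 ip:port pairs in text."""
    return {int(port) for port in re.findall(r"\d+(?:\.\d+){3}:(\d+)", text)}


def app_listen_port(log_line):
    """Parse the JSON payload after the ' | ' separator and return its port."""
    payload = log_line.split(" | ", 1)[1]
    return int(json.loads(payload)["port"])


def collect_ports():
    """Gather the port reported by each layer into one dictionary."""
    return {
        "helm_service_port": HELM_SERVICE_PORT,
        "helm_target_port": HELM_TARGET_PORT,
        "endpoint_ports": ip_ports(ENDPOINTS_LINE),
        "nginx_upstream_ports": ip_ports(NGINX_ERROR_LINE),
        "app_listen_port": app_listen_port(APP_LOG_LINE),
    }


def endpoints_match_app(ports):
    """True when every endpoint port equals the port the app listens on."""
    return ports["endpoint_ports"] == {ports["app_listen_port"]}


if __name__ == "__main__":
    ports = collect_ports()
    for layer, value in ports.items():
        print(f"{layer}: {value}")
    print("endpoints match app port:", endpoints_match_app(ports))
```

On a live cluster the same numbers could be read directly instead of copied: `kubectl get svc api-gateway-svc -n api-platform -o yaml` shows the Service as actually deployed, and `helm get values <release> -n api-platform` shows the values the release was installed with. The release name does not appear in the artifacts, so `<release>` is a placeholder.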