Ops Archaeology: The Gateway That Returns 502

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L1
Estimated time: 15 min
Domains: Kubernetes, Ingress, Helm, Networking


Artifact 1: CLI Output

$ kubectl get ingress -n api-platform
NAME              CLASS   HOSTS                    ADDRESS        PORTS     AGE
api-gateway       nginx   api.megacorp.io          10.0.50.12     80, 443   92d

$ kubectl get pods -n api-platform -l app=api-gateway
NAME                           READY   STATUS    RESTARTS   AGE
api-gateway-6b8f9d7c45-h2k9p   1/1     Running   0          3h
api-gateway-6b8f9d7c45-m4n7q   1/1     Running   0          3h

$ kubectl get svc -n api-platform
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
api-gateway-svc   ClusterIP   10.96.117.44    <none>        8080/TCP   92d

$ kubectl get endpoints api-gateway-svc -n api-platform
NAME              ENDPOINTS                             AGE
api-gateway-svc   10.244.3.18:8080,10.244.5.22:8080     3h

Artifact 2: Metrics

# Nginx ingress controller metrics (last 5 minutes)
nginx_ingress_controller_requests{status="502",host="api.megacorp.io",path="/"} 1423
nginx_ingress_controller_requests{status="200",host="api.megacorp.io",path="/"} 0

# Backend response time (no data — backend never responds)
nginx_ingress_controller_response_duration_seconds_bucket{host="api.megacorp.io",le="+Inf"} 0

# Active client connections on the ingress controller (a gauge, not an error counter)
nginx_ingress_controller_nginx_process_connections{state="active"} 84

Artifact 3: Infrastructure Code

# From: helm/values-prod.yaml
apiGateway:
  service:
    port: 8080
    targetPort: 3000
  ingress:
    enabled: true
    className: nginx
    hosts:
      - host: api.megacorp.io
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway-svc
                port:
                  number: 8080

Artifact 4: Log Lines

[2024-12-03T16:22:14Z] nginx-ingress | 10.0.50.1 - - "GET / HTTP/1.1" 502 150 "-" "curl/8.4.0" 462 0.001 [api-platform-api-gateway-svc-8080] [] 10.244.3.18:8080 0 0.001 502
[2024-12-03T16:22:14Z] nginx-ingress | upstream connect error: connect() failed (111: Connection refused) while connecting to upstream 10.244.3.18:8080
[2024-12-03T16:22:08Z] api-gateway  | {"level":"info","time":"2024-12-03T16:22:08Z","msg":"HTTP server listening","port":3000}

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?
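
For the "what would you check first" part, one possible starting point is to cross-check the port nginx dials against the port the application reports binding to. The sketch below is a hypothetical helper, not part of the original exercise. It parses the two relevant log lines from Artifact 4 with the standard library only. The function names `upstream_port` and `listening_port` are assumptions made for illustration, and the regular expression assumes the log formats look exactly like the ones shown above.

```python
import json
import re

# Log lines copied verbatim from Artifact 4.
NGINX_ERROR = (
    "[2024-12-03T16:22:14Z] nginx-ingress | upstream connect error: connect() failed "
    "(111: Connection refused) while connecting to upstream 10.244.3.18:8080"
)
APP_LOG = (
    "[2024-12-03T16:22:08Z] api-gateway  | "
    '{"level":"info","time":"2024-12-03T16:22:08Z","msg":"HTTP server listening","port":3000}'
)


def upstream_port(line: str) -> int:
    """Return the port nginx tried to reach, read from the trailing 'upstream IP:PORT'."""
    match = re.search(r"upstream (\d+\.\d+\.\d+\.\d+):(\d+)\s*$", line)
    if match is None:
        raise ValueError("no upstream address found in line")
    return int(match.group(2))


def listening_port(line: str) -> int:
    """Return the port the application reports, read from its JSON log payload."""
    # Everything after the first '|' is assumed to be a single JSON object.
    payload = line.split("|", 1)[1].strip()
    return int(json.loads(payload)["port"])


if __name__ == "__main__":
    dialed = upstream_port(NGINX_ERROR)
    bound = listening_port(APP_LOG)
    print(f"nginx dials port {dialed}; application reports port {bound}")
    if dialed != bound:
        print("ports differ: compare the Service targetPort with the container port")
```

On a live cluster you would feed this from `kubectl logs` output instead of hard-coded strings. Any difference it reports is only a lead to follow up against the Service and Helm values, not a confirmed root cause.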