
Ops Archaeology: The Requests That Vanish

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L2
Estimated time: 25 min
Domains: Kubernetes, Service Mesh, Observability


Artifact 1: CLI Output

$ kubectl get pods -n checkout -l app=payment-service
NAME                               READY   STATUS    RESTARTS   AGE
payment-service-5d8f7a9b34-g4h2j   2/2     Running   0          6d
payment-service-5d8f7a9b34-k8m3n   2/2     Running   0          6d
payment-service-5d8f7a9b34-p1q5r   2/2     Running   0          6d

$ kubectl get pods -n checkout -l app=order-service
NAME                              READY   STATUS    RESTARTS   AGE
order-service-7c4e6f8a21-b9d3f    2/2     Running   0          6d
order-service-7c4e6f8a21-t5v7w    2/2     Running   0          6d

$ kubectl top pods -n checkout
NAME                               CPU(cores)   MEMORY(bytes)
order-service-7c4e6f8a21-b9d3f     45m          128Mi
order-service-7c4e6f8a21-t5v7w     42m          131Mi
payment-service-5d8f7a9b34-g4h2j   380m         256Mi
payment-service-5d8f7a9b34-k8m3n   12m          89Mi
payment-service-5d8f7a9b34-p1q5r   15m          92Mi

Artifact 2: Metrics

# Istio sidecar metrics for payment-service (last 15 minutes)

# Request volume by pod
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-g4h2j",response_code="200"} 12847
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-k8m3n",response_code="200"} 94
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-p1q5r",response_code="200"} 107

# Latency percentiles (across all pods)
istio_request_duration_milliseconds{destination_workload="payment-service",quantile="0.5"} 45
istio_request_duration_milliseconds{destination_workload="payment-service",quantile="0.99"} 8720

# Error rate from order-service calling payment-service
istio_requests_total{source_workload="order-service",destination_workload="payment-service",response_code="504"} 342
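
Raw counters are hard to compare by eye. A minimal sketch that turns the numbers above into per-pod traffic shares and a rough 504 ratio might look like this. The variable and function names are illustrative, and the ratio assumes the 504s are counted separately from the 200s listed above, which the artifact does not confirm.

```python
# Counter values copied from Artifact 2 (15-minute window).
requests_200 = {
    "payment-service-5d8f7a9b34-g4h2j": 12847,
    "payment-service-5d8f7a9b34-k8m3n": 94,
    "payment-service-5d8f7a9b34-p1q5r": 107,
}
gateway_timeouts_504 = 342  # order-service -> payment-service


def traffic_shares(counts):
    """Return each pod's fraction of the total request count."""
    total = sum(counts.values())
    return {pod: n / total for pod, n in counts.items()}


shares = traffic_shares(requests_200)
for pod, share in shares.items():
    print(f"{pod}: {share:.1%}")

# Assumption: the 504s are not included in the 200 counts above,
# so the denominator is the sum of both.
total_200 = sum(requests_200.values())
timeout_ratio = gateway_timeouts_504 / (total_200 + gateway_timeouts_504)
print(f"504 ratio: {timeout_ratio:.1%}")
```

Compare the shares against what an even split across three replicas would give.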

Artifact 3: Infrastructure Code

# From: istio/destination-rule-payment.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: checkout
spec:
  host: payment-service.checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 1
        http2MaxRequests: 10
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 180s
      maxEjectionPercent: 100

Artifact 4: Log Lines

[2024-11-28T14:33:41Z] payment-svc/g4h2j | WARN  Processing payment txn-88291 took 8.4s  upstream bank API slow
[2024-11-28T14:33:45Z] order-svc/b9d3f   | ERROR context deadline exceeded: POST /api/v1/payments/charge timeout=5s
[2024-11-28T14:33:02Z] istio-proxy/k8m3n  | [2024-11-28T14:33:02.441Z] "POST /api/v1/payments/charge HTTP/2" 200 - via_upstream - "-" 0 284 42 41 "-" "order-service/2.1.0" "req-id-7f3a"
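
The first two log lines each carry a duration. One way to pull them out and compare them is sketched below. The regular expressions assume the `took <n>s` and `timeout=<n>s` wording seen in these two lines holds for the rest of the logs, which is not confirmed by the artifacts.

```python
import re

# First two lines copied from Artifact 4.
log_lines = [
    "[2024-11-28T14:33:41Z] payment-svc/g4h2j | WARN  Processing payment txn-88291 took 8.4s  upstream bank API slow",
    "[2024-11-28T14:33:45Z] order-svc/b9d3f   | ERROR context deadline exceeded: POST /api/v1/payments/charge timeout=5s",
]

# Assumed patterns, based only on the wording of the two lines above.
TOOK = re.compile(r"took (\d+(?:\.\d+)?)s")
TIMEOUT = re.compile(r"timeout=(\d+(?:\.\d+)?)s")


def first_match(pattern, lines):
    """Return the first captured duration in seconds, or None if absent."""
    for line in lines:
        match = pattern.search(line)
        if match:
            return float(match.group(1))
    return None


processing_s = first_match(TOOK, log_lines)
timeout_s = first_match(TIMEOUT, log_lines)
print(f"processing={processing_s}s timeout={timeout_s}s "
      f"exceeds_timeout={processing_s > timeout_s}")
```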

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?