Ops Archaeology: The Requests That Vanish
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L2 | Estimated time: 25 min | Domains: Kubernetes, Service Mesh, Observability
Artifact 1: CLI Output
$ kubectl get pods -n checkout -l app=payment-service
NAME                               READY   STATUS    RESTARTS   AGE
payment-service-5d8f7a9b34-g4h2j   2/2     Running   0          6d
payment-service-5d8f7a9b34-k8m3n   2/2     Running   0          6d
payment-service-5d8f7a9b34-p1q5r   2/2     Running   0          6d

$ kubectl get pods -n checkout -l app=order-service
NAME                             READY   STATUS    RESTARTS   AGE
order-service-7c4e6f8a21-b9d3f   2/2     Running   0          6d
order-service-7c4e6f8a21-t5v7w   2/2     Running   0          6d

$ kubectl top pods -n checkout
NAME                               CPU(cores)   MEMORY(bytes)
order-service-7c4e6f8a21-b9d3f     45m          128Mi
order-service-7c4e6f8a21-t5v7w     42m          131Mi
payment-service-5d8f7a9b34-g4h2j   380m         256Mi
payment-service-5d8f7a9b34-k8m3n   12m          89Mi
payment-service-5d8f7a9b34-p1q5r   15m          92Mi
Artifact 2: Metrics
# Istio sidecar metrics for payment-service (last 15 minutes)
# Request volume by pod
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-g4h2j",response_code="200"} 12847
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-k8m3n",response_code="200"} 94
istio_requests_total{destination_workload="payment-service",destination_pod="payment-service-5d8f7a9b34-p1q5r",response_code="200"} 107
# Latency percentiles (across all pods)
istio_request_duration_milliseconds{destination_workload="payment-service",quantile="0.5"} 45
istio_request_duration_milliseconds{destination_workload="payment-service",quantile="0.99"} 8720
# Error rate from order-service calling payment-service
istio_requests_total{source_workload="order-service",destination_workload="payment-service",response_code="504"} 342
Artifact 3: Infrastructure Code
# From: istio/destination-rule-payment.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: checkout
spec:
  host: payment-service.checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 1
        http2MaxRequests: 10
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 180s
      maxEjectionPercent: 100
Artifact 4: Log Lines
[2024-11-28T14:33:41Z] payment-svc/g4h2j | WARN Processing payment txn-88291 took 8.4s — upstream bank API slow
[2024-11-28T14:33:45Z] order-svc/b9d3f | ERROR context deadline exceeded: POST /api/v1/payments/charge timeout=5s
[2024-11-28T14:33:02Z] istio-proxy/k8m3n | [2024-11-28T14:33:02.441Z] "POST /api/v1/payments/charge HTTP/2" 200 - via_upstream - "-" 0 284 42 41 "-" "order-service/2.1.0" "req-id-7f3a"
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?
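One possible first check for the Diagnose step is to turn the raw request counters from Artifact 2 into per-pod traffic shares, so any imbalance between replicas shows up as a number rather than an impression. The following is a minimal Python sketch, not part of the original exercise: the counts are copied by hand from the metrics above, and the names `traffic_shares` and `REQUESTS_200` are hypothetical helpers introduced only for illustration.

```python
def traffic_shares(counts):
    """Return each pod's fraction of the total request count.

    `counts` maps pod name -> number of requests (hypothetical helper,
    not an Istio or Kubernetes API).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests recorded")
    return {pod: n / total for pod, n in counts.items()}


# HTTP 200 counts per payment-service pod, copied from Artifact 2
# (last 15 minutes).
REQUESTS_200 = {
    "payment-service-5d8f7a9b34-g4h2j": 12847,
    "payment-service-5d8f7a9b34-k8m3n": 94,
    "payment-service-5d8f7a9b34-p1q5r": 107,
}

shares = traffic_shares(REQUESTS_200)

# Print pods from busiest to quietest with their share of traffic.
for pod, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pod}: {share:.1%}")
```

With three replicas, an even spread would put each pod near one third of the traffic, so comparing the printed shares against that baseline is a quick way to decide whether load distribution deserves a closer look alongside the CPU figures in Artifact 1.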