# Ops Archaeology: The DR That Looks Ready But Isn't
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
**Difficulty:** L3 · **Estimated time:** 40 min · **Domains:** Multi-Cluster, Route53, Database, Disaster Recovery
## Artifact 1: CLI Output
```text
$ kubectl --context=dr-west get pods -n platform
NAME                          READY   STATUS    RESTARTS   AGE
api-server-6b8f9a7c43-d2f4g   1/1     Running   0          90d
api-server-6b8f9a7c43-h5j7k   1/1     Running   0          90d
api-server-6b8f9a7c43-m8n1p   1/1     Running   0          90d
worker-5d7e8f9a12-q3r5s       1/1     Running   0          90d
worker-5d7e8f9a12-t6u8v       1/1     Running   0          90d

$ kubectl --context=dr-west get svc -n platform
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP                                  PORT(S)         AGE
api-server     LoadBalancer   10.100.42.18   a1b2c3-dr-west.us-west-2.elb.amazonaws.com   443:31892/TCP   90d
health-check   ClusterIP      10.100.42.99   <none>                                       8080/TCP        90d

$ kubectl --context=dr-west exec -n platform deploy/api-server -- curl -s http://localhost:8080/health
{"status":"ok","database":"connected","cache":"connected","timestamp":"2024-12-18T14:22:03Z"}

$ kubectl --context=dr-west exec -n platform deploy/api-server -- curl -s http://localhost:8080/ready
{"status":"ok","checks":{"database":true,"cache":true,"queue":true}}
```
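When you inherit endpoints like these, it helps to be precise about what "passing" means. A minimal sketch (assuming only the JSON shape captured above; `is_ready` is a hypothetical helper, not part of the service) that parses a `/ready` payload and reports whether every subsystem check passed:

```python
import json

# The /ready payload exactly as captured from the pod above.
ready_payload = '{"status":"ok","checks":{"database":true,"cache":true,"queue":true}}'

def is_ready(payload: str) -> bool:
    """True only if status is ok AND every subsystem check passed."""
    doc = json.loads(payload)
    return doc.get("status") == "ok" and all(doc.get("checks", {}).values())

print(is_ready(ready_payload))  # True for the payload captured above
```

Note what this does *not* prove: a `curl` run inside the pod against `localhost:8080` only exercises the pod's local view, not the path external clients (or an external health checker) would take through the load balancer.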
## Artifact 2: Metrics
```text
# DR cluster metrics (scraped locally in dr-west)

# API server request rate (last 24h)
http_requests_total{cluster="dr-west",handler="/health"}        86400
http_requests_total{cluster="dr-west",handler="/ready"}         86400
http_requests_total{cluster="dr-west",handler="/api/v1/orders"} 0
http_requests_total{cluster="dr-west",handler="/api/v1/users"}  0

# Database connection pool
db_pool_active_connections{cluster="dr-west"} 2
db_pool_idle_connections{cluster="dr-west"}   8
db_pool_max_connections{cluster="dr-west"}    50
db_pool_wait_count_total{cluster="dr-west"}   0

# Database query latency
db_query_duration_seconds{cluster="dr-west",query="health_check",quantile="0.99"}  0.003
db_query_duration_seconds{cluster="dr-west",query="select_orders",quantile="0.99"} NaN
```
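The request counters are worth a second look: 86,400 requests over 24 hours is exactly one per second, the signature of a synthetic probe rather than human traffic. A small sketch (values copied from the metrics above) that separates probe-only handlers from ones seeing real traffic:

```python
# Request totals over the last 24h, as scraped in dr-west (copied from above).
requests = {
    "/health": 86400,
    "/ready": 86400,
    "/api/v1/orders": 0,
    "/api/v1/users": 0,
}

# Handlers with zero requests have never served a real client in this window.
zero_traffic = sorted(h for h, n in requests.items() if n == 0)
print(zero_traffic)  # ['/api/v1/orders', '/api/v1/users']

# 86400 requests / 86400 seconds in a day = 1 req/s: probe cadence, not users.
probe_rate = requests["/health"] / (24 * 60 * 60)
print(probe_rate)  # 1.0
```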
## Artifact 3: Infrastructure Code
```hcl
# From: terraform/modules/dns/failover.tf

resource "aws_route53_health_check" "dr_west" {
  fqdn              = "a1b2c3-dr-west.us-west-2.elb.amazonaws.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "dr-west-health-check"
  }
}

resource "aws_route53_record" "api_failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.megacorp.io"
  type    = "A"

  alias {
    name                   = "a1b2c3-dr-west.us-west-2.elb.amazonaws.com"
    zone_id                = "Z1H1FL5HABSF5"
    evaluate_target_health = false
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier  = "dr-west"
  health_check_id = aws_route53_health_check.dr_west.id
}
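One number worth deriving from this config: how long Route53 would take to notice the secondary going unhealthy. A back-of-the-envelope sketch using the `failure_threshold` and `request_interval` values above (this is an approximation — Route53 probes from multiple checker locations, so real detection time varies):

```python
# Values taken directly from aws_route53_health_check.dr_west above.
failure_threshold = 3   # consecutive failed checks before "unhealthy"
request_interval = 30   # seconds between checks

# Rough worst-case window before Route53 marks the endpoint unhealthy.
detection_seconds = failure_threshold * request_interval
print(detection_seconds)  # 90
```

Separately, note what the health check is *not* measuring: it probes `/health` on the ELB, and the alias has `evaluate_target_health = false` — details worth keeping in mind for the diagnosis step.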
## Artifact 4: Log Lines
```text
[2024-12-18T14:22:03Z] api-server/dr-west | INFO  Health check passed: database=connected cache=connected
[2024-12-18T14:21:58Z] api-server/dr-west | ERROR Failed to execute query on orders table: connection to 10.0.50.11:5432 refused (pg-primary-east)
[2024-12-15T09:00:01Z] terraform/dr-west  | aws_db_instance.replica_west: Still creating... [45m elapsed]
```
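The two api-server lines are five seconds apart, yet one is a passing health check and the other a refused database connection — they are clearly not exercising the same dependency. A hedged parsing sketch (the regex matches only the `[ts] source | LEVEL msg` shape shown above; the pattern is an assumption, not the service's actual log format):

```python
import re

# The two api-server lines from the artifact above, verbatim.
logs = [
    '[2024-12-18T14:22:03Z] api-server/dr-west | INFO Health check passed: database=connected cache=connected',
    '[2024-12-18T14:21:58Z] api-server/dr-west | ERROR Failed to execute query on orders table: connection to 10.0.50.11:5432 refused (pg-primary-east)',
]

# Matches "[timestamp] source | LEVEL message".
pattern = re.compile(r'^\[(?P<ts>[^\]]+)\] (?P<src>\S+) \| (?P<level>\w+)\s+(?P<msg>.*)$')
parsed = [pattern.match(line).groupdict() for line in logs]

# Same source, seconds apart, opposite verdicts: the health probe and the
# orders query are hitting different database targets.
levels = sorted({p["level"] for p in parsed})
print(levels)  # ['ERROR', 'INFO']
```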
## Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?