# Ops Archaeology: The DR That Looks Ready But Isn't
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
**Difficulty:** L3 · **Estimated time:** 40 min · **Domains:** Multi-Cluster, Route53, Database, Disaster Recovery
## Artifact 1: CLI Output
```text
$ kubectl --context=dr-west get pods -n platform
NAME                          READY   STATUS    RESTARTS   AGE
api-server-6b8f9a7c43-d2f4g   1/1     Running   0          90d
api-server-6b8f9a7c43-h5j7k   1/1     Running   0          90d
api-server-6b8f9a7c43-m8n1p   1/1     Running   0          90d
worker-5d7e8f9a12-q3r5s       1/1     Running   0          90d
worker-5d7e8f9a12-t6u8v       1/1     Running   0          90d

$ kubectl --context=dr-west get svc -n platform
NAME           TYPE           CLUSTER-IP     EXTERNAL-IP                                  PORT(S)         AGE
api-server     LoadBalancer   10.100.42.18   a1b2c3-dr-west.us-west-2.elb.amazonaws.com   443:31892/TCP   90d
health-check   ClusterIP      10.100.42.99   <none>                                       8080/TCP        90d

$ kubectl --context=dr-west exec -n platform deploy/api-server -- curl -s http://localhost:8080/health
{"status":"ok","database":"connected","cache":"connected","timestamp":"2024-12-18T14:22:03Z"}

$ kubectl --context=dr-west exec -n platform deploy/api-server -- curl -s http://localhost:8080/ready
{"status":"ok","checks":{"database":true,"cache":true,"queue":true}}
```
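When you inherit endpoints like these, it helps to be precise about what "passing" means. A minimal sketch (assuming only the JSON shape captured above; `is_ready` is a hypothetical helper, not part of the service) that parses a `/ready` payload and reports whether every subsystem check passed:

```python
import json

# The /ready payload exactly as captured from the pod above.
ready_payload = '{"status":"ok","checks":{"database":true,"cache":true,"queue":true}}'

def is_ready(payload: str) -> bool:
    """True only if status is ok AND every subsystem check passed."""
    doc = json.loads(payload)
    return doc.get("status") == "ok" and all(doc.get("checks", {}).values())

print(is_ready(ready_payload))  # True for the payload captured above
```

Note what this does *not* prove: a `curl` run inside the pod against `localhost:8080` only exercises the pod's local view, not the path external clients (or an external health checker) would take through the load balancer.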
## Artifact 2: Metrics
```text
# DR cluster metrics (scraped locally in dr-west)

# API server request rate (last 24h)
http_requests_total{cluster="dr-west",handler="/health"}        86400
http_requests_total{cluster="dr-west",handler="/ready"}         86400
http_requests_total{cluster="dr-west",handler="/api/v1/orders"} 0
http_requests_total{cluster="dr-west",handler="/api/v1/users"}  0

# Database connection pool
db_pool_active_connections{cluster="dr-west"} 2
db_pool_idle_connections{cluster="dr-west"}   8
db_pool_max_connections{cluster="dr-west"}    50
db_pool_wait_count_total{cluster="dr-west"}   0

# Database query latency
db_query_duration_seconds{cluster="dr-west",query="health_check",quantile="0.99"}  0.003
db_query_duration_seconds{cluster="dr-west",query="select_orders",quantile="0.99"} NaN
```
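The request counters are worth a second look: 86,400 requests over 24 hours is exactly one per second, the signature of a synthetic probe rather than human traffic. A small sketch (values copied from the metrics above) that separates probe-only handlers from ones seeing real traffic:

```python
# Request totals over the last 24h, as scraped in dr-west (copied from above).
requests = {
    "/health": 86400,
    "/ready": 86400,
    "/api/v1/orders": 0,
    "/api/v1/users": 0,
}

# Handlers with zero requests have never served a real client in this window.
zero_traffic = sorted(h for h, n in requests.items() if n == 0)
print(zero_traffic)  # ['/api/v1/orders', '/api/v1/users']

# 86400 requests / 86400 seconds in a day = 1 req/s: probe cadence, not users.
probe_rate = requests["/health"] / (24 * 60 * 60)
print(probe_rate)  # 1.0
```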
## Artifact 3: Infrastructure Code
```hcl
# From: terraform/modules/dns/failover.tf

resource "aws_route53_health_check" "dr_west" {
  fqdn              = "a1b2c3-dr-west.us-west-2.elb.amazonaws.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "dr-west-health-check"
  }
}

resource "aws_route53_record" "api_failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.megacorp.io"
  type    = "A"

  alias {
    name                   = "a1b2c3-dr-west.us-west-2.elb.amazonaws.com"
    zone_id                = "Z1H1FL5HABSF5"
    evaluate_target_health = false
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier  = "dr-west"
  health_check_id = aws_route53_health_check.dr_west.id
}
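One number worth deriving from this config: how long Route53 would take to notice the secondary going unhealthy. A back-of-the-envelope sketch using the `failure_threshold` and `request_interval` values above (this is an approximation — Route53 probes from multiple checker locations, so real detection time varies):

```python
# Values taken directly from aws_route53_health_check.dr_west above.
failure_threshold = 3   # consecutive failed checks before "unhealthy"
request_interval = 30   # seconds between checks

# Rough worst-case window before Route53 marks the endpoint unhealthy.
detection_seconds = failure_threshold * request_interval
print(detection_seconds)  # 90
```

Separately, note what the health check is *not* measuring: it probes `/health` on the ELB, and the alias has `evaluate_target_health = false` — details worth keeping in mind for the diagnosis step.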
## Artifact 4: Log Lines
```text
[2024-12-18T14:22:03Z] api-server/dr-west | INFO  Health check passed: database=connected cache=connected
[2024-12-18T14:21:58Z] api-server/dr-west | ERROR Failed to execute query on orders table: connection to 10.0.50.11:5432 refused (pg-primary-east)
[2024-12-15T09:00:01Z] terraform/dr-west  | aws_db_instance.replica_west: Still creating... [45m elapsed]
```
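The two api-server lines are five seconds apart, yet one is a passing health check and the other a refused database connection — they are clearly not exercising the same dependency. A hedged parsing sketch (the regex matches only the `[ts] source | LEVEL msg` shape shown above; the pattern is an assumption, not the service's actual log format):

```python
import re

# The two api-server lines from the artifact above, verbatim.
logs = [
    '[2024-12-18T14:22:03Z] api-server/dr-west | INFO Health check passed: database=connected cache=connected',
    '[2024-12-18T14:21:58Z] api-server/dr-west | ERROR Failed to execute query on orders table: connection to 10.0.50.11:5432 refused (pg-primary-east)',
]

# Matches "[timestamp] source | LEVEL message".
pattern = re.compile(r'^\[(?P<ts>[^\]]+)\] (?P<src>\S+) \| (?P<level>\w+)\s+(?P<msg>.*)$')
parsed = [pattern.match(line).groupdict() for line in logs]

# Same source, seconds apart, opposite verdicts: the health probe and the
# orders query are hitting different database targets.
levels = sorted({p["level"] for p in parsed})
print(levels)  # ['ERROR', 'INFO']
```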
## Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?