
Answer Key: The DR That Looks Ready But Isn't

The System

A multi-region active-passive disaster recovery setup for a platform API:

                    [Route53: api.megacorp.io]
                    Failover routing policy
                   /                        \
           PRIMARY                      SECONDARY
        [us-east cluster]            [us-west (dr-west) cluster]
              |                              |
        [api-server pods]            [api-server pods (3)]
        [worker pods]                [worker pods (2)]
              |                              |
        [pg-primary-east]            [pg-replica-west (read-only)]
        10.0.50.11:5432              local read replica
              |                              |
         Read + Write               Read only (no promotion config)
              |
        Health check: /health        Health check: /health
        (checks local DB conn)       (checks local DB conn -- PASSES)

Route53 sends traffic to us-east (primary). If the primary health check fails, Route53 fails over to us-west (secondary). The health check hits /health on the ELB, which returns 200 as long as the API can connect to any database.
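The failover decision can be sketched as a toy model (plain Python with hypothetical names, not AWS code): Route53 serves the primary while its health check passes and fails over otherwise, so a secondary whose /health passes for the wrong reasons will happily receive traffic it cannot serve.

```python
# Toy model of a Route53 failover routing policy (illustrative only).
def route(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Failover policy: primary while healthy, else secondary if healthy."""
    if primary_healthy:
        return "us-east"
    if secondary_healthy:
        return "us-west"
    return "no-healthy-endpoint"

# Normal operation: traffic goes to the primary.
assert route(primary_healthy=True, secondary_healthy=True) == "us-east"

# Primary outage: Route53 fails over -- but "healthy" here only means
# /health returned 200, which in this setup says nothing about writes.
assert route(primary_healthy=False, secondary_healthy=True) == "us-west"
```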

What's Broken

Root cause: Multiple compounding failures make DR non-functional:

  1. Health check tests the wrong thing. The /health endpoint reports database: connected by testing the local read replica, not the primary write database. So /health returns 200 even though the DR cluster cannot perform write operations.

  2. Application is hardcoded to the primary. The API server is configured to write to pg-primary-east (10.0.50.11), which lives in us-east. From the DR cluster in us-west this connection is refused (cross-region with no peering, or a security group blocks it). Reads may work against the local replica, but writes fail.

  3. The Route53 alias record has evaluate_target_health = false. The alias record never evaluates the ELB's target health, so unhealthy targets behind the ELB would not affect DNS failover at all.

  4. The read replica has no promotion path. The Terraform log shows the replica was still being created (45 minutes elapsed), and there is no promotion runbook or automated failover configured.

  5. The DR path has never carried real traffic. Zero requests to /api/v1/orders and /api/v1/users means DR has never been validated end to end.

Key clue: The API server error log shows "connection to 10.0.50.11:5432 refused (pg-primary-east)" — the DR application tries to connect to the primary database in the east region and fails. But /health returns OK because it checks a different connection.
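That contradiction can be reduced to a minimal sketch (hypothetical hosts mirroring the diagram above): the health path and the write path test different connections, so one can be green while the other is refused.

```python
# Sketch of the contradiction: health checks the local replica, writes
# target the hardcoded east primary. Hostnames here are illustrative.
REACHABLE = {"pg-replica-west.local"}   # local replica: reachable from us-west
PRIMARY = "10.0.50.11"                  # pg-primary-east: connection refused

def can_connect(host: str) -> bool:
    return host in REACHABLE

def health() -> int:
    # /health only checks the LOCAL replica connection...
    return 200 if can_connect("pg-replica-west.local") else 503

def write_order() -> None:
    # ...while the write path is hardcoded to the east primary.
    if not can_connect(PRIMARY):
        raise ConnectionRefusedError(f"connection to {PRIMARY}:5432 refused (pg-primary-east)")

assert health() == 200      # monitoring is green
try:
    write_order()           # real traffic fails
    write_ok = True
except ConnectionRefusedError:
    write_ok = False
assert write_ok is False    # green status, broken write path
```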

The Fix

Immediate (make health check accurate)

  1. Update the health check endpoint to test write capability:

    from fastapi.responses import JSONResponse  # needed for the 503 response

    @app.get("/health")
    async def health():
        # Test the actual write path, not just a connection to the local replica
        try:
            await db.execute("SELECT 1 FROM orders LIMIT 1")    # read path (replica)
            write_ok = await test_write_to_primary()            # app-specific write probe
        except Exception as e:
            return JSONResponse({"status": "degraded", "database": str(e)}, status_code=503)
        return {"status": "ok", "read": True, "write": write_ok}
    

  2. Or use a dedicated health check path for Route53:

    resource "aws_route53_health_check" "dr_west" {
      type          = "HTTPS"            # required argument
      fqdn          = "api.megacorp.io"  # hypothetical -- point at the DR-facing hostname
      resource_path = "/ready"           # readiness check that tests full functionality
    }
    
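A write-testing readiness probe can be sketched like this, using a stand-in FakeDB in place of a real asyncpg-style pool and a hypothetical dr_heartbeat table: a read-only replica rejects the INSERT, so the probe returns 503 exactly when failover would not actually work.

```python
import asyncio

class FakeDB:
    """Stand-in for an asyncpg-style pool; a read-only replica rejects writes."""
    def __init__(self, read_only: bool):
        self.read_only = read_only

    async def execute(self, sql: str, *args):
        if self.read_only and sql.lstrip().upper().startswith("INSERT"):
            raise RuntimeError("cannot execute INSERT in a read-only transaction")

async def ready(db) -> int:
    """Readiness status: 200 only if a real write round-trips."""
    try:
        # dr_heartbeat is a hypothetical probe table; the INSERT fails fast
        # if the app can only reach a read-only replica.
        await db.execute("INSERT INTO dr_heartbeat (probed_at) VALUES (now())")
    except Exception:
        return 503
    return 200

# Against a writable primary the probe passes; against the DR replica it fails.
assert asyncio.run(ready(FakeDB(read_only=False))) == 200
assert asyncio.run(ready(FakeDB(read_only=True))) == 503
```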

Permanent (fix DR architecture)

  1. Database failover: Configure the read replica as a promotion candidate:

    resource "aws_db_instance" "replica_west" {
      replicate_source_db = aws_db_instance.primary_east.arn  # cross-region replicas reference the source ARN
      multi_az            = true  # standby for the replica itself; this does NOT auto-promote --
                                  # promotion still requires `aws rds promote-read-replica` (runbook it)
    }
    

  2. Application configuration: Use DNS-based database endpoints that switch during failover:

    env:
      - name: DB_HOST
        value: "db.platform.internal"   # Route53 private zone, not hardcoded IP
    

  3. Enable target health evaluation:

    alias {
      evaluate_target_health = true   # Was false
    }
    

  4. Regular DR testing: Schedule synthetic traffic to the DR cluster to validate the full path.

Verification

# Test health endpoint returns accurate status
kubectl --context=dr-west exec -n platform deploy/api-server -- \
  curl -s http://localhost:8080/health

# Test write path explicitly
kubectl --context=dr-west exec -n platform deploy/api-server -- \
  curl -s -X POST http://localhost:8080/api/v1/test-write

# Check Route53 health check status
aws route53 get-health-check-status --health-check-id HC-XXXXX

# Verify database replica status
aws rds describe-db-instances --db-instance-identifier replica-west \
  --query 'DBInstances[0].{Status:DBInstanceStatus,ReplicaLag:StatusInfos}'

Artifact Decoder

  CLI Output
    Revealed: all pods Running, /health returns OK, /ready returns OK -- everything looks green.
    Misleading: 1/1 Running and healthy responses hide the fact that the system cannot serve real traffic.

  Metrics
    Revealed: zero real API requests and NaN query duration for orders -- DR never tested.
    Misleading: 86,400 health check requests make the cluster look active; 2 DB connections look normal.

  IaC Snippet
    Revealed: evaluate_target_health = false plus a health check on /health -- incomplete validation.
    Misleading: the Route53 failover config looks textbook-correct at first glance.

  Log Lines
    Revealed: "connection to pg-primary-east refused" -- the write path is broken.
    Misleading: the "passed" health check log line directly contradicts the error log from the same pod.

Skills Demonstrated

  • Evaluating disaster recovery readiness beyond surface-level health checks
  • Understanding Route53 failover routing and health check semantics
  • Recognizing the difference between read path and write path in database architectures
  • Identifying the gap between monitoring green status and actual operational capability
  • Designing health checks that test what actually matters

Prerequisite Topic Packs