Answer Key: The DR That Looks Ready But Isn't¶
The System¶
A multi-region active-passive disaster recovery setup for a platform API:
```
                [Route53: api.megacorp.io]
                 Failover routing policy
                  /                    \
             PRIMARY                 SECONDARY
      [us-east cluster]       [us-west (dr-west) cluster]
             |                          |
      [api-server pods]        [api-server pods (3)]
      [worker pods]            [worker pods (2)]
             |                          |
      [pg-primary-east]        [pg-replica-west (read-only)]
       10.0.50.11:5432          local read replica
             |                          |
       Read + Write             Read only (no promotion config)
             |                          |
  Health check: /health        Health check: /health
  (checks local DB conn)       (checks local DB conn -- PASSES)
```
Route53 sends traffic to us-east (primary). If the primary health check fails, Route53 fails over to us-west (secondary). The health check hits /health on the ELB, which returns 200 as long as the API can connect to any database.
What's Broken¶
Root cause: Multiple compounding failures make DR non-functional:
- **Health check tests the wrong thing.** The `/health` endpoint checks `database: connected` against the local read replica, not the primary write database, so `/health` returns 200 even though the DR cluster cannot perform write operations.
- **Application is hardcoded to the primary.** The API server is configured to write to `pg-primary-east` (10.0.50.11), which is in us-east. From the DR cluster in us-west, that connection is refused (cross-region, no peering, or a security group blocks it). Reads may work against the local replica, but writes fail.
- **Route53 alias records have `evaluate_target_health = false`.** The alias record does not evaluate the ELB's target health, so unhealthy targets behind the ELB never affect DNS failover.
- **The read replica has no promotion path.** The Terraform log shows the replica was still being created (45 minutes elapsed), and there is no promotion runbook or automated failover.
- **Zero real traffic validates the DR path.** 0 requests to `/api/v1/orders` and `/api/v1/users` means DR has never been tested with real traffic.
Key clue: The API server error log shows "connection to 10.0.50.11:5432 refused (pg-primary-east)" — the DR application tries to connect to the primary database in the east region and fails. But /health returns OK because it checks a different connection.
The Fix¶
Immediate (make health check accurate)¶
- Update the health check endpoint to test write capability:

```python
@app.get("/health")
async def health():
    # Test the actual write path, not just a connection
    try:
        await db.execute("SELECT 1 FROM orders LIMIT 1")
        write_ok = await test_write_to_primary()
    except Exception as e:
        return JSONResponse(
            {"status": "degraded", "database": str(e)},
            status_code=503,
        )
    return {"status": "ok", "read": True, "write": write_ok}
```

- Or use a dedicated health check path for Route53:
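A framework-agnostic sketch of the decision logic for such a dedicated path. The two probe callables are injected and are hypothetical stand-ins for the application's real read and write checks:

```python
# Hypothetical deep-health logic for a dedicated Route53-facing path.
# The probes are injected so the same logic works under any framework;
# both probe names are illustrative stand-ins, not real APIs.

def deep_health(check_read, check_write):
    """Return (http_status, body) for a Route53-facing health check.

    Fails (503) unless BOTH the read path and the write path work,
    so a DR cluster that can only read is correctly marked unhealthy.
    """
    status = {"read": False, "write": False}
    try:
        status["read"] = bool(check_read())
        status["write"] = bool(check_write())
    except Exception as exc:
        return 503, {"status": "degraded", "error": str(exc), **status}
    if status["read"] and status["write"]:
        return 200, {"status": "ok", **status}
    return 503, {"status": "degraded", **status}
```

Wire this under a route such as `/health/deep` and point the Route53 health check there, leaving `/health` for the ELB's shallow liveness check.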
Permanent (fix DR architecture)¶
- Database failover: Configure the read replica as a promotion candidate:
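With the Terraform AWS provider, promotion is modeled by removing `replicate_source_db` from the replica resource. A sketch under that assumption; resource names and instance class are illustrative, and the replica identifier follows the verification commands below:

```hcl
# Steady state: replica-west streams from the east primary (cross-region,
# so the source is referenced by ARN).
resource "aws_db_instance" "replica_west" {
  identifier          = "replica-west"
  replicate_source_db = aws_db_instance.pg_primary_east.arn
  instance_class      = "db.r6g.large" # illustrative

  # To promote during failover: delete the replicate_source_db line and
  # apply. Terraform promotes the replica to a standalone, writable
  # instance. The equivalent one-off CLI step is:
  #   aws rds promote-read-replica --db-instance-identifier replica-west
  backup_retention_period = 7 # enable backups on the promoted instance
}
```

Either way, the promotion steps belong in a written runbook so the DR operator is not improvising at 3 a.m.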
- Application configuration: Use DNS-based database endpoints that switch during failover:
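One way to remove the hardcoded 10.0.50.11 is a stable internal CNAME the application always connects to; at failover, the record is repointed at the promoted replica's endpoint. A sketch with illustrative zone and record names:

```hcl
# App connects to db-writer.internal.megacorp.io instead of an IP.
# At failover, repoint this CNAME at the promoted replica's endpoint
# and the application follows without a config change or redeploy.
resource "aws_route53_record" "db_writer" {
  zone_id = aws_route53_zone.internal.zone_id
  name    = "db-writer.internal.megacorp.io"
  type    = "CNAME"
  ttl     = 30 # short TTL so the failover flip propagates quickly
  records = [aws_db_instance.pg_primary_east.address]
}
```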
- Enable target health evaluation: set `evaluate_target_health = true` on the Route53 alias records:
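In the alias block of both failover records, flip the flag the IaC snippet showed as `false`. A sketch for the primary record; resource names are illustrative:

```hcl
resource "aws_route53_record" "api_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.megacorp.io"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.api.id

  alias {
    name                   = aws_lb.east.dns_name
    zone_id                = aws_lb.east.zone_id
    evaluate_target_health = true # was false; ELB target health now gates DNS
  }
}
```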
- Regular DR testing: Schedule synthetic traffic to the DR cluster to validate the full read and write path.
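A minimal sketch of such a synthetic probe in Python (stdlib only). The endpoint paths come from this document; the DR base URL is an assumption:

```python
# Hypothetical synthetic DR probe: exercises the same read and write
# endpoints real clients use, hitting the DR cluster's ELB directly
# (bypassing Route53 failover routing).
import urllib.request

DR_BASE = "https://dr-west.elb.internal.megacorp.io"  # hypothetical URL

CHECKS = [
    ("GET", "/api/v1/orders"),       # read path (local replica)
    ("GET", "/api/v1/users"),        # read path
    ("POST", "/api/v1/test-write"),  # write path -- the one DR never exercised
]

def probe(base, checks, opener=urllib.request.urlopen):
    """Run each check and return {path: status_code_or_error_string}."""
    results = {}
    for method, path in checks:
        req = urllib.request.Request(
            base + path,
            method=method,
            data=b"{}" if method == "POST" else None,
            headers={"Content-Type": "application/json"},
        )
        try:
            with opener(req, timeout=5) as resp:
                results[path] = resp.status
        except Exception as exc:
            results[path] = f"error: {exc}"
    return results
```

Run it on a schedule (cron or a CI job) and alert on any non-200 result, so a broken write path surfaces before a real failover.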
Verification¶
```shell
# Test health endpoint returns accurate status
kubectl --context=dr-west exec -n platform deploy/api-server -- \
  curl -s http://localhost:8080/health

# Test write path explicitly
kubectl --context=dr-west exec -n platform deploy/api-server -- \
  curl -s -X POST http://localhost:8080/api/v1/test-write

# Check Route53 health check status
aws route53 get-health-check-status --health-check-id HC-XXXXX

# Verify database replica status
aws rds describe-db-instances --db-instance-identifier replica-west \
  --query 'DBInstances[0].{Status:DBInstanceStatus,ReplicaLag:StatusInfos}'
```
Artifact Decoder¶
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | All pods Running, `/health` returns OK, `/ready` returns OK; everything looks green | `1/1 Running` and healthy responses hide the fact that the system cannot serve real traffic |
| Metrics | Zero real API requests and NaN query duration for orders mean DR was never exercised | 86,400 health check requests make the cluster look active; 2 DB connections look normal |
| IaC Snippet | `evaluate_target_health = false` plus a health check on `/health` means incomplete validation | Route53 failover config looks textbook-correct at first glance |
| Log Lines | "connection to pg-primary-east refused" reveals the write path is broken | The "health check passed" log line directly contradicts the error log from the same pod |
Skills Demonstrated¶
- Evaluating disaster recovery readiness beyond surface-level health checks
- Understanding Route53 failover routing and health check semantics
- Recognizing the difference between read path and write path in database architectures
- Identifying the gap between monitoring green status and actual operational capability
- Designing health checks that test what actually matters