Progressive Hints

Hint 1 (after 5 min)

Look at the request counts: /health and /ready have 86,400 hits (one per second for 24 hours, which matches automated health checks), but /api/v1/orders and /api/v1/users have zero requests. No real traffic is reaching the DR cluster. Now look at the database metrics: there are only 2 active connections, and db_query_duration_seconds for select_orders is NaN, meaning no order queries have been executed.
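The reasoning in this hint can be checked with a small sketch. The counts are the ones quoted above; the helper names (`probes_per_day`, `idle_business_endpoints`, `has_samples`) and the dictionary layout are hypothetical and not part of the scenario's tooling.

```python
import math

# Request counts quoted in the hint (24-hour window).
# Assumption: the 86,400 figure applies to each probe endpoint.
request_counts = {
    "/health": 86_400,
    "/ready": 86_400,
    "/api/v1/orders": 0,
    "/api/v1/users": 0,
}

# Assumed set of endpoints that only probes call.
PROBE_PATHS = {"/health", "/ready"}


def probes_per_day(interval_seconds: int) -> int:
    """Number of probes in 24 hours at a fixed interval."""
    return 24 * 60 * 60 // interval_seconds


def idle_business_endpoints(counts: dict[str, int]) -> list[str]:
    """Return non-probe endpoints that received no requests."""
    return sorted(
        path for path, hits in counts.items()
        if path not in PROBE_PATHS and hits == 0
    )


def has_samples(metric_value: float) -> bool:
    """A NaN duration metric means no observations were recorded."""
    return not math.isnan(metric_value)
```

A one-second probe interval gives `probes_per_day(1) == 86400`, which matches the hit count exactly, so all of the observed traffic can be explained by probes alone.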

Hint 2 (after 10 min)

The health check endpoint returns {"database":"connected"}, but the API server error log says: "connection to 10.0.50.11:5432 refused (pg-primary-east)." The health check succeeds because it only tests the local database connection, which works because a read replica exists in DR. Actual application queries try to reach the primary database in the east region (10.0.50.11) for writes, and that connection is refused from the west cluster. The health check is not testing the path that matters.
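One way to make the gap visible is a health check that reports the read and write dependencies separately. The following is a minimal sketch, not the application's actual handler: the function names and the dependency labels are hypothetical, and a TCP connect is a weaker test than running a real query.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def health_status(dependencies: dict[str, tuple[str, int]]) -> dict[str, str]:
    """Report each dependency by name instead of one generic 'database' key."""
    return {
        name: "connected" if tcp_reachable(host, port) else "unreachable"
        for name, (host, port) in dependencies.items()
    }
```

Called with something like `{"database_read": ("127.0.0.1", 5432), "database_write": ("10.0.50.11", 5432)}` (the local replica address is an assumption), a check of this shape would report the write path as unreachable from the west cluster instead of returning a single "connected".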

Hint 3 (after 15 min)

This is a multi-region active-passive setup. The primary cluster is in us-east and the DR cluster is in us-west. Route53 failover routing uses a health check against /health on the DR cluster's ELB. That check passes because the /health endpoint only verifies that the app can connect to a database, most likely a local read replica. The application, however, is hardcoded (or configured) to write to the primary database in us-east (pg-primary-east at 10.0.50.11), which is unreachable from the DR cluster. If failover occurs, users will reach the DR API servers and health checks will pass, but every write operation will fail. The Terraform log shows a replica was being provisioned 3 days ago, and it may not be properly configured as a promotion candidate.
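The failure mode described here can be modeled in a few lines. This is an illustrative sketch only: `ClusterState` and its fields are hypothetical names that reduce the scenario to the two conditions that matter.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    local_replica_up: bool         # what /health checks, per the hint
    write_primary_reachable: bool  # what write operations actually need


def health_check_passes(state: ClusterState) -> bool:
    """Shallow check: any working database connection counts as healthy."""
    return state.local_replica_up


def writes_succeed(state: ClusterState) -> bool:
    """Writes go to the configured primary, not the local replica."""
    return state.write_primary_reachable


def silent_failover_failure(state: ClusterState) -> bool:
    """True when failover would send users to a cluster that cannot write."""
    return health_check_passes(state) and not writes_succeed(state)


# The DR cluster as described: replica is up, pg-primary-east is unreachable.
dr_west = ClusterState(local_replica_up=True, write_primary_reachable=False)
```

For `dr_west`, `silent_failover_failure` returns `True`: the health check reports healthy while writes cannot succeed, which is exactly the state in which failover routing would still direct users to the DR cluster.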