Portal | Level: L3: Advanced | Topics: Database Operations | Domain: Kubernetes

Scenario: Database Failover During Deployment

The Prompt

"During a routine deployment, your PostgreSQL primary pod was evicted for a node drain. The operator promoted a replica, but the application is now throwing 'connection refused' errors. What do you do?"

Initial Report

PagerDuty: "grokdevops API returning 500 errors. Database connection pool exhausted. pg-cluster primary changed from pg-cluster-0 to pg-cluster-1."

Constraints

  • Time pressure: All API requests are failing.
  • Database operator is managing PostgreSQL (CloudNativePG or similar).
  • No manual database intervention without understanding the operator's state.

Observable Evidence

  • Application logs: FATAL: connection to server at "pg-cluster-rw:5432" refused
  • kubectl get pods -n database: pg-cluster-0 is Pending (node drained), pg-cluster-1 is Running
  • Operator logs show: "promoted pg-cluster-1 to primary"
  • Service pg-cluster-rw endpoints may be stale
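A quick way to test the "endpoints may be stale" hypothesis is to compare the rw Service's endpoint IP with the promoted pod's IP. A sketch (the function name is mine; cluster and namespace names are from this incident):

```shell
# check_rw_endpoint: return 0 if the read-write Service endpoint IP
# matches the given pod's IP, 1 if the endpoints look stale.
check_rw_endpoint() {
  ep_ip=$(kubectl get endpoints pg-cluster-rw -n database \
    -o jsonpath='{.subsets[0].addresses[0].ip}')
  pod_ip=$(kubectl get pod "$1" -n database \
    -o jsonpath='{.status.podIP}')
  if [ -n "$ep_ip" ] && [ "$ep_ip" = "$pod_ip" ]; then
    echo "rw service points at $1 ($ep_ip)"
  else
    echo "stale endpoints: service='$ep_ip' pod='$pod_ip'"
    return 1
  fi
}

# Usage against the promoted primary:
# check_rw_endpoint pg-cluster-1
```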

Expected Investigation Path

# 1. Check database cluster status
kubectl get cluster pg-cluster -n database -o yaml | grep -A10 'status:'
# With CloudNativePG, the current primary is also in .status.currentPrimary:
kubectl get cluster pg-cluster -n database -o jsonpath='{.status.currentPrimary}'

# 2. Check the read-write service endpoints
kubectl get endpoints pg-cluster-rw -n database
# May show no endpoints or still pointing to old primary

# 3. Check operator logs
kubectl logs deploy/cnpg-controller-manager -n cnpg-system --tail=100

# 4. Check the new primary's readiness
kubectl exec pg-cluster-1 -n database -- pg_isready

# 5. If service endpoints are stale, give the operator a moment to
#    reconcile (usually < 30s); watch for the endpoint update:
kubectl get endpoints pg-cluster-rw -n database -w

# 6. If the app uses connection pooling, stale connections to the old
#    primary may be cached. Force a pool refresh with a rolling restart:
kubectl rollout restart deployment grokdevops -n grokdevops

# 7. Verify application recovery
kubectl logs deploy/grokdevops -n grokdevops --tail=20
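Steps 5–7 amount to polling until the operator converges. A small POSIX-shell helper (the function is my sketch, not part of the incident tooling) makes that explicit instead of re-running commands by hand:

```shell
# wait_until: retry a command every second until it succeeds or the
# timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
  _timeout=$1; shift
  _elapsed=0
  until "$@"; do
    if [ "$_elapsed" -ge "$_timeout" ]; then
      echo "timed out after ${_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    _elapsed=$((_elapsed + 1))
  done
}

# Example: wait up to 60s for the rw Service to have an endpoint again
# wait_until 60 sh -c \
#   'kubectl get endpoints pg-cluster-rw -n database -o jsonpath="{.subsets[*].addresses[*].ip}" | grep -q .'
```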

Root Cause

During the node drain, the primary pod was evicted. The operator detected the failure and promoted a replica. However:

  1. The application's connection pool held stale connections to the old primary.
  2. The read-write Service needed time to update its endpoints.
  3. PgBouncer (if used) cached server connections to the old primary.
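If PgBouncer sits between the app and Postgres, bounding the lifetime of pooled server connections limits how long stale ones survive a failover. An illustrative pgbouncer.ini fragment (both options are standard PgBouncer settings; the values here are assumptions, not tuned recommendations):

```ini
[pgbouncer]
; recycle pooled server connections so ones opened against the old
; primary age out rather than living for the default hour
server_lifetime = 300
; sanity-check a connection idle longer than this before reusing it
server_check_delay = 30
```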

What a Strong Answer Includes

  • Understanding that database operators handle failover automatically
  • Don't panic: first check whether the operator has already promoted a replica
  • Knowledge that connection pools cache connections (need restart or pool refresh)
  • The fix may just be waiting for the operator to finish reconciliation
  • Rolling restart of application pods to get fresh connections
  • Post-incident: ensure PodDisruptionBudgets protect database pods, set proper drain timeouts
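For the last post-incident item, a PodDisruptionBudget keeps a drain from evicting the primary while no healthy replica is available. A minimal sketch (the selector label follows CloudNativePG's `cnpg.io/cluster` convention, assumed here; verify it against your pods, and note that some operators, CNPG included, can manage PDBs for you, so check before adding one manually):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-cluster-pdb
  namespace: database
spec:
  minAvailable: 1            # never drain below one running instance
  selector:
    matchLabels:
      cnpg.io/cluster: pg-cluster   # assumed label; verify on your pods
```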
