Portal | Level: L3: Advanced | Topics: Database Operations | Domain: Kubernetes

Scenario: Database Failover During Deployment

The Prompt

"During a routine deployment, your PostgreSQL primary pod was evicted for a node drain. The operator promoted a replica, but the application is now throwing 'connection refused' errors. What do you do?"

Initial Report

PagerDuty: "grokdevops API returning 500 errors. Database connection pool exhausted. pg-cluster primary changed from pg-cluster-0 to pg-cluster-1."

Constraints

  • Time pressure: All API requests are failing.
  • Database operator is managing PostgreSQL (CloudNativePG or similar).
  • No manual database intervention without understanding the operator's state.

Observable Evidence

  • Application logs: FATAL: connection to server at "pg-cluster-rw:5432" refused
  • kubectl get pods -n database: pg-cluster-0 is Pending (node drained), pg-cluster-1 is Running
  • Operator logs show: "promoted pg-cluster-1 to primary"
  • Service pg-cluster-rw endpoints may be stale
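A quick way to test the "endpoints may be stale" hypothesis is to compare the rw Service's endpoint IP with the promoted pod's IP. A sketch (the function name is mine; cluster and namespace names are from this incident):

```shell
# check_rw_endpoint: return 0 if the read-write Service endpoint IP
# matches the given pod's IP, 1 if the endpoints look stale.
check_rw_endpoint() {
  ep_ip=$(kubectl get endpoints pg-cluster-rw -n database \
    -o jsonpath='{.subsets[0].addresses[0].ip}')
  pod_ip=$(kubectl get pod "$1" -n database \
    -o jsonpath='{.status.podIP}')
  if [ -n "$ep_ip" ] && [ "$ep_ip" = "$pod_ip" ]; then
    echo "rw service points at $1 ($ep_ip)"
  else
    echo "stale endpoints: service='$ep_ip' pod='$pod_ip'"
    return 1
  fi
}

# Usage against the promoted primary:
# check_rw_endpoint pg-cluster-1
```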

Expected Investigation Path

# 1. Check database cluster status
kubectl get cluster pg-cluster -n database -o yaml | grep -A10 'status:'
# With CloudNativePG, the current primary is also in .status.currentPrimary:
kubectl get cluster pg-cluster -n database -o jsonpath='{.status.currentPrimary}'

# 2. Check the read-write service endpoints
kubectl get endpoints pg-cluster-rw -n database
# May show no endpoints or still pointing to old primary

# 3. Check operator logs
kubectl logs deploy/cnpg-controller-manager -n cnpg-system --tail=100

# 4. Check the new primary's readiness
kubectl exec pg-cluster-1 -n database -- pg_isready

# 5. If service endpoints are stale, give the operator a moment to
#    reconcile (usually < 30s); watch for the endpoint update:
kubectl get endpoints pg-cluster-rw -n database -w

# 6. If the app uses connection pooling, stale connections to the old
#    primary may be cached. Force a pool refresh with a rolling restart:
kubectl rollout restart deployment grokdevops -n grokdevops

# 7. Verify application recovery
kubectl logs deploy/grokdevops -n grokdevops --tail=20
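Steps 5–7 amount to polling until the operator converges. A small POSIX-shell helper (the function is my sketch, not part of the incident tooling) makes that explicit instead of re-running commands by hand:

```shell
# wait_until: retry a command every second until it succeeds or the
# timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
  _timeout=$1; shift
  _elapsed=0
  until "$@"; do
    if [ "$_elapsed" -ge "$_timeout" ]; then
      echo "timed out after ${_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    _elapsed=$((_elapsed + 1))
  done
}

# Example: wait up to 60s for the rw Service to have an endpoint again
# wait_until 60 sh -c \
#   'kubectl get endpoints pg-cluster-rw -n database -o jsonpath="{.subsets[*].addresses[*].ip}" | grep -q .'
```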

Root Cause

During the node drain, the primary pod was evicted. The operator detected the failure and promoted a replica. However:

  1. The application's connection pool held stale connections to the old primary.
  2. The read-write Service needed time to update its endpoints.
  3. PgBouncer (if used) cached server connections to the old primary.
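If PgBouncer sits between the app and Postgres, bounding the lifetime of pooled server connections limits how long stale ones survive a failover. An illustrative pgbouncer.ini fragment (both options are standard PgBouncer settings; the values here are assumptions, not tuned recommendations):

```ini
[pgbouncer]
; recycle pooled server connections so ones opened against the old
; primary age out rather than living for the default hour
server_lifetime = 300
; sanity-check a connection idle longer than this before reusing it
server_check_delay = 30
```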

What a Strong Answer Includes

  • Understanding that database operators handle failover automatically
  • Don't panic: first check whether the operator has already promoted a replica
  • Knowledge that connection pools cache connections (need restart or pool refresh)
  • The fix may just be waiting for the operator to finish reconciliation
  • Rolling restart of application pods to get fresh connections
  • Post-incident: ensure PodDisruptionBudgets protect database pods, set proper drain timeouts
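For the last post-incident item, a PodDisruptionBudget keeps a drain from evicting the primary while no healthy replica is available. A minimal sketch (the selector label follows CloudNativePG's `cnpg.io/cluster` convention, assumed here; verify it against your pods, and note that some operators, CNPG included, can manage PDBs for you, so check before adding one manually):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-cluster-pdb
  namespace: database
spec:
  minAvailable: 1            # never drain below one running instance
  selector:
    matchLabels:
      cnpg.io/cluster: pg-cluster   # assumed label; verify on your pods
```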
