Portal | Level: L3: Advanced | Topics: Database Operations | Domain: Kubernetes
Scenario: Database Failover During Deployment¶
The Prompt¶
"During a routine deployment, your PostgreSQL primary pod was evicted for a node drain. The operator promoted a replica, but the application is now throwing 'connection refused' errors. What do you do?"
Initial Report¶
PagerDuty: "grokdevops API returning 500 errors. Database connection pool exhausted. pg-cluster-1 primary changed from pod-0 to pod-1."
Constraints¶
- Time pressure: All API requests are failing.
- Database operator is managing PostgreSQL (CloudNativePG or similar).
- No manual database intervention without understanding the operator's state.
Observable Evidence¶
- Application logs: FATAL: connection to server at "pg-cluster-rw:5432" refused
- kubectl get pods -n database: pod-0 is Pending (node drained), pod-1 is Running
- Operator logs show: "promoted pod-1 to primary"
- Service pg-cluster-rw endpoints may be stale
Expected Investigation Path¶
# 1. Check database cluster status
kubectl get cluster pg-cluster -n database -o yaml | grep -A10 status
# 2. Check the read-write service endpoints
kubectl get endpoints pg-cluster-rw -n database
# May show no endpoints or still pointing to old primary
# 3. Check operator logs
kubectl logs deploy/cnpg-controller-manager -n cnpg-system --tail=100
# 4. Check the new primary's readiness
kubectl exec pg-cluster-1 -n database -- pg_isready
# 5. If service endpoints are stale, the operator may need a moment
# Wait for operator to reconcile (usually < 30s)
# 6. If app is using connection pooling, connections may be cached
# Force pool refresh:
kubectl rollout restart deployment grokdevops -n grokdevops
# 7. Verify application recovery
kubectl logs deploy/grokdevops -n grokdevops --tail=20
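Steps 2 and 4 above can be combined into a single staleness check: compare the read-write Service's endpoint IP against the new primary's pod IP. This is a minimal sketch with stand-in values; the commented kubectl commands (names assumed, CNPG-style) show where the real values would come from.

```shell
# Sketch: confirm the read-write Service points at the new primary.
# In a live cluster the two values would come from kubectl (names assumed):
#   ep=$(kubectl get endpoints pg-cluster-rw -n database \
#        -o jsonpath='{.subsets[0].addresses[0].ip}')
#   pod=$(kubectl get pod pg-cluster-1 -n database -o jsonpath='{.status.podIP}')
ep="10.42.0.15"    # stand-in for the Service endpoint IP
pod="10.42.0.15"   # stand-in for the new primary's pod IP
if [ "$ep" = "$pod" ]; then
  echo "rw endpoint matches the new primary"
else
  echo "STALE: rw endpoint ($ep) does not match primary pod IP ($pod)"
fi
```

If the two IPs disagree after the operator's usual reconciliation window, that points at the Service rather than the database itself.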
Root Cause¶
During the node drain, the primary pod was evicted. The operator detected the failure and promoted a replica. However:
1. The application's connection pool held stale connections to the old primary
2. The read-write Service needed time to update its endpoints
3. PgBouncer (if used) cached connections to the old primary
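The "wait for the operator to reconcile" remedy (step 5 in the investigation path) is better expressed as a bounded polling loop than an open-ended wait. A sketch with a stubbed probe; the real probe (commented) and the timings are assumptions:

```shell
# Bounded wait for operator reconciliation (sketch).
# In a live cluster the probe would be:
#   check_primary() { kubectl exec pg-cluster-1 -n database -- pg_isready -q; }
check_primary() { true; }   # stub standing in for the kubectl probe above
ready=0
for i in 1 2 3 4 5 6; do    # ~30s total at 5s intervals
  if check_primary; then ready=1; break; fi
  sleep 5
done
[ "$ready" -eq 1 ] && echo "primary ready" || echo "timed out; inspect operator logs"
```

If the loop times out, escalate to the operator logs rather than intervening in the database by hand.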
What a Strong Answer Includes¶
- Understanding that database operators handle failover automatically
- Calm triage: check whether the operator has already promoted a replica before intervening manually
- Knowledge that connection pools cache connections to a dead primary and need a restart or pool refresh
- Recognition that the fix may simply be waiting for the operator to finish reconciliation
- A rolling restart of application pods to pick up fresh connections
- Post-incident hardening: PodDisruptionBudgets to protect database pods, and properly tuned drain timeouts
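The post-incident hardening above can be sketched as a PodDisruptionBudget. The names and label selector are assumptions (CloudNativePG labels instance pods with `cnpg.io/cluster`), and note that some operators, CloudNativePG included, can manage disruption budgets themselves, so check before adding one manually:

```yaml
# Sketch of a PDB keeping at least one database pod schedulable during drains.
# Namespace, name, and labels are assumptions for this scenario.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-cluster-pdb
  namespace: database
spec:
  minAvailable: 1
  selector:
    matchLabels:
      cnpg.io/cluster: pg-cluster
```

With this in place, `kubectl drain` blocks rather than evicting the last available instance, turning a surprise failover into a deliberate one.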
Wiki Navigation¶
Related Content¶
- AWS Database Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations on Kubernetes (Topic Pack, L2) — Database Operations
- Database Ops Drills (Drill, L2) — Database Operations
- PostgreSQL Operations (Topic Pack, L2) — Database Operations
- Redis Operations (Topic Pack, L2) — Database Operations
- SQL Fundamentals (Topic Pack, L0) — Database Operations
- SQLite Operations & Internals (Topic Pack, L2) — Database Operations
- Skillcheck: Database Ops (Assessment, L2) — Database Operations