Portal | Level: L2: Operations | Topics: Database Operations | Domain: Kubernetes
Database Operations Drills¶
Remember: The database backup hierarchy: Logical (pg_dump — portable, slow for large DBs), Physical (pg_basebackup — fast, same PG version only), Continuous (WAL archiving — enables point-in-time recovery). For production: use all three. Logical backups for disaster recovery portability, physical + WAL for fast PITR. Mnemonic: "LPC" — Logical, Physical, Continuous.
Gotcha:
pg_dumptakes a snapshot at the start of the dump. Long-running dumps on busy databases hold a transaction open, which blocks autovacuum and can cause table bloat. For large databases, usepg_dump --jobs=4for parallelism and schedule dumps during low-traffic windows.
Drill 1: StatefulSet vs Deployment for Databases¶
Difficulty: Easy
Q: Why must databases use StatefulSets instead of Deployments? What guarantees does a StatefulSet provide?
Answer
StatefulSet guarantees: 1. **Stable network identity**: pods get `db-0`, `db-1`, `db-2` (not random suffixes) 2. **Ordered deployment**: `db-0` starts before `db-1` 3. **Stable storage**: each pod gets its own PVC that persists across rescheduling 4. **Ordered termination**: `db-2` deleted before `db-1` A Deployment gives random pod names, no ordering, and shared PVCs — which would cause data corruption with databases.Drill 2: Database Backup with pg_dump¶
Difficulty: Easy
Q: Write a command to back up a PostgreSQL database running in a pod named postgres-0 to a local file.
Answer
# Backup
kubectl exec postgres-0 -n database -- \
pg_dump -U postgres -Fc mydb > backup-$(date +%Y%m%d).dump
# Restore to a different instance
kubectl exec -i postgres-0 -n database -- \
pg_restore -U postgres -d mydb --clean < backup-20240115.dump
# For large databases, use streaming:
kubectl exec postgres-0 -n database -- \
pg_dump -U postgres -Fc mydb | gzip > backup.dump.gz
Drill 3: PVC Resize¶
Difficulty: Medium
Q: A database PVC is running out of space (48Gi used of 50Gi). How do you expand it without downtime?
Answer
# 1. Check if StorageClass supports expansion
kubectl get storageclass -o json | jq '.items[] | {name: .metadata.name, allowVolumeExpansion}'
# 2. Patch the PVC
kubectl patch pvc data-postgres-0 -n database -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
# 3. Monitor expansion
kubectl get pvc data-postgres-0 -n database -w
# Wait for conditions to show FileSystemResizePending → done
# 4. Some CSI drivers require pod restart for filesystem resize
kubectl delete pod postgres-0 -n database
# StatefulSet recreates it with the expanded volume
Drill 4: Connection Pooling with PgBouncer¶
Difficulty: Medium
Q: Why use a connection pooler, and how would you deploy PgBouncer in front of PostgreSQL?
Answer
Why: PostgreSQL forks a process per connection. 500 pods × 5 connections = 2500 backend connections = excessive memory. PgBouncer multiplexes many client connections over fewer backend connections.apiVersion: apps/v1
kind: Deployment
metadata:
name: pgbouncer
spec:
replicas: 2
template:
spec:
containers:
- name: pgbouncer
image: edoburu/pgbouncer
env:
- name: DATABASE_URL
value: "postgres://user:pass@postgres:5432/mydb"
- name: POOL_MODE
value: "transaction" # transaction | session | statement
- name: DEFAULT_POOL_SIZE
value: "20"
- name: MAX_CLIENT_CONN
value: "1000"
ports:
- containerPort: 5432
Drill 5: Point-in-Time Recovery¶
Difficulty: Hard
Q: Your production database was corrupted by a bad migration at 14:30. You have WAL archiving enabled. Explain how to restore to 14:29.
Answer
# 1. Stop the database
kubectl scale statefulset postgres -n database --replicas=0
# 2. Restore the base backup
kubectl exec -it restore-pod -- bash
pg_basebackup or restore from S3/GCS
# 3. Configure recovery target
cat > /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2024-01-15 14:29:00+00'
recovery_target_action = 'promote'
EOF
# 4. Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal
# 5. Start PostgreSQL — it replays WAL up to the target time
kubectl scale statefulset postgres -n database --replicas=1
# 6. Verify data integrity
kubectl exec postgres-0 -- psql -U postgres -c "SELECT count(*) FROM important_table"
Drill 6: Database Failover¶
Difficulty: Hard
Q: In a 3-node PostgreSQL cluster (1 primary, 2 replicas), the primary node dies. What happens and what do you do?
Answer
With an operator (CloudNativePG, Zalando, Crunchy):Automatic:
1. Operator detects primary failure
2. Promotes the most up-to-date replica to primary
3. Reconfigures remaining replica to follow new primary
4. Updates the Service to point to new primary
5. RTO: typically 10-30 seconds
# 1. Verify primary is truly down
kubectl get pods -n database
kubectl logs postgres-0 -n database
# 2. Check replication lag on replicas
kubectl exec postgres-1 -- psql -U postgres -c \
"SELECT pg_last_wal_replay_lsn(), pg_last_wal_receive_lsn();"
# 3. Promote the most caught-up replica
kubectl exec postgres-1 -- pg_ctl promote -D /var/lib/postgresql/data
# 4. Update the read-write Service to point to the new primary
kubectl patch svc postgres -n database -p \
'{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgres-1"}}}'
# 5. Reconfigure postgres-2 to replicate from postgres-1
Drill 7: Monitor Database Health¶
Difficulty: Medium
Q: What key metrics should you monitor for a PostgreSQL database in Kubernetes?
Answer
# Connection usage (alert at 80% of max_connections)
pg_stat_activity_count / pg_settings_max_connections
# Replication lag (alert at > 1MB or > 10s)
pg_replication_lag_seconds
pg_replication_lag_bytes
# Transaction rate
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])
# Slow queries
pg_stat_activity_max_tx_duration{state="active"}
# Dead tuples (need VACUUM)
pg_stat_user_tables_n_dead_tup
# Disk usage
pg_database_size_bytes
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
# Cache hit ratio (should be > 99%)
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)
Drill 8: Kubernetes Operator for Databases¶
Difficulty: Medium
Q: Compare running PostgreSQL manually via StatefulSet vs using the CloudNativePG operator. What do you gain?
Answer
| Feature | Manual StatefulSet | CloudNativePG Operator | |---------|-------------------|----------------------| | Failover | Manual promote + service patch | Automatic (< 30s) | | Backups | CronJob + pg_dump script | Declarative `backup` spec | | PITR | Manual WAL setup | Built-in with barman | | Monitoring | Manual exporter setup | Auto-configured | | Cert rotation | Manual | Automatic | | Rolling updates | Manual, risky | Orchestrated (replica first, then primary) | | Connection pooling | Separate PgBouncer deploy | Built-in option |# CloudNativePG Cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: app-db
spec:
instances: 3
storage:
size: 100Gi
storageClass: fast-ssd
backup:
barmanObjectStore:
destinationPath: s3://backups/app-db
s3Credentials:
accessKeyId: { name: s3-creds, key: ACCESS_KEY }
secretAccessKey: { name: s3-creds, key: SECRET_KEY }
retentionPolicy: "30d"
monitoring:
enablePodMonitor: true
Wiki Navigation¶
Prerequisites¶
- Database Operations on Kubernetes (Topic Pack, L2)
Related Content¶
- AWS Database Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations on Kubernetes (Topic Pack, L2) — Database Operations
- Interview: Database Failover During Deploy (Scenario, L3) — Database Operations
- PostgreSQL Operations (Topic Pack, L2) — Database Operations
- Redis Operations (Topic Pack, L2) — Database Operations
- SQL Fundamentals (Topic Pack, L0) — Database Operations
- SQLite Operations & Internals (Topic Pack, L2) — Database Operations
- Skillcheck: Database Ops (Assessment, L2) — Database Operations
Pages that link here¶
- Database Operations - Skill Check
- Database Operations on Kubernetes
- Database Operations on Kubernetes - Primer
- Drills
- Level 7: SRE & Cloud Operations
- PostgreSQL Operations - Primer
- Postgresql
- Redis
- Redis Operations - Primer
- SQL Fundamentals
- SQL Fundamentals - Primer
- SQLite Operations & Internals - Primer
- Scenario: Database Failover During Deployment
- Sqlite
- Track: Cloud & FinOps