Database Operations Cheat Sheet¶
Gotcha: StatefulSet pod names are predictable (`postgres-0`, `postgres-1`, etc.) and PVCs are bound 1:1 to pods. If you delete a StatefulSet pod, it is recreated with the same name and reattaches to the same PVC. But if you scale down (e.g., 3 to 2), the PVC for `postgres-2` is NOT deleted; it remains as an orphan. Scaling back up reattaches it, which can cause stale-data issues if you expected a fresh volume.
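Recent Kubernetes versions (the `StatefulSetAutoDeletePVC` feature, beta since 1.27 and stable in 1.32) let you opt into automatic PVC cleanup so scale-downs don't leave orphans. A minimal sketch; verify the field is supported on your cluster version:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  # Retain is the default for both fields (the orphan behavior described above)
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs if the whole StatefulSet is deleted
    whenScaled: Delete    # delete the PVC for postgres-2 on a 3 -> 2 scale-down
  # ... rest of the StatefulSet spec unchanged
```

With `whenScaled: Delete`, scaling back up provisions a fresh volume instead of reattaching stale data.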
StatefulSet Essentials¶
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres            # Must match headless Service
  replicas: 3
  podManagementPolicy: OrderedReady  # or Parallel
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
---
# Headless Service (required for stable DNS)
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None   # Headless
  selector:
    app: postgres
  ports:
  - port: 5432
```
DNS names: `postgres-0.postgres.namespace.svc.cluster.local`
PostgreSQL Commands¶
```shell
# Backup (custom format, compressed)
pg_dump -U postgres -Fc mydb > backup.dump

# Restore
pg_restore -U postgres -d mydb --clean backup.dump

# SQL dump of all databases + roles (plain text)
pg_dumpall -U postgres > full_backup.sql

# Check replication status (run on the primary)
psql -U postgres -c "SELECT * FROM pg_stat_replication;"

# Check replication lag (run on a replica)
psql -U postgres -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# Connection count
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
```
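The time-based lag query returns NULL when run on the primary. A complementary primary-side view measures lag in bytes per standby; a sketch using `pg_stat_replication` (PostgreSQL 10+ column names):

```sql
-- Run on the primary: per-standby replication lag
SELECT application_name,
       client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag                                        AS replay_lag_time
FROM pg_stat_replication;
```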
Backup Strategies¶
| Strategy | RPO | Complexity | Use Case |
|---|---|---|---|
| pg_dump CronJob | Hours | Low | Small DBs, dev/staging |
| WAL archiving + PITR | Seconds | Medium | Production |
| Streaming replication | ~0 | Medium | HA with fast failover |
| Operator (CNPG) | Seconds | Low | Best practice |
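The first row of the table can be as simple as a CronJob running `pg_dump` on a schedule. A hedged sketch; the image tag, secret name, and backup PVC are placeholders to adapt:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16     # match your server's major version
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres-0.postgres -U postgres -Fc mydb > /backup/mydb-$(date +%F).dump
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: { name: postgres-creds, key: password }  # hypothetical secret
            volumeMounts:
            - { name: backup, mountPath: /backup }
          volumes:
          - name: backup
            persistentVolumeClaim: { claimName: backup-pvc }           # hypothetical PVC
```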
Remember: PostgreSQL backup formats: `pg_dump -Fc` (custom, compressed, most flexible; can restore individual tables), `pg_dump -Fp` (plain SQL, human-readable, large), `pg_dumpall` (all databases + roles, but plain text only). For production, always use `-Fc`: it is smaller and supports parallel restore with `pg_restore -j 4`.
Connection Pooling (PgBouncer)¶
| Pool Mode | When Connection Returns | Use Case |
|---|---|---|
| `transaction` | After each transaction | Most apps (recommended) |
| `session` | When client disconnects | Temp tables, LISTEN/NOTIFY |
| `statement` | After each statement | Simple queries only |
PVC Operations¶
```shell
# Expand PVC (StorageClass must allow it)
kubectl patch pvc data-postgres-0 -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check expansion status
kubectl get pvc data-postgres-0 -o jsonpath='{.status.conditions}'

# Delete PVC (careful! data loss)
kubectl delete pvc data-postgres-0   # Only after backup!

# List PVCs and check disk usage
kubectl get pvc -n database
kubectl exec postgres-0 -- df -h /var/lib/postgresql/data
```
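PVC expansion only succeeds if the StorageClass opts in. A sketch of what `fast-ssd` would need; the provisioner shown is an example (AWS EBS CSI), so substitute your cluster's CSI driver:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # example provisioner; use your CSI driver
allowVolumeExpansion: true     # required for the PVC patch above to succeed
parameters:
  type: gp3
```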
CloudNativePG Operator¶
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: fast-ssd
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/app-db
      s3Credentials:
        accessKeyId: { name: s3-creds, key: ACCESS_KEY }
        secretAccessKey: { name: s3-creds, key: SECRET_KEY }
    retentionPolicy: "30d"
  monitoring:
    enablePodMonitor: true
```
Default trap: PgBouncer's default pool mode is `session`, which offers no multiplexing benefit: each client holds a backend connection for the entire session. Switch to `transaction` mode for real connection savings. But beware: `transaction` mode breaks `LISTEN`/`NOTIFY`, prepared statements, and temporary tables, because each query may use a different backend connection.
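Switching the mode is a one-line change in `pgbouncer.ini`. A minimal sketch; the host, pool sizes, and auth settings are illustrative:

```ini
[databases]
mydb = host=postgres port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction   ; default is session; transaction gives real multiplexing
max_client_conn = 1000    ; client connections PgBouncer will accept
default_pool_size = 20    ; backend connections per user/database pair
auth_type = scram-sha-256
```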
Key Metrics to Monitor¶
```
# Connections near limit
pg_stat_activity_count / pg_settings_max_connections > 0.8

# Replication lag
pg_replication_lag_seconds > 10

# Dead tuples (needs VACUUM)
pg_stat_user_tables_n_dead_tup > 10000

# Cache hit ratio (should be > 99%)
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)

# Disk usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
```
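These thresholds translate directly into Prometheus alerting rules. A sketch assuming postgres_exporter-style metric names; adjust the expressions to what your exporter actually exposes:

```yaml
groups:
- name: postgres
  rules:
  - alert: PostgresConnectionsNearLimit
    expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
    for: 5m
    labels: { severity: warning }
    annotations:
      summary: "Postgres connections above 80% of max_connections"
  - alert: PostgresReplicationLagHigh
    expr: pg_replication_lag_seconds > 10
    for: 2m
    labels: { severity: critical }
    annotations:
      summary: "Replication lag above 10 seconds"
```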
Failover Checklist¶
1. Detect primary failure (health check / operator)
2. Identify most up-to-date replica (least replication lag)
3. Promote replica: pg_ctl promote
4. Update Service to point to new primary
5. Reconfigure remaining replicas to follow new primary
6. Verify application connectivity
7. Investigate and repair failed node
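Step 2 is the part operators most often script. A minimal Python sketch of the selection logic only; the replica list and `lag_seconds` values are hypothetical inputs you would populate from `pg_stat_replication` or your monitoring:

```python
def pick_promotion_candidate(replicas):
    """Pick the replica with the least replication lag.

    `replicas` is a list of dicts like {"name": ..., "lag_seconds": ...};
    lag_seconds is None for replicas that are unreachable.
    """
    healthy = [r for r in replicas if r["lag_seconds"] is not None]
    if not healthy:
        raise RuntimeError("no healthy replica to promote")
    return min(healthy, key=lambda r: r["lag_seconds"])["name"]

replicas = [
    {"name": "postgres-1", "lag_seconds": 4.2},
    {"name": "postgres-2", "lag_seconds": 0.3},
    {"name": "postgres-3", "lag_seconds": None},  # unreachable
]
print(pick_promotion_candidate(replicas))  # postgres-2
```

In practice an operator like CNPG automates all seven steps; this is only the decision rule behind the promotion choice.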