Database Operations Cheat Sheet

Gotcha: StatefulSet pod names are predictable (postgres-0, postgres-1, etc.) and PVCs are bound 1:1 to pods. If you delete a StatefulSet pod, it gets recreated with the same name and reattaches to the same PVC. But if you scale down (e.g., 3 to 2), the PVC for postgres-2 is NOT deleted — it remains as an orphan. Scaling back up reattaches it, which can cause stale data issues if you expected a fresh volume.
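If orphaned PVCs after scale-down are unwanted, newer Kubernetes releases (beta since 1.27, stable in recent versions) let the StatefulSet clean them up itself. A sketch, applied to the postgres StatefulSet below; only the retention policy stanza is new:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  # Opt in to automatic PVC cleanup (StatefulSetAutoDeletePVC feature).
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete    # delete the PVC of a pod removed by scale-down
    whenDeleted: Retain   # keep PVCs if the whole StatefulSet is deleted
  # ...rest of the spec unchanged
```

With this in place, scaling 3 to 2 deletes data-postgres-2 instead of leaving it orphaned — which also means scaling back up gets a fresh, empty volume.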

StatefulSet Essentials

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres    # Must match headless Service
  replicas: 3
  podManagementPolicy: OrderedReady  # or Parallel
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
---
# Headless Service (required for stable DNS)
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None   # Headless
  selector:
    app: postgres
  ports:
  - port: 5432

DNS names follow <pod>.<service>.<namespace>.svc.cluster.local, e.g. postgres-0.postgres.database.svc.cluster.local

PostgreSQL Commands

# Backup (custom format, compressed)
pg_dump -U postgres -Fc mydb > backup.dump

# Restore
pg_restore -U postgres -d mydb --clean backup.dump

# All databases + roles (plain SQL)
pg_dumpall -U postgres > full_backup.sql

# Check replication status
psql -U postgres -c "SELECT * FROM pg_stat_replication;"

# Check replication lag (run on a replica)
psql -U postgres -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# Connection count
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

Backup Strategies

Strategy               RPO      Complexity  Use Case
pg_dump CronJob        Hours    Low         Small DBs, dev/staging
WAL archiving + PITR   Seconds  Medium      Production
Streaming replication  ~0       Medium      HA with fast failover
Operator (CNPG)        Seconds  Low         Best practice
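The pg_dump CronJob row can be sketched as a minimal manifest. The image tag, schedule, and the pg-creds secret and pg-backups PVC names are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: pg-dump
            image: postgres:16     # match your server's major version
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres-0.postgres -U postgres -Fc mydb > /backups/mydb-$(date +%F).dump
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: { name: pg-creds, key: password }  # assumed secret
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: pg-backups   # assumed pre-created PVC
```

Writing to a PVC keeps the dump outside the database pod; for real off-cluster safety, sync /backups to object storage in a follow-up step.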

Remember: PostgreSQL backup formats: pg_dump -Fc (custom, compressed, most flexible — can restore individual tables), pg_dump -Fp (plain SQL, human-readable, large), pg_dumpall (all databases + roles, but plain text only). For production, always use -Fc — it is smaller and supports parallel restore with pg_restore -j 4.

Connection Pooling (PgBouncer)

App (1000 connections) → PgBouncer (20 backend connections) → PostgreSQL
Pool Mode     When Connection Returns   Use Case
transaction   After each transaction    Most apps (recommended)
session       When client disconnects   Temp tables, LISTEN/NOTIFY (PgBouncer default)
statement     After each statement      Simple queries only
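A minimal pgbouncer.ini for transaction pooling; the database name, host, and auth file path are assumptions:

```ini
[databases]
mydb = host=postgres-0.postgres port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; the default is session -- no multiplexing
default_pool_size = 20       ; backend connections per database/user pair
max_client_conn = 1000       ; client connections accepted on the front
```

This maps directly to the diagram above: up to 1000 clients share 20 backend connections per database/user pair.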

PVC Operations

# Expand PVC (StorageClass must allow it)
kubectl patch pvc data-postgres-0 -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Check expansion status
kubectl get pvc data-postgres-0 -o jsonpath='{.status.conditions}'

# Delete PVC (careful! data loss)
kubectl delete pvc data-postgres-0   # Only after backup!

# List PVCs and usage
kubectl get pvc -n database
kubectl exec postgres-0 -- df -h /var/lib/postgresql/data
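The PVC expansion patch only succeeds when the StorageClass opts in. A sketch of a matching fast-ssd class; the provisioner and parameters are assumptions (use your cluster's CSI driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # assumed; substitute your CSI driver
allowVolumeExpansion: true     # required for the kubectl patch above to work
parameters:
  type: gp3
```

Note that PVCs can only grow, never shrink, and some filesystems require a pod restart to pick up the new size.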

CloudNativePG Operator

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: fast-ssd

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"

  backup:
    barmanObjectStore:
      destinationPath: s3://backups/app-db
      s3Credentials:
        accessKeyId: { name: s3-creds, key: ACCESS_KEY }
        secretAccessKey: { name: s3-creds, key: SECRET_KEY }
    retentionPolicy: "30d"

  monitoring:
    enablePodMonitor: true

Default trap: PgBouncer's default pool mode is session, which offers no multiplexing benefit — each client holds a backend connection for the entire session. Switch to transaction mode for real connection savings. But beware: transaction mode breaks LISTEN/NOTIFY, prepared statements, and temporary tables because each query may use a different backend connection.
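With CloudNativePG you normally don't deploy PgBouncer by hand — the operator manages it through a Pooler resource. A sketch against the app-db cluster above; the instance count and pool sizes are assumptions:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler
spec:
  cluster:
    name: app-db
  instances: 2
  type: rw                  # route to the current primary
  pgbouncer:
    poolMode: transaction   # avoid the session-mode default trap
    parameters:
      default_pool_size: "20"
      max_client_conn: "1000"
```

Applications then connect to the app-db-pooler Service instead of the database Service, and the operator keeps the pooler pointed at the primary across failovers.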

Key Metrics to Monitor

# Connections near limit
pg_stat_activity_count / pg_settings_max_connections > 0.8

# Replication lag
pg_replication_lag_seconds > 10

# Dead tuples (needs VACUUM)
pg_stat_user_tables_n_dead_tup > 10000

# Cache hit ratio (should be > 99%)
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)

# Disk usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
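The cache hit ratio formula above, worked through with made-up counter values as a quick shell sanity check:

```shell
# Hypothetical counter values as read from pg_stat_database
blks_hit=990000
blks_read=10000

# ratio = blks_hit / (blks_hit + blks_read); alert when it drops below 0.99
ratio=$(awk -v h="$blks_hit" -v r="$blks_read" 'BEGIN { printf "%.2f", h / (h + r) }')
echo "cache hit ratio: $ratio"   # prints 0.99 for these values
```

A ratio this high means almost all reads are served from shared_buffers; a sustained drop usually calls for more memory or a look at query patterns.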

Failover Checklist

1. Detect primary failure (health check / operator)
2. Identify most up-to-date replica (least replication lag)
3. Promote replica: pg_ctl promote
4. Update Service to point to new primary
5. Reconfigure remaining replicas to follow new primary
6. Verify application connectivity
7. Investigate and repair failed node