Database Operations - Street-Level Ops¶

Real-world patterns and gotchas from running databases on Kubernetes.

Quick Diagnosis Commands¶

# Check StatefulSet status
kubectl get statefulset -n grokdevops
kubectl describe statefulset postgres -n grokdevops

# Check PVC status (Bound = good, Pending = problem)
kubectl get pvc -n grokdevops

# Check storage class
kubectl get storageclass

# Connect to PostgreSQL
kubectl exec -it postgres-0 -n grokdevops -- psql -U postgres

# Check replication status (PostgreSQL)
kubectl exec -n grokdevops postgres-0 -- \
  psql -U postgres -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"

# Check connection count
kubectl exec -n grokdevops postgres-0 -- \
  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
kubectl exec -n grokdevops postgres-0 -- \
  psql -U postgres -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"

# Check disk usage on PVC
kubectl exec -n grokdevops postgres-0 -- df -h /var/lib/postgresql/data

One-liner: pg_stat_replication is your single source of truth for replication health. If sent_lsn and replay_lsn diverge significantly, the replica is falling behind — check replica CPU, disk I/O, and network throughput. A replica that is hours behind is useless for failover.

Gotcha: PVC Pending (No Storage)¶

kubectl get pvc -n grokdevops
# NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgres-0   Pending                                      local-path

kubectl describe pvc data-postgres-0 -n grokdevops
# Events: waiting for a volume to be created

Causes: 1. StorageClass doesn't exist or is misconfigured 2. No available PVs (for static provisioning) 3. Node has insufficient disk space 4. CSI driver is not installed

Fix: Check storageclass exists, CSI driver pods are running, and node has disk.

Gotcha: StatefulSet Won't Scale Up¶

Pods created in order. If pod-0 isn't Ready, pod-1 won't be created.

# Check why pod-0 isn't ready
kubectl describe pod postgres-0 -n grokdevops
kubectl logs postgres-0 -n grokdevops

Common causes: PVC pending, readiness probe failing, init container stuck.

Under the hood: StatefulSets create pods sequentially (pod-0, then pod-1, then pod-2) and delete them in reverse order. This ordered startup is critical for databases that need a primary to be ready before replicas connect. The podManagementPolicy: Parallel option skips this ordering — never use it for databases unless your replication protocol handles concurrent startup gracefully.

Gotcha: Accidental PVC Deletion¶

# PROTECT critical PVCs with a finalizer
kubectl patch pvc data-postgres-0 -n grokdevops \
  -p '{"metadata":{"finalizers":["kubernetes.io/pvc-protection"]}}'

# This is actually set by default, but verify it's there
kubectl get pvc data-postgres-0 -n grokdevops -o jsonpath='{.metadata.finalizers}'

Pattern: Readiness Probe for Database¶

readinessProbe:
  exec:
    command:
      - pg_isready
      - -U
      - postgres
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  exec:
    command:
      - pg_isready
      - -U
      - postgres
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3

Pattern: Backup CronJob¶

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: grokdevops
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -U postgres -h postgres-0.postgres-headless -Fc grokdevops | \
                    gzip > /backups/grokdevops-$(date +%Y%m%d-%H%M%S).dump.gz
                  # Keep only last 7 days
                  find /backups -name "*.dump.gz" -mtime +7 -delete
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: postgres-backups
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3

Pattern: Connection String Management¶

# ConfigMap for non-sensitive parts
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-config
data:
  DB_HOST: postgres-0.postgres-headless.grokdevops
  DB_PORT: "5432"
  DB_NAME: grokdevops
  DB_SSLMODE: require

# Secret for credentials
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
stringData:
  DB_USER: grokdevops_app
  DB_PASSWORD: <from-vault-or-sealed-secret>
  DATABASE_URL: postgresql://grokdevops_app:<password>@postgres-0.postgres-headless:5432/grokdevops?sslmode=require

Emergency: Database Disk Full¶

# Check disk usage
kubectl exec -n grokdevops postgres-0 -- df -h /var/lib/postgresql/data

# Find large tables
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
  "SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;"

# Emergency: VACUUM to reclaim space
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c "VACUUM FULL;"

> **Gotcha:** `VACUUM FULL` locks the entire table for the duration of the operation — reads and writes are blocked. On a multi-GB table this can take hours, causing a full outage. Use regular `VACUUM` (no lock, runs concurrently) for routine maintenance. Only use `VACUUM FULL` during a planned maintenance window when the database is not serving traffic.

# Long-term: resize PVC (if storage class supports it)
kubectl patch pvc data-postgres-0 -n grokdevops \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Emergency: Too Many Connections¶

# Check current connections
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Kill idle connections
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"

# Long-term: deploy PgBouncer

Scale note: PostgreSQL creates a new process per connection (~10MB RSS each). At 500 connections, that is 5GB of memory just for connection overhead. PgBouncer in transaction pooling mode lets you serve 500 application connections through 20-50 actual database connections, dramatically reducing memory usage and context-switching overhead.

Quick Reference¶

Cheatsheet: Database-Ops