Database Operations - Street-Level Ops¶
Real-world patterns and gotchas from running databases on Kubernetes.
Quick Diagnosis Commands¶
# Check StatefulSet status
kubectl get statefulset -n grokdevops
kubectl describe statefulset postgres -n grokdevops
# Check PVC status (Bound = good, Pending = problem)
kubectl get pvc -n grokdevops
# Check storage class
kubectl get storageclass
# Connect to PostgreSQL
kubectl exec -it postgres-0 -n grokdevops -- psql -U postgres
# Check replication status (PostgreSQL)
kubectl exec -n grokdevops postgres-0 -- \
psql -U postgres -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# Check connection count
kubectl exec -n grokdevops postgres-0 -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Check database size
kubectl exec -n grokdevops postgres-0 -- \
psql -U postgres -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database;"
# Check disk usage on PVC
kubectl exec -n grokdevops postgres-0 -- df -h /var/lib/postgresql/data
One-liner:
pg_stat_replicationis your single source of truth for replication health. Ifsent_lsnandreplay_lsndiverge significantly, the replica is falling behind — check replica CPU, disk I/O, and network throughput. A replica that is hours behind is useless for failover.
Gotcha: PVC Pending (No Storage)¶
kubectl get pvc -n grokdevops
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# data-postgres-0 Pending local-path
kubectl describe pvc data-postgres-0 -n grokdevops
# Events: waiting for a volume to be created
Causes: 1. StorageClass doesn't exist or is misconfigured 2. No available PVs (for static provisioning) 3. Node has insufficient disk space 4. CSI driver is not installed
Fix: Check storageclass exists, CSI driver pods are running, and node has disk.
Gotcha: StatefulSet Won't Scale Up¶
Pods created in order. If pod-0 isn't Ready, pod-1 won't be created.
# Check why pod-0 isn't ready
kubectl describe pod postgres-0 -n grokdevops
kubectl logs postgres-0 -n grokdevops
Common causes: PVC pending, readiness probe failing, init container stuck.
Under the hood: StatefulSets create pods sequentially (pod-0, then pod-1, then pod-2) and delete them in reverse order. This ordered startup is critical for databases that need a primary to be ready before replicas connect. The
podManagementPolicy: Paralleloption skips this ordering — never use it for databases unless your replication protocol handles concurrent startup gracefully.
Gotcha: Accidental PVC Deletion¶
# PROTECT critical PVCs with a finalizer
kubectl patch pvc data-postgres-0 -n grokdevops \
-p '{"metadata":{"finalizers":["kubernetes.io/pvc-protection"]}}'
# This is actually set by default, but verify it's there
kubectl get pvc data-postgres-0 -n grokdevops -o jsonpath='{.metadata.finalizers}'
Pattern: Readiness Probe for Database¶
readinessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
Pattern: Backup CronJob¶
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: grokdevops
spec:
schedule: "0 2 * * *" # 2 AM daily
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:16
command:
- /bin/sh
- -c
- |
pg_dump -U postgres -h postgres-0.postgres-headless -Fc grokdevops | \
gzip > /backups/grokdevops-$(date +%Y%m%d-%H%M%S).dump.gz
# Keep only last 7 days
find /backups -name "*.dump.gz" -mtime +7 -delete
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
volumeMounts:
- name: backup-storage
mountPath: /backups
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: postgres-backups
restartPolicy: OnFailure
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
Pattern: Connection String Management¶
# ConfigMap for non-sensitive parts
apiVersion: v1
kind: ConfigMap
metadata:
name: db-config
data:
DB_HOST: postgres-0.postgres-headless.grokdevops
DB_PORT: "5432"
DB_NAME: grokdevops
DB_SSLMODE: require
# Secret for credentials
apiVersion: v1
kind: Secret
metadata:
name: db-credentials
stringData:
DB_USER: grokdevops_app
DB_PASSWORD: <from-vault-or-sealed-secret>
DATABASE_URL: postgresql://grokdevops_app:<password>@postgres-0.postgres-headless:5432/grokdevops?sslmode=require
Emergency: Database Disk Full¶
# Check disk usage
kubectl exec -n grokdevops postgres-0 -- df -h /var/lib/postgresql/data
# Find large tables
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
"SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;"
# Emergency: VACUUM to reclaim space
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c "VACUUM FULL;"
> **Gotcha:** `VACUUM FULL` locks the entire table for the duration of the operation — reads and writes are blocked. On a multi-GB table this can take hours, causing a full outage. Use regular `VACUUM` (no lock, runs concurrently) for routine maintenance. Only use `VACUUM FULL` during a planned maintenance window when the database is not serving traffic.
# Long-term: resize PVC (if storage class supports it)
kubectl patch pvc data-postgres-0 -n grokdevops \
-p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
Emergency: Too Many Connections¶
# Check current connections
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
"SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Kill idle connections
kubectl exec -n grokdevops postgres-0 -- psql -U postgres -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
# Long-term: deploy PgBouncer
Scale note: PostgreSQL creates a new process per connection (~10MB RSS each). At 500 connections, that is 5GB of memory just for connection overhead. PgBouncer in transaction pooling mode lets you serve 500 application connections through 20-50 actual database connections, dramatically reducing memory usage and context-switching overhead.
Quick Reference¶
- Cheatsheet: Database-Ops