
Portal | Level: L2: Operations | Topics: Database Operations, Kubernetes Storage | Domain: Kubernetes

Database Operations on Kubernetes - Primer

Why This Matters

Databases are the hardest workload to run on Kubernetes. They're stateful, they need persistent storage that survives pod restarts, they require careful backup/restore procedures, and a misconfigured migration can take down production. Every SRE interview will ask about running stateful workloads. Understanding StatefulSets, PVCs, backup strategies, and connection pooling separates ops engineers from script runners.

Under the hood: StatefulSets began life as "PetSets," added as an alpha API in Kubernetes 1.3 (July 2016) -- the name reflected the "pets vs cattle" metaphor. Databases are pets: they have names, stable identities, and you care when one dies. The rename to StatefulSet came in 1.5 (December 2016), well before GA in 1.9, because the community felt "PetSet" was too informal for a core API object.

StatefulSets vs Deployments

Feature           Deployment                  StatefulSet
Pod names         Random suffix (app-7b9f4)   Ordered index (app-0, app-1, app-2)
Scaling           All pods equal              Ordered creation/deletion (0 -> 1 -> 2)
Storage           Shared or none              Per-pod PVC (stable across restarts)
Network identity  Via Service (random pod)    Headless Service (pod-0.svc, pod-1.svc)
Use case          Stateless apps              Databases, message queues, distributed stores

StatefulSet Example

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: grokdevops
spec:
  serviceName: postgres-headless
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: grokdevops
spec:
  clusterIP: None
  selector:
    app: postgres
  ports:
    - port: 5432

Stable Network Identity

Each pod gets a DNS name: <pod-name>.<headless-service>.<namespace>.svc.cluster.local

postgres-0.postgres-headless.grokdevops.svc.cluster.local
postgres-1.postgres-headless.grokdevops.svc.cluster.local
postgres-2.postgres-headless.grokdevops.svc.cluster.local
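The naming pattern is mechanical, so the full set of replica addresses can be generated from the StatefulSet name, headless Service name, namespace, and replica count -- a small sketch using the example values from this page:

```shell
# Build the stable per-pod FQDNs for a StatefulSet
# (values taken from the example manifest above)
STS=postgres
SVC=postgres-headless
NS=grokdevops
REPLICAS=3
for i in $(seq 0 $((REPLICAS - 1))); do
  echo "${STS}-${i}.${SVC}.${NS}.svc.cluster.local"
done
```

These names stay stable across pod restarts and rescheduling, which is what lets replication configs point at postgres-0 as the primary without chasing IP addresses.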

Persistent Volume Claims (PVCs)

Volume Lifecycle

PV Created -> PVC Binds -> Pod Mounts -> Pod Deleted -> PVC Remains -> PV Retained/Deleted

Key behaviors:

  - volumeClaimTemplates create one PVC per pod (e.g., data-postgres-0)
  - PVCs survive pod restarts and rescheduling
  - Deleting a StatefulSet does NOT delete PVCs (data is preserved)
  - PVCs must be manually deleted to reclaim storage
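The PVC naming rule is worth internalizing, since it's the binding between a pod and its data -- a quick sketch with the template and StatefulSet names from this page's example:

```shell
# PVC names follow <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>
TEMPLATE=data
STS=postgres
for i in 0 1 2; do
  echo "${TEMPLATE}-${STS}-${i}"
done
# data-postgres-0, data-postgres-1, data-postgres-2
```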

Reclaim Policies

Policy    Behavior
Retain    PV kept after PVC is deleted (manual cleanup required)
Delete    PV deleted when PVC is deleted (cloud volume destroyed)
Recycle   Deprecated; don't use

# Check PVC status
kubectl get pvc -n grokdevops

# Check what PV a PVC is bound to
kubectl get pvc data-postgres-0 -n grokdevops -o jsonpath='{.spec.volumeName}'

Backup Strategies

Logical Backups (pg_dump / mysqldump)

# PostgreSQL
kubectl exec -n grokdevops postgres-0 -- \
  pg_dump -U postgres -Fc mydb > backup-$(date +%Y%m%d).dump

# MySQL (--single-transaction takes a consistent snapshot instead of locking InnoDB tables)
kubectl exec -n grokdevops mysql-0 -- \
  mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --single-transaction --all-databases > backup.sql

Pros: Portable, human-readable, can restore individual tables. Cons: Slow for large databases; mysqldump locks tables for the duration unless run with --single-transaction, and pg_dump, while it doesn't block reads or writes (it uses an MVCC snapshot), holds locks that block schema changes until it finishes.

Physical Backups (WAL Archiving / Binary)

# PostgreSQL: continuous WAL archiving with pgBackRest
kubectl exec -n grokdevops postgres-0 -- pgbackrest backup --type=full

# Point-in-time recovery
kubectl exec -n grokdevops postgres-0 -- pgbackrest restore --type=time \
  --target="2024-01-15 14:30:00"

Pros: Fast, supports point-in-time recovery (PITR). Cons: Not portable across major versions, more complex setup.
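Physical backups are only useful if they run on a schedule. A hedged sketch of a nightly full backup as a CronJob -- the image tag and the assumption that pgBackRest is configured with repository credentials and access to the data directory are hypothetical (in practice this is usually wired up by an operator or a sidecar):

```yaml
# Hypothetical nightly full backup -- image and pgBackRest repo config are assumptions
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: grokdevops
spec:
  schedule: "0 2 * * *"        # 02:00 daily
  concurrencyPolicy: Forbid    # never let two backups overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: pgbackrest/pgbackrest:latest   # hypothetical image
              command: ["pgbackrest", "backup", "--type=full"]
```

concurrencyPolicy: Forbid matters here: a full backup that outruns its schedule window must not be joined by a second one writing to the same repository.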

Volume Snapshots

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot-20240115
  namespace: grokdevops
spec:
  volumeSnapshotClassName: csi-snapshot
  source:
    persistentVolumeClaimName: data-postgres-0

Pros: Fast (storage-level), consistent if quiesced first. Cons: Requires CSI driver support, may need fsfreeze for consistency.
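Restoring from a snapshot means creating a new PVC that references it as a dataSource; the CSI driver pre-populates the volume. A sketch using the snapshot defined above (the storage class must belong to a CSI driver with snapshot support):

```yaml
# New PVC pre-populated from the VolumeSnapshot above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore
  namespace: grokdevops
spec:
  storageClassName: local-path   # must be a snapshot-capable CSI driver
  dataSource:
    name: postgres-snapshot-20240115
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```

A pod (or StatefulSet replica) then mounts postgres-restore like any other PVC; the original PVC and snapshot are untouched.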

Database Migrations

Gotcha: volumeClaimTemplates create PVCs named <template-name>-<pod-name> (e.g., data-postgres-0). If you rename a StatefulSet, the new pods look for PVCs with the new name and won't find the old ones. Your data is still on disk but the binding is broken. This is why StatefulSet renames are effectively destructive operations.

Why Migrations Are Dangerous

  1. Schema changes can lock tables for hours
  2. Failed migrations can leave the database in an inconsistent state
  3. Rollback is often not automatic
  4. Migrations must be backward-compatible if you do rolling deployments

Safe Migration Pattern

1. Deploy new code that works with BOTH old and new schema
2. Run migration (add column, create table)
3. Deploy code that uses new schema
4. Drop old columns/tables in a later migration

Init Container for Migrations

initContainers:
  - name: migrate
    image: ghcr.io/org/grokdevops:v1.2.3
    command: ["python", "-m", "alembic", "upgrade", "head"]
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url

Migration as a Job

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v1-2-3
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: ghcr.io/org/grokdevops:v1.2.3
          command: ["python", "-m", "alembic", "upgrade", "head"]
      restartPolicy: Never
  backoffLimit: 3

Connection Pooling

Why You Need It

PostgreSQL forks a process per connection. At 100+ connections, memory usage spikes and performance degrades. Connection pooling multiplexes many app connections onto fewer database connections.

Remember: Rule of thumb for PostgreSQL max connections: "Cores times 2, plus disk spindles." A 4-core server with SSDs should have ~10-20 actual database connections, not 200. PgBouncer sits in front and multiplexes hundreds of application connections onto that small pool. More connections does not mean more throughput -- past a point, it means more contention and context switching.
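The arithmetic behind the rule of thumb, with hypothetical hardware numbers:

```shell
# Rule of thumb: pool size = cores * 2 + disk spindles
CORES=4
SPINDLES=1   # treat an SSD roughly as one "spindle"
POOL_SIZE=$((CORES * 2 + SPINDLES))
echo "$POOL_SIZE"   # 9 actual DB connections -- far below a default max_connections of 100
```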

PgBouncer

# PgBouncer sidecar pattern
containers:
  - name: app
    image: ghcr.io/org/grokdevops:latest
    env:
      - name: DATABASE_URL
        value: postgresql://user:pass@localhost:6432/mydb  # Connect to PgBouncer
  - name: pgbouncer
    image: bitnami/pgbouncer:latest
    ports:
      - containerPort: 6432
    env:
      - name: PGBOUNCER_DATABASE
        value: mydb
      - name: POSTGRESQL_HOST
        value: postgres-0.postgres-headless
      - name: PGBOUNCER_POOL_MODE
        value: transaction  # Best for most apps
      - name: PGBOUNCER_MAX_CLIENT_CONN
        value: "200"
      - name: PGBOUNCER_DEFAULT_POOL_SIZE
        value: "20"

Pool Modes

Mode         Behavior                               Best for
session      One DB conn per client session         Legacy apps using prepared statements
transaction  Conn returned after each transaction   Most web apps (recommended)
statement    Conn returned after each statement     Simple read-only queries

Database Operators

Instead of managing StatefulSets manually, use a database operator:

Operator                   Database       Features
CloudNativePG              PostgreSQL     HA, backup to S3, PITR, connection pooling
Zalando Postgres Operator  PostgreSQL     Patroni-based HA, logical backups
Percona Operator           MySQL/MongoDB  HA, backups, proxying
CrunchyData PGO            PostgreSQL     HA, backups, monitoring

CloudNativePG Example

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grokdevops-db
  namespace: grokdevops
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: local-path
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/grokdevops
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
  monitoring:
    enablePodMonitor: true

Common Pitfalls

  1. No backup testing — An untested backup is not a backup. Regularly restore to a test environment.
  2. PVC deletion — Deleting a StatefulSet's PVCs deletes your data permanently.
  3. Connection exhaustion — Apps open too many connections. Use PgBouncer.
  4. Migration without rollback plan — Always have a way to undo a migration.
  5. Storage class mismatch — Using ReadWriteMany when the storage only supports ReadWriteOnce.
  6. No resource limits on DB pods — Databases will consume all available memory. Set limits.

War story: A common Kubernetes database disaster: the team deletes a StatefulSet thinking they'll recreate it, not realizing that kubectl delete statefulset postgres --cascade=foreground also deletes the pods (expected) but the PVCs survive (also expected). The danger is when someone then manually deletes the "orphaned" PVCs to "clean up" -- that's the actual data destruction. Always verify PVC contents before deletion with kubectl exec into a temporary pod that mounts the volume.
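That final verification step can be done with a throwaway pod that mounts the candidate PVC read-only -- a sketch, assuming the PVC names from this page:

```yaml
# Temporary inspection pod: mount the PVC read-only, exec in, look around
apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspect
  namespace: grokdevops
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "3600"]   # keep the pod alive for inspection
      volumeMounts:
        - name: data
          mountPath: /mnt/data
          readOnly: true
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-postgres-0
```

Then kubectl exec -it pvc-inspect -n grokdevops -- ls /mnt/data. One caveat: with ReadWriteOnce storage the volume can only attach to one node, so the inspection pod may not schedule until the database pod that holds the volume is gone.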

