Portal | Level: L2: Operations | Topics: Database Operations | Domain: Kubernetes

Database Operations - Skill Check

Mental model (bottom-up)

Databases are stateful workloads. They need stable identity, persistent storage, and ordered operations, which is why they run on Kubernetes as StatefulSets rather than Deployments. Production databases also need backup/restore, replication, failover, and connection pooling. Operators automate the hard parts.
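To make "stable identity" and "persistent storage" concrete, here is a minimal sketch that builds a headless Service and a StatefulSet as plain Python dicts and prints them as JSON (which kubectl accepts). The name `db`, the image tag, the storage size, and the helper `build_db_manifests` are illustrative assumptions; credentials and replication settings are deliberately left out, so this shows only the shape of the objects and is not a working database.

```python
import json


def build_db_manifests(name="db", namespace="default", replicas=3,
                       image="postgres:16", storage="10Gi"):
    """Return (headless_service, statefulset) as plain dicts.

    All defaults are hypothetical example values.
    """
    labels = {"app": name}
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterIP": "None",  # headless: DNS record per pod
            "selector": labels,
            "ports": [{"name": "postgres", "port": 5432}],
        },
    }
    statefulset = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "serviceName": name,  # ties pod DNS names to the headless Service
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "postgres",
                        "image": image,
                        "ports": [{"containerPort": 5432}],
                        "volumeMounts": [{
                            "name": "data",
                            "mountPath": "/var/lib/postgresql/data",
                        }],
                    }],
                },
            },
            # One PVC per pod (data-db-0, data-db-1, ...), kept across rescheduling.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }
    return service, statefulset


if __name__ == "__main__":
    for manifest in build_db_manifests():
        print(json.dumps(manifest, indent=2))
```

The `volumeClaimTemplates` entry is what a Deployment lacks: it gives each ordinal pod its own claim instead of sharing one volume across replicas.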

Visual stack

[Application Pods    ]  connect via Service DNS
|
[Connection Pooler   ]  PgBouncer multiplexes connections
|
[Read-Write Service  ]  points to primary only
|
[StatefulSet         ]  db-0 (primary), db-1 (replica), db-2 (replica)
|
[PVCs                ]  each pod gets its own persistent volume
|
[WAL Archive / S3    ]  continuous backup for point-in-time recovery
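The pooler layer in the stack exists because client connection counts multiply with pod count. A rough sketch of that arithmetic, using this skill check's own figures (500 pods, 5 connections each, roughly 20 server connections); the function names are hypothetical.

```python
def client_connections(pods, conns_per_pod):
    """Total connections the application tier would open."""
    return pods * conns_per_pod


def multiplex_ratio(client_conns, server_pool_size):
    """How many client connections share each PostgreSQL backend."""
    return client_conns / server_pool_size


clients = client_connections(pods=500, conns_per_pod=5)
ratio = multiplex_ratio(clients, server_pool_size=20)

print(clients)  # 2500 connections without a pooler
print(ratio)    # 125.0 clients per server connection with a pool of 20
```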

Glossary

  • StatefulSet - K8s workload giving pods stable names, ordered start/stop, and persistent volumes
  • headless Service - Service with clusterIP: None providing a DNS record per pod (db-0.<service>.<namespace>.svc.cluster.local with the default cluster domain)
  • WAL (Write-Ahead Log) - PostgreSQL transaction log enabling point-in-time recovery
  • PITR (Point-in-Time Recovery) - restore the database to a chosen moment covered by the archived WAL, using a base backup + WAL replay
  • PgBouncer - lightweight connection pooler sitting between apps and PostgreSQL
  • pg_dump / pg_restore - logical backup/restore (SQL-level, portable)
  • CloudNativePG - Kubernetes operator automating PostgreSQL lifecycle
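As an illustration of the per-pod DNS names a headless Service provides, the sketch below derives them from the StatefulSet name, Service name, and namespace. The helper `pod_dns_names` and the namespace `prod` are assumptions for the example; `cluster.local` is the default cluster domain and may differ in a given cluster.

```python
def pod_dns_names(statefulset, service, namespace, replicas,
                  cluster_domain="cluster.local"):
    """Fully qualified DNS names for each StatefulSet pod, in ordinal order."""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


# Hypothetical values: StatefulSet "db", headless Service "db", namespace "prod".
for host in pod_dns_names("db", "db", "prod", replicas=3):
    print(host)
```

Because the names are derived from ordinals rather than random suffixes, replicas can be configured to follow a specific peer by hostname.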

Core questions (easy -> hard)

  • Why StatefulSet and not Deployment for databases?
      • StatefulSets give stable pod names (db-0, db-1), ordered start/stop, and per-pod PVCs that survive rescheduling.
  • How do you back up a database in Kubernetes?
      • Three options, in increasing robustness: a pg_dump CronJob (simple), WAL archiving to S3 (production), or operator-managed backups (best).
  • What's the difference between logical and physical backups?
      • Logical (pg_dump): SQL-level, portable, slow for large DBs. Physical (pg_basebackup + WAL): file-level, fast, supports PITR.
  • Why use a connection pooler?
      • PostgreSQL forks a backend process per connection, so 500 pods x 5 connections each would mean 2500 backend processes. PgBouncer multiplexes those client connections onto ~20 server connections.
  • How does failover work with an operator?
      • The operator detects the primary failure, promotes the most current replica, reconfigures the remaining replicas, and updates the Service endpoints. RTO: ~10-30s.
  • You need to restore to 5 minutes ago. Walk through it.
      • Stop app traffic. Restore the latest base backup taken before the target time. Set recovery_target_time (and a restore_command pointing at the WAL archive). Let PostgreSQL replay WAL segments up to that time. Promote. Verify data.
  • PVC is 95% full. What do you do?
      • Patch the PVC to a larger size (the StorageClass must set allowVolumeExpansion: true). A pod restart may be needed for the filesystem resize. PVCs cannot be shrunk.
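The "restore to 5 minutes ago" walkthrough can be sketched as two small steps: choose the base backup to start from, then render the recovery settings. This is a minimal sketch under stated assumptions: the backup timestamps, the `cp /archive/%f %p` restore command, and both helper names are hypothetical, and an operator or backup tool would normally do this for you. The parameter names `restore_command`, `recovery_target_time`, and `recovery_target_action` are standard PostgreSQL settings.

```python
from datetime import datetime, timedelta, timezone


def pick_base_backup(backup_times, target):
    """Latest base backup taken at or before the recovery target."""
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    return max(candidates)


def recovery_settings(target, restore_command):
    """Render PostgreSQL recovery parameters for a point-in-time restore.

    On PostgreSQL 12+ these go in postgresql.conf, and an empty
    recovery.signal file in the data directory triggers recovery.
    """
    ts = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return "\n".join([
        f"restore_command = '{restore_command}'",
        f"recovery_target_time = '{ts}'",
        "recovery_target_action = 'promote'",
    ])


# Hypothetical incident: "now" is fixed so the example is reproducible.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
target = now - timedelta(minutes=5)

backups = [
    datetime(2024, 4, 30, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 11, 58, tzinfo=timezone.utc),  # too late: after target
]

print(pick_base_backup(backups, target))
print(recovery_settings(target, "cp /archive/%f %p"))
```

Note that the most recent backup is skipped here because it was taken after the target; WAL replay only moves forward from a base backup.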

Wiki Navigation

Prerequisites