Level: L2: Operations | Topics: Database Operations | Domain: Kubernetes

Database Operations - Skill Check

Mental model (bottom-up)

Databases are stateful workloads. They need stable identity, persistent storage, and ordered operations, which is why they run on Kubernetes as StatefulSets rather than Deployments. Production databases also need backup/restore, replication, failover, and connection pooling. Operators automate the hard parts.

Visual stack

[Application Pods    ]  connect via Service DNS
|
[Connection Pooler   ]  PgBouncer multiplexes connections
|
[Read-Write Service  ]  points to primary only
|
[StatefulSet         ]  db-0 (primary), db-1 (replica), db-2 (replica)
|
[PVCs                ]  each pod gets its own persistent volume
|
[WAL Archive / S3    ]  continuous backup for point-in-time recovery

Glossary

StatefulSet - K8s workload giving pods stable names, ordered start/stop, and persistent volumes
headless Service - Service with clusterIP: None that gives each pod its own DNS record (for example db-0.<service>.<namespace>.svc.cluster.local with the default cluster domain)
WAL (Write-Ahead Log) - PostgreSQL transaction log enabling point-in-time recovery
PITR (Point-in-Time Recovery) - restore the database to any moment covered by the WAL archive, using a base backup plus WAL replay
PgBouncer - lightweight connection pooler sitting between apps and PostgreSQL
pg_dump / pg_restore - logical backup/restore (SQL-level, portable)
CloudNativePG - Kubernetes operator automating PostgreSQL lifecycle
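
The per-pod DNS records that a headless Service provides follow a fixed pattern. The sketch below spells it out; the Service name `db-headless`, the namespace `prod`, and the default `cluster.local` cluster domain are assumptions for illustration, not values from this page.

```python
def pod_dns_name(pod: str, service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """Return the DNS name a headless Service publishes for one StatefulSet pod."""
    # Pattern: <pod>.<service>.<namespace>.svc.<cluster-domain>
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def replica_hosts(statefulset: str, replicas: int,
                  service: str, namespace: str) -> list[str]:
    """List the DNS names of every pod in a StatefulSet."""
    # StatefulSet pods are named <statefulset>-<ordinal>, starting at 0.
    return [
        pod_dns_name(f"{statefulset}-{i}", service, namespace)
        for i in range(replicas)
    ]


if __name__ == "__main__":
    # Hypothetical names, used only to show the shape of the result.
    for host in replica_hosts("db", 3, "db-headless", "prod"):
        print(host)
```

Because these names stay the same across rescheduling, replication configuration can point at a specific pod without tracking its IP address.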

Core questions (easy -> hard)

Why StatefulSet and not Deployment for databases?
Stable pod names (db-0, db-1), ordered start/stop, per-pod PVCs that survive rescheduling.
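
The naming and ordering rules in that answer can be sketched as a few small functions. This is a minimal illustration that assumes the default ordered pod management policy and a hypothetical volumeClaimTemplate named `data`.

```python
def pod_names(statefulset: str, replicas: int) -> list[str]:
    """Pods are named <statefulset>-<ordinal>, with ordinals starting at 0."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pvc_name(claim_template: str, pod: str) -> str:
    """Each pod's claim is named <volumeClaimTemplate>-<pod>.

    The name is derived from the pod's stable identity, so a rescheduled
    pod reattaches to the same claim.
    """
    return f"{claim_template}-{pod}"


def startup_order(statefulset: str, replicas: int) -> list[str]:
    """With ordered pod management, pods start from ordinal 0 upward."""
    return pod_names(statefulset, replicas)


def shutdown_order(statefulset: str, replicas: int) -> list[str]:
    """Pods are terminated in reverse ordinal order."""
    return list(reversed(pod_names(statefulset, replicas)))


if __name__ == "__main__":
    for pod in startup_order("db", 3):
        # "data" is a hypothetical volumeClaimTemplate name.
        print(pod, "->", pvc_name("data", pod))
```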
How do you back up a database in Kubernetes?
pg_dump CronJob (simple), WAL archiving to S3 (production), operator-managed (best).
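
The simplest option, a pg_dump CronJob, might look roughly like the manifest generated below. It is a sketch, not a hardened backup job: the image tag `postgres:16`, the Secret `db-credentials`, the mount path `/backup`, and every name passed in are assumptions for illustration.

```python
import json


def pg_dump_cronjob(name: str, schedule: str, db_host: str, db_name: str,
                    db_user: str, backup_claim: str) -> dict:
    """Build a batch/v1 CronJob manifest that runs pg_dump on a schedule."""
    dump_cmd = [
        "pg_dump",
        f"--host={db_host}",
        f"--username={db_user}",
        f"--dbname={db_name}",
        "--format=custom",          # compressed archive, restorable with pg_restore
        "--file=/backup/dump.pgc",  # fixed name: each run overwrites the last dump
    ]
    pod_spec = {
        "restartPolicy": "OnFailure",
        "containers": [{
            "name": "pg-dump",
            "image": "postgres:16",  # assumed image; match the server's major version
            "command": dump_cmd,
            "env": [{
                "name": "PGPASSWORD",  # read by libpq-based clients such as pg_dump
                "valueFrom": {"secretKeyRef": {"name": "db-credentials",
                                               "key": "password"}},
            }],
            "volumeMounts": [{"name": "backup", "mountPath": "/backup"}],
        }],
        "volumes": [{
            "name": "backup",
            "persistentVolumeClaim": {"claimName": backup_claim},
        }],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",  # do not start a dump while one is running
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


if __name__ == "__main__":
    manifest = pg_dump_cronjob("db-backup", "0 2 * * *", "db-rw", "app",
                               "backup", "db-backups")
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

A logical dump like this only captures the state at the time it runs, so it cannot provide point-in-time recovery on its own.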
What's the difference between logical and physical backups?
Logical (pg_dump): SQL-level, portable, slow for large DBs. Physical (pg_basebackup + WAL): file-level, fast, supports PITR.
Why use a connection pooler?
PostgreSQL forks a backend process per connection. 500 pods x 5 conns = 2500 backend processes without a pooler. PgBouncer multiplexes those client connections onto ~20 server connections.
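
The arithmetic in that answer, written out as a small sketch. The numbers are the ones from the answer; choosing `transaction` pool mode is an assumption made here, not something this page prescribes.

```python
def client_connections(pods: int, conns_per_pod: int) -> int:
    """Total connections the application tier opens."""
    return pods * conns_per_pod


def multiplex_ratio(client_conns: int, server_pool_size: int) -> float:
    """How many client connections share each PostgreSQL backend on average."""
    return client_conns / server_pool_size


def pgbouncer_settings(client_conns: int, server_pool_size: int) -> dict:
    """Relevant PgBouncer settings for this fan-in, as a plain dict."""
    return {
        # Assumed mode: a server connection is held only for one transaction.
        "pool_mode": "transaction",
        # Upper bound on client connections PgBouncer will accept.
        "max_client_conn": client_conns,
        # Server connections allowed per user/database pair.
        "default_pool_size": server_pool_size,
    }


if __name__ == "__main__":
    clients = client_connections(pods=500, conns_per_pod=5)
    print(clients, "client connections")
    print(multiplex_ratio(clients, 20), "clients per backend")
    print(pgbouncer_settings(clients, 20))
```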
How does failover work with an operator?
The operator detects the primary failure, promotes the most up-to-date replica, reconfigures the remaining replicas to follow it, and updates the Service endpoints. RTO: ~10-30s.
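
One way to read "most up-to-date replica" is the replica with the highest WAL position (LSN). The sketch below illustrates only that comparison and is not how any particular operator implements failover; real operators also weigh health and replication state. The pod names and LSN values are made up, and the LSNs are assumed to have been collected from each replica beforehand, for example with `pg_last_wal_replay_lsn()`.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000148' to a comparable integer.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(replica_lsns: dict[str, str]) -> str:
    """Return the replica that has the most WAL, i.e. the highest LSN."""
    if not replica_lsns:
        raise ValueError("no replicas available for promotion")
    return max(replica_lsns, key=lambda pod: lsn_to_int(replica_lsns[pod]))


if __name__ == "__main__":
    # Hypothetical positions reported by two replicas.
    print(pick_promotion_candidate({"db-1": "0/A000000", "db-2": "0/10000000"}))
```

Comparing LSNs as strings would pick the wrong replica in the example above, which is why the sketch converts them to integers first.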
You need to restore to 5 minutes ago. Walk through it.
Stop app traffic. Restore the latest base backup taken before the target time. Set recovery_target_time. Replay WAL segments up to that time. Promote. Verify data.
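
The "configure recovery_target_time" step can be sketched as generating the recovery settings for the restored data directory. This assumes PostgreSQL 12 or later, where these settings live in the server configuration and an empty `recovery.signal` file in the data directory starts targeted recovery. The archive path in `restore_command` is hypothetical; an S3 archive would use whatever fetch tool the backup setup provides.

```python
from datetime import datetime, timedelta, timezone


def five_minutes_before(now: datetime) -> datetime:
    """Recovery target for 'restore to 5 minutes ago'."""
    return now - timedelta(minutes=5)


def recovery_settings(target: datetime, restore_command: str) -> str:
    """Render the PostgreSQL settings for a point-in-time recovery.

    `target` must be timezone-aware; it is written out in UTC.
    """
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    lines = [
        # How the server fetches archived WAL segments (%f = file, %p = path).
        f"restore_command = '{restore_command}'",
        # Replay stops once this timestamp is reached.
        f"recovery_target_time = '{stamp}'",
        # Leave recovery and accept writes once the target is reached.
        "recovery_target_action = 'promote'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    target = five_minutes_before(datetime.now(timezone.utc))
    # Hypothetical local archive path, for illustration only.
    print(recovery_settings(target, "cp /wal-archive/%f %p"))
```

Setting `recovery_target_action` to `pause` instead lets you inspect the data before promoting, which fits the "verify data" step when you are unsure of the exact target.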
PVC is 95% full. What do you do?
Patch PVC to larger size (StorageClass must allow expansion). May need pod restart for filesystem resize. PVCs can never shrink.
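
A minimal sketch of building the PVC patch while refusing any request that would shrink the volume. It handles only plain integers and the binary suffixes Ki, Mi, Gi, and Ti, which is a simplification of Kubernetes quantity syntax. The sizes used are examples.

```python
import json

# Binary suffixes only; a deliberate simplification of Kubernetes quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '100Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def expansion_patch(current: str, requested: str) -> str:
    """Return a JSON patch body that grows a PVC's storage request."""
    if to_bytes(requested) <= to_bytes(current):
        raise ValueError("PVCs can only grow; requested size must exceed current size")
    return json.dumps({"spec": {"resources": {"requests": {"storage": requested}}}})


if __name__ == "__main__":
    print(expansion_patch("100Gi", "200Gi"))
```

The resulting JSON can be passed to `kubectl patch pvc <name> -p '<patch>'`. The expansion only proceeds if the claim's StorageClass sets `allowVolumeExpansion: true`.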

Prerequisites

Database Operations on Kubernetes (Topic Pack, L2)

AWS Database Flashcards (CLI) (flashcard_deck, L1) — Database Operations
Database Operations Flashcards (CLI) (flashcard_deck, L1) — Database Operations
Database Operations on Kubernetes (Topic Pack, L2) — Database Operations
Database Ops Drills (Drill, L2) — Database Operations
Interview: Database Failover During Deploy (Scenario, L3) — Database Operations
PostgreSQL Operations (Topic Pack, L2) — Database Operations
Redis Operations (Topic Pack, L2) — Database Operations
SQL Fundamentals (Topic Pack, L0) — Database Operations
SQLite Operations & Internals (Topic Pack, L2) — Database Operations
