Portal | Level: L2: Operations | Topics: Database Operations | Domain: Kubernetes
Database Operations - Skill Check¶
Mental model (bottom-up)¶
Databases are stateful workloads. They need stable identity, persistent storage, and ordered operations — which is why Kubernetes uses StatefulSets (not Deployments). Production databases need backup/restore, replication, failover, and connection pooling. Operators automate the hard parts.
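To make the StatefulSet properties concrete, here is a minimal sketch that builds a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The name `db`, the image tag `postgres:16`, the `20Gi` size, and the mount path are illustrative assumptions, and required settings such as the database password are omitted.

```python
import json


def statefulset_manifest(name: str, replicas: int, image: str, storage: str) -> dict:
    """Build a minimal StatefulSet manifest as a plain dict (illustrative only)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Name of the headless Service that provides per-pod DNS.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "postgres",
                        "image": image,  # assumed image tag
                        "ports": [{"containerPort": 5432}],
                        "volumeMounts": [{
                            "name": "data",
                            "mountPath": "/var/lib/postgresql/data",
                        }],
                    }],
                },
            },
            # One PVC is created per pod from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


manifest = statefulset_manifest("db", 3, "postgres:16", "20Gi")

# Stable identities: pods are <name>-<ordinal>, PVCs are <template>-<pod>.
pod_names = [f"{manifest['metadata']['name']}-{i}" for i in range(manifest["spec"]["replicas"])]
pvc_names = [f"data-{pod}" for pod in pod_names]

print(json.dumps(manifest, indent=2))
```

The derived `pod_names` and `pvc_names` show why a rescheduled pod finds its data again: both names are deterministic, so the replacement `db-1` reattaches to `data-db-1`.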
Visual stack¶
[Application Pods ] connect via Service DNS
|
[Connection Pooler ] PgBouncer multiplexes connections
|
[Read-Write Service ] points to primary only
|
[StatefulSet ] db-0 (primary), db-1 (replica), db-2 (replica)
|
[PVCs ] each pod gets its own persistent volume
|
[WAL Archive / S3 ] continuous backup for point-in-time recovery
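The pooler layer in the stack is easiest to justify with arithmetic. The sketch below reuses the figures from the core questions further down (500 pods, 5 connections each, roughly 20 server-side connections); the function names are made up for illustration.

```python
def client_connections(pods: int, conns_per_pod: int) -> int:
    """Connections the applications would open without a pooler."""
    return pods * conns_per_pod


def multiplex_ratio(client_conns: int, server_pool_size: int) -> float:
    """How many client connections share one PostgreSQL backend on average."""
    return client_conns / server_pool_size


clients = client_connections(pods=500, conns_per_pod=5)
ratio = multiplex_ratio(clients, server_pool_size=20)

print(f"client connections: {clients}")
print(f"clients per backend: {ratio:.0f}")
```

Because PostgreSQL forks one backend process per connection, the pool size, not the pod count, determines how many processes the primary has to run.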
Glossary¶
- StatefulSet - K8s workload giving pods stable names, ordered start/stop, and persistent volumes
- headless Service - Service with clusterIP: None, providing a stable DNS name per pod (e.g. db-0.<service>.<namespace>.svc.cluster.local)
- WAL (Write-Ahead Log) - PostgreSQL transaction log enabling point-in-time recovery
- PITR (Point-in-Time Recovery) - restore database to any point in time covered by the archived WAL, using base backup + WAL replay
- PgBouncer - lightweight connection pooler sitting between apps and PostgreSQL
- pg_dump / pg_restore - logical backup/restore (SQL-level, portable)
- CloudNativePG - Kubernetes operator automating PostgreSQL lifecycle
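As a small illustration of the headless Service entry, this sketch derives the per-pod DNS names that Kubernetes publishes for a StatefulSet. The Service name `db-headless`, the namespace `prod`, and the default cluster domain `cluster.local` are assumptions for the example.

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Return the stable DNS name of each pod behind a headless Service.

    Pattern: <pod-name>.<service>.<namespace>.svc.<cluster-domain>
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


names = pod_dns_names("db", "db-headless", "prod", replicas=3)
for name in names:
    print(name)
```

Replicas can use these names to reach the primary directly, which a regular load-balanced Service address would not allow.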
Core questions (easy -> hard)¶
- Why StatefulSet and not Deployment for databases?
- Stable pod names (db-0, db-1), ordered start/stop, per-pod PVCs that survive rescheduling.
- How do you back up a database in Kubernetes?
- pg_dump CronJob (simple), WAL archiving to S3 (production), operator-managed (best).
- What's the difference between logical and physical backups?
- Logical (pg_dump): SQL-level, portable, slow for large DBs. Physical (pg_basebackup + WAL): file-level, fast, supports PITR.
- Why use a connection pooler?
- PostgreSQL forks a process per connection. 500 pods x 5 conns = 2500 backend connections. PgBouncer multiplexes to ~20.
- How does failover work with an operator?
- Operator detects primary failure, promotes most current replica, reconfigures remaining replicas, updates Service endpoints. RTO: ~10-30s.
- You need to restore to 5 minutes ago. Walk through it.
- Stop app traffic. Restore the latest base backup taken before the target time. Set recovery_target_time. Replay WAL segments up to that time. Promote. Verify data.
- PVC is 95% full. What do you do?
- Patch PVC to larger size (StorageClass must allow expansion). May need pod restart for filesystem resize. PVCs can never shrink.
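The point-in-time restore walk-through above can be sketched in two steps: pick the newest base backup that finished before the target, then generate the recovery settings. `recovery_target_time`, `restore_command`, and `recovery_target_action` are real PostgreSQL settings; the backup timestamps and the archive path `/mnt/wal-archive` are hypothetical, and an S3-backed archive would use the fetch command of whatever backup tool is in place.

```python
from datetime import datetime, timedelta, timezone


def pick_base_backup(backup_end_times: list[datetime], target: datetime) -> datetime:
    """Return the newest base backup that finished at or before the target."""
    candidates = [t for t in backup_end_times if t <= target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    return max(candidates)


def recovery_settings(target: datetime, restore_command: str) -> str:
    """Render the settings PostgreSQL 12+ reads from postgresql.conf."""
    stamp = target.isoformat(sep=" ", timespec="seconds")
    return "\n".join([
        f"restore_command = '{restore_command}'",
        f"recovery_target_time = '{stamp}'",
        "recovery_target_action = 'promote'",
    ])


# Fixed timestamps so the example is reproducible (all hypothetical).
now = datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc)
target = now - timedelta(minutes=5)
backups = [
    datetime(2024, 4, 30, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 12, 3, tzinfo=timezone.utc),  # after the target, unusable
]

chosen = pick_base_backup(backups, target)
settings = recovery_settings(target, "cp /mnt/wal-archive/%f %p")

print(f"restore base backup finished at: {chosen.isoformat()}")
print(settings)
```

On PostgreSQL 12 and later, an empty `recovery.signal` file in the data directory is also needed so the server starts in targeted recovery mode. The backup taken at 12:03 is skipped because replay can only move forward from a base backup, never backward.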
Wiki Navigation¶
Prerequisites¶
- Database Operations on Kubernetes (Topic Pack, L2)
Related Content¶
- AWS Database Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations Flashcards (CLI) (flashcard_deck, L1) — Database Operations
- Database Operations on Kubernetes (Topic Pack, L2) — Database Operations
- Database Ops Drills (Drill, L2) — Database Operations
- Interview: Database Failover During Deploy (Scenario, L3) — Database Operations
- PostgreSQL Operations (Topic Pack, L2) — Database Operations
- Redis Operations (Topic Pack, L2) — Database Operations
- SQL Fundamentals (Topic Pack, L0) — Database Operations
- SQLite Operations & Internals (Topic Pack, L2) — Database Operations