
Database Ops


12 cards — 🟢 3 easy | 🟡 6 medium | 🔴 3 hard

🟢 Easy (3)

1. What is the difference between logical and physical database backups?

Logical backups (pg_dump, mysqldump) export SQL statements — portable, slower, allow selective restore. Physical backups (pg_basebackup, xtrabackup) copy data files — faster for large databases, but must be restored to the same major version. Use logical for small DBs and cross-version migration, physical for large DBs and fast recovery.

Remember: pg_dump = logical backup (SQL). pg_basebackup = physical backup (binary). Logical is portable, physical is faster for large databases.
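The two approaches can be sketched as a small runbook — the database name `appdb` and the `/backups` paths are hypothetical:

```shell
# Write a minimal backup runbook; appdb and /backups are placeholder names.
cat > backup.sh <<'EOF'
#!/bin/sh
# Logical: SQL/archive dump, portable across major versions, selective restore.
pg_dump --format=custom --dbname=appdb --file=/backups/appdb.dump
# Selective restore of a single table from the logical dump:
# pg_restore --dbname=appdb --table=orders /backups/appdb.dump
# Physical: binary copy of the data directory, same major version only.
pg_basebackup --pgdata=/backups/base --format=tar --gzip --progress
EOF
chmod +x backup.sh
```

The custom format (`-Fc`) is worth the habit: unlike plain SQL output, it supports parallel and selective restore with pg_restore.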

2. Why do databases need connection pooling?

Each database connection consumes significant memory (PostgreSQL ~10MB per connection). Without pooling, microservices with many replicas quickly exhaust max_connections. Poolers (PgBouncer, ProxySQL) multiplex many application connections over fewer database connections. PgBouncer in transaction mode is the most common PostgreSQL pattern.

Remember: connection pools (PgBouncer, ProxySQL) reuse database connections instead of creating new ones per request. Reduces connection overhead from thousands to dozens.
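A sketch of the PgBouncer transaction-mode pattern described above — host names, pool sizes, and paths are illustrative:

```shell
# Minimal PgBouncer config sketch; names and sizes are illustrative.
cat > pgbouncer.ini <<'EOF'
[databases]
appdb = host=db.internal port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; release server conn at end of each transaction
max_client_conn = 2000         ; app-side connections PgBouncer will accept
default_pool_size = 20         ; actual connections held open to PostgreSQL
EOF
```

Note the trade-off: transaction mode breaks session-level state (prepared statements, SET, advisory locks), which is why it multiplexes so well.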

3. What is Point-in-Time Recovery (PITR) and how does it work?

PITR restores a database to any moment by replaying WAL (Write-Ahead Log) segments on top of a base backup. Take regular base backups and continuously archive WAL segments. To recover: restore the base backup, then replay WAL up to the target timestamp. This is how you recover from accidental data deletion.

Remember: PITR = base backup + archived WAL replayed to a target timestamp. It only works if WAL archiving was already in place before the incident.
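The archive-then-replay flow looks roughly like this for PostgreSQL 12+ — the archive path and target timestamp are placeholders:

```shell
# PITR sketch for PostgreSQL 12+; archive path and timestamp are placeholders.
cat > pitr.conf.snippet <<'EOF'
# On the primary, archive WAL continuously (before you ever need it):
archive_mode = on
archive_command = 'cp %p /wal_archive/%f'
# To recover: restore the base backup into the data directory, then set:
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2024-01-15 09:30:00'
EOF
# PostgreSQL 12+ enters recovery mode when this file exists in the data dir:
touch recovery.signal
```

In practice `cp` is usually replaced by a tool like WAL-G or pgBackRest, but the mechanics are the same.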

🟡 Medium (6)

1. What is the difference between synchronous and asynchronous replication?

Asynchronous: primary commits without waiting for replica acknowledgment — faster writes, but replica can lag (risk of data loss on failover). Synchronous: primary waits for at least one replica to confirm the write — zero data loss, but higher write latency. Choose based on RPO requirements.

Remember: synchronous replication = zero data loss but higher latency. Asynchronous = faster but risk of data loss on failover. Most read replicas use async.
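In PostgreSQL the choice comes down to a couple of settings — the standby name below is hypothetical:

```shell
# Replication durability settings sketch; the standby name is hypothetical.
cat > replication.conf <<'EOF'
# Asynchronous (the default): commit returns without waiting for any standby.
synchronous_standby_names = ''
# Synchronous: wait for the first listed standby to confirm before commit:
# synchronous_standby_names = 'FIRST 1 (replica1)'
# synchronous_commit = on        # wait for remote flush
# synchronous_commit = remote_apply   # also wait for replay (read-your-writes)
EOF
```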

2. What are the top 5 database metrics to monitor?

1) Query latency (p50/p95/p99). 2) Connection count vs max_connections. 3) Replication lag (seconds behind primary). 4) Cache hit ratio (should be >99% for OLTP). 5) Dead tuples / table bloat. Also: disk space, WAL generation rate, and lock wait time.

Remember: the big five are query latency percentiles, connections vs max_connections, replication lag, cache hit ratio, and dead tuples/bloat.
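Metrics 2–5 map directly onto PostgreSQL's statistics views; latency percentiles need pg_stat_statements or an external profiler, so they are omitted here. A sketch of the underlying queries:

```shell
# Sketch of the monitoring queries behind these metrics (PostgreSQL views).
cat > metrics.sql <<'EOF'
-- 2) Connection count vs the configured limit
SELECT count(*) AS connections,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;
-- 3) Replication lag in bytes, per standby (run on the primary)
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
-- 4) Cache hit ratio (aim for > 0.99 on OLTP)
SELECT sum(blks_hit)::float / nullif(sum(blks_hit + blks_read), 0) AS hit_ratio
FROM pg_stat_database;
-- 5) Dead tuples: the biggest bloat candidates
SELECT relname, n_dead_tup FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
EOF
```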

3. How do you run a zero-downtime schema migration?

Use the expand-and-contract pattern: 1) Add new column (no lock on reads). 2) Backfill data in batches. 3) Deploy code that writes to both old and new columns. 4) Deploy code that reads from the new column. 5) Drop the old column. Tools: gh-ost and pt-online-schema-change (MySQL), pg_repack (PostgreSQL table rebuilds). Never run ALTER TABLE ADD COLUMN with a DEFAULT on large tables in PostgreSQL before 11 — it rewrote the entire table; 11+ makes it a metadata-only change.

Remember: database migrations should be backwards-compatible. Add columns (safe), rename columns (unsafe — app breaks). Use expand-contract: add new, migrate data, remove old.
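The five steps above can be sketched in SQL — the table, columns, and batch range are hypothetical:

```shell
# Expand-and-contract sketch; table and column names are hypothetical.
cat > migration.sql <<'EOF'
-- Step 1: expand (metadata-only on PostgreSQL 11+, no table rewrite)
ALTER TABLE users ADD COLUMN email_normalized text;
-- Step 2: backfill in small batches to keep lock and WAL pressure low
UPDATE users SET email_normalized = lower(email)
WHERE id BETWEEN 1 AND 10000 AND email_normalized IS NULL;
-- Steps 3-4 happen in application deploys: dual-write, then read new column
-- Step 5: contract, only once nothing reads the old column
-- ALTER TABLE users DROP COLUMN email;
EOF
```

Keeping step 5 commented out until a full deploy cycle has passed is the cheap insurance that makes the migration reversible.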

4. What is VACUUM in PostgreSQL and why does it matter?

VACUUM reclaims space from dead tuples created by updates and deletes (MVCC). Without it, tables bloat, queries slow down, and you risk transaction ID wraparound (which forces a full-table freeze). Autovacuum runs automatically but may need tuning for high-write tables. VACUUM FULL rewrites the table but takes an exclusive lock.

Remember: VACUUM reclaims dead rows from UPDATE/DELETE (PostgreSQL MVCC). VACUUM FULL rewrites the table (locks it). autovacuum handles this automatically but may need tuning for write-heavy tables.
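A sketch of what "tuning autovacuum for a hot table" looks like — the `events` table is hypothetical:

```shell
# Sketch: find bloated tables, then vacuum a hot table more aggressively.
cat > vacuum_check.sql <<'EOF'
-- Tables with the most dead tuples, and when autovacuum last visited them
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
-- Trigger vacuum at 1% dead tuples instead of the default 20%
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.01);
EOF
```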

5. How do you secure database access in production?

1) No direct access — use a bastion/jump host or IAM-based auth. 2) Enforce TLS for all connections. 3) Separate read-only credentials for reporting. 4) Rotate credentials automatically (Vault, AWS Secrets Manager). 5) Enable audit logging. 6) Restrict network access with security groups/firewall rules. 7) Never use the superuser account for applications.

Remember: production database access means TLS everywhere, least-privilege credentials rotated automatically, audit logging, and never superuser for applications.
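Points 2, 3, and 6 can be enforced at the PostgreSQL host-auth layer; a pg_hba.conf sketch with illustrative role names and CIDRs:

```shell
# pg_hba.conf sketch: TLS-only, least-privilege access; names/CIDRs illustrative.
cat > pg_hba.snippet <<'EOF'
# TYPE     DATABASE  USER          ADDRESS        METHOD
hostssl    appdb     app_rw        10.0.1.0/24    scram-sha-256
hostssl    appdb     reporting_ro  10.0.2.0/24    scram-sha-256
# Reject any connection not using TLS, from anywhere
hostnossl  all       all           0.0.0.0/0      reject
EOF
```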

6. What is replication lag and how do you handle it?

Replication lag is the delay between a write on the primary and its appearance on replicas. Causes: replica hardware slower than primary, heavy write load, long-running queries on the replica, network issues. Mitigation: route critical reads to the primary, monitor lag (pg_stat_replication), alert if lag exceeds SLA, use synchronous replication for zero-lag requirements.

Remember: synchronous replication = zero data loss but higher latency. Asynchronous = faster but risk of data loss on failover. Most read replicas use async.
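Measuring the lag is two queries — one for each side of the replication link:

```shell
# Lag measurement sketch: one query for the primary, one for a replica.
cat > lag.sql <<'EOF'
-- On the primary: per-standby lag at each replication stage (PG 10+)
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- On a replica: wall-clock time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
EOF
```

The replica-side number is the one users feel; it is what most alerting rules threshold on.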

🔴 Hard (3)

1. How does automatic database failover work?

A failover manager (Patroni, Orchestrator, RDS Multi-AZ), usually coordinating through a consensus store such as etcd or ZooKeeper, monitors the primary. If the primary is unreachable past a configurable timeout, the system promotes the most up-to-date replica to primary, updates DNS/endpoints, and redirects connections. Risks: split-brain (prevented by fencing/STONITH) and data loss if an async replica was behind.

Remember: failover = detect, promote the most up-to-date replica, repoint clients. Fencing prevents split-brain; async replication can still lose the last few writes.
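The timeouts that govern "how long until failover starts" live in the failover manager's config. A minimal Patroni sketch, with hypothetical cluster name and etcd endpoint:

```shell
# Minimal Patroni config sketch; cluster name and etcd host are hypothetical.
cat > patroni.yml <<'EOF'
scope: appdb-cluster            # cluster name shared by all members
name: node1
etcd3:
  hosts: 10.0.0.5:2379          # consensus store holding the leader lock
bootstrap:
  dcs:
    ttl: 30                     # leader lock expiry; failover can start after
    loop_wait: 10               # seconds between health-check loops
    retry_timeout: 10
EOF
```

The primary continually renews its leader lock in etcd; if it cannot within `ttl`, the lock expires and a replica may take it, which is what makes split-brain hard by construction.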

2. How do you diagnose a slow query in production?

1) Check pg_stat_activity for long-running queries. 2) Run EXPLAIN ANALYZE on the slow query. 3) Look for sequential scans on large tables (missing index), nested loop joins on large sets, and inaccurate row estimates. 4) Check for lock contention (pg_locks). 5) Check whether autovacuum is keeping statistics current (ANALYZE).

Remember: slow query debugging: EXPLAIN ANALYZE (execution plan), check for sequential scans (missing indexes), look at row estimates vs. actuals (stale statistics).
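Steps 1, 2, and 5 as concrete queries — the `orders` table and the filter are illustrative:

```shell
# Slow-query triage sketch; the example query and table are illustrative.
cat > triage.sql <<'EOF'
-- 1) Who is running long? (elapsed time, state, query text)
SELECT pid, now() - query_start AS elapsed, state, query
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY elapsed DESC;
-- 2) Execution plan with real row counts and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Red flags: Seq Scan on a large table, estimated rows far from actual,
-- Nested Loop over large sets. If estimates are stale, refresh statistics:
ANALYZE orders;
EOF
```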

3. What is the difference between vertical and horizontal database scaling?

Vertical: a bigger machine (more CPU, RAM, faster disks) — simple but has limits. Horizontal: distribute data across multiple nodes via sharding (partition by key), read replicas (offload reads), or distributed databases (CockroachDB, Vitess). Horizontal adds operational complexity but scales further. Most teams should optimize queries and add read replicas before sharding.

Remember: scale up first (simple), then add read replicas, and shard only when necessary; horizontal scaling adds permanent operational complexity.