
Quiz: Database Internals


14 questions

L1 (7 questions)

1. What are the ACID properties of a database transaction?

Answer: Atomicity: all or nothing. Consistency: transaction leaves DB in valid state. Isolation: concurrent transactions don't interfere. Durability: committed data survives crashes. Together they guarantee reliable transaction processing.
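Atomicity's "all or nothing" can be seen directly with a rollback. A minimal sketch using Python's built-in sqlite3 module; the `accounts` table, names, and amounts are made up for illustration:

```python
import sqlite3

# In-memory database with a simple invariant: total balance is 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, amount):
    """Debit alice, then 'crash' before crediting bob."""
    try:
        with conn:  # commit on success, rollback on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,))
            # Simulate a crash before the matching credit runs.
            raise RuntimeError("crashed mid-transfer")
    except RuntimeError:
        pass  # the debit above was rolled back with the transaction

transfer(conn, 50)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice still has 100: the partial debit never became visible
```

Because the debit and the crash happened inside one transaction, the database rolls back to the last consistent state rather than leaving money half-moved.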

2. Why do database indexes speed up reads but slow down writes?

Answer: Indexes are additional data structures (B-trees, hash maps) that the DB must update on every INSERT/UPDATE/DELETE. More indexes = more write overhead. But queries can use indexes to avoid full table scans, dramatically improving read performance.

3. What causes database lock contention and how do you detect it?

Answer: Lock contention occurs when multiple transactions compete for the same rows/tables. Symptoms: slow queries, timeouts, deadlocks. Detect: pg_stat_activity and pg_locks (PostgreSQL), SHOW PROCESSLIST (MySQL), or SHOW ENGINE INNODB STATUS for deadlock info.

4. What are common operational database mistakes?

Answer:
1. No backups or untested backups.
2. Running ALTER TABLE on a large table during peak hours (locks).
3. Missing indexes on frequently queried columns.
4. Not monitoring replication lag.
5. Storing secrets in plaintext columns.
6. No connection pooling (exhausting max_connections).

5. Your primary database has async replication to a replica with 10 minutes of lag. The primary crashes and you promote the replica. How much committed data could be lost?

Answer: Up to 10 minutes of committed transactions can be lost. With asynchronous replication, the primary acknowledges commits before the replica receives them. The replication lag represents the window of data that exists only on the primary. When the primary crashes, any transactions committed but not yet replicated are lost. This is the fundamental trade-off of async replication: better write performance (no waiting for replica acknowledgment) at the cost of potential data loss during failover. For zero data loss: use synchronous replication (every commit waits for replica confirmation), but this adds write latency equal to the network round-trip time. Semi-sync is a middle ground — waits for at least one replica to acknowledge.
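The data-loss window can be made concrete with a toy model (not a real replication protocol; the classes and transaction names are invented): the primary acknowledges commits immediately, the replica has replayed only a prefix of the log, and everything past that prefix vanishes on failover.

```python
# Toy model of async replication: commits are acknowledged before the
# replica applies them, so a primary crash loses the in-flight window.
class Primary:
    def __init__(self):
        self.log = []          # committed transactions, in commit order

    def commit(self, txn):
        self.log.append(txn)   # acknowledged to the client immediately (async)

class Replica:
    def __init__(self):
        self.applied = []

    def catch_up(self, primary, up_to):
        # The lagging replica has only replayed the first `up_to` entries.
        self.applied = primary.log[:up_to]

primary, replica = Primary(), Replica()
for i in range(100):
    primary.commit(f"txn-{i}")
replica.catch_up(primary, up_to=90)   # 10 transactions behind

# Primary crashes; replica is promoted. Everything past its replay
# position is gone, even though clients saw those commits succeed.
lost = primary.log[len(replica.applied):]
print(len(lost))  # 10 committed-but-unreplicated transactions lost
```

Synchronous replication removes `lost` entirely by refusing to acknowledge a commit until the replica has it, which is exactly where the extra write latency comes from.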

6. Two concurrent transactions deadlock: transaction A locks row 1 then requests row 2, transaction B locks row 2 then requests row 1. The database kills one. How do you prevent this permanently?

Answer: Enforce a consistent lock ordering across all application code — always lock rows in the same order (e.g., by primary key ascending). If transactions A and B both lock row 1 first, then row 2, they can never deadlock: a cycle of waits cannot form, because whichever transaction acquires row 1 proceeds while the other simply waits its turn. Implementation:
1. Sort the rows to be modified by primary key before beginning the transaction.
2. Use SELECT ... FOR UPDATE with ORDER BY to acquire locks in a deterministic sequence.
3. For complex cases, use advisory locks to serialize access to related row groups. Detection: monitor pg_stat_activity for lock waits, set deadlock_timeout (default 1s) to detect deadlocks quickly, and log_lock_waits=on to record waits in the PostgreSQL log.
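The ordering rule can be sketched with plain threading locks standing in for row locks (the row ids are hypothetical; against a real database this role is played by SELECT ... FOR UPDATE with ORDER BY, as above):

```python
import threading

# One lock per "row"; in a real DB these are the row locks themselves.
locks = {1: threading.Lock(), 2: threading.Lock()}

def update_rows(row_ids, results):
    # Sort ids so every transaction acquires locks in the same global
    # order. This is what makes the A-then-B / B-then-A deadlock impossible.
    ordered = sorted(row_ids)
    for rid in ordered:
        locks[rid].acquire()
    try:
        results.append(tuple(ordered))  # "do the updates" while holding locks
    finally:
        for rid in reversed(ordered):
            locks[rid].release()

results = []
# Transaction A wants rows (1, 2); transaction B wants (2, 1) --
# exactly the pattern from the question.
a = threading.Thread(target=update_rows, args=([1, 2], results))
b = threading.Thread(target=update_rows, args=([2, 1], results))
a.start(); b.start()
a.join(); b.join()   # both complete; without sorting this could hang forever
print(sorted(results))
```

Remove the `sorted()` and the two threads can interleave into the classic hold-and-wait cycle; with it, row 1 acts as a gate that serializes the conflicting pair.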

7. All connection pool slots are occupied but pg_stat_activity shows most connections are idle. New requests queue and eventually time out. What is happening and how do you fix it?

Answer: Connection leak: application code acquires connections from the pool but fails to return them (missing connection.close() or context manager). Connections stay open and idle, consuming a pool slot indefinitely. Diagnosis: check pg_stat_activity for connections in 'idle' state with old query_start times — these are leaked connections sitting unused. Fix:
1. Immediate: restart the application to release all connections.
2. Set idle_in_transaction_session_timeout in PostgreSQL to kill sessions stuck inside an open transaction; note it does not touch plain 'idle' connections (PostgreSQL 14+ adds idle_session_timeout for those, though pool-side timeouts are usually the better tool).
3. Configure the pool's maxLifetime and idleTimeout to reclaim unused connections (e.g., HikariCP maxLifetime=30m, idleTimeout=10m).
4. Fix application code to use try-with-resources or context managers.
5. Add pool metrics (active/idle/waiting counts) to your monitoring.
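Point 4 is the root-cause fix. A minimal sketch of the leaky pattern versus the safe one, using Python's built-in sqlite3 in place of a pooled PostgreSQL driver (the function names are made up):

```python
import sqlite3
from contextlib import closing

def leaky_query(db_path):
    conn = sqlite3.connect(db_path)
    return conn.execute("SELECT 1").fetchone()
    # conn is never closed; behind a pool, this slot is now leaked,
    # and it leaks even faster if execute() raises.

def safe_query(db_path):
    # closing() guarantees conn.close() runs even if the query raises --
    # the Python analogue of Java's try-with-resources.
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT 1").fetchone()

row = safe_query(":memory:")
print(row)  # (1,)
```

With a real pool the same shape applies: acquire inside a `with` block (or the pool's own context manager) so the connection is returned on every code path, including exceptions.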

L2 (7 questions)

1. What are common database isolation levels and their tradeoffs?

Answer: Read Uncommitted: fastest, sees dirty reads. Read Committed: no dirty reads, but non-repeatable reads possible. Repeatable Read: consistent reads within transaction, but phantom rows possible. Serializable: full isolation, lowest concurrency. Most apps use Read Committed as default.

2. How do you identify missing indexes in a slow database?

Answer:
1. EXPLAIN/EXPLAIN ANALYZE on slow queries — look for sequential scans on large tables.
2. Check slow query log.
3. pg_stat_user_tables (PostgreSQL) shows seq_scan vs idx_scan ratios.
4. MySQL: performance_schema or SHOW STATUS LIKE 'Handler_read%'.

3. How do you resolve a database deadlock?

Answer: The DB automatically kills one transaction (the victim). To prevent:
1. Access tables/rows in consistent order.
2. Keep transactions short.
3. Use lower isolation level if acceptable.
4. Add appropriate indexes to reduce lock scope.
5. Retry logic in the application for deadlock errors.
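Point 5 deserves a sketch, because deadlock victims are specifically safe to retry: the database already rolled the transaction back, so re-running it from the start is correct. A minimal retry wrapper (the `DeadlockError` class and `flaky_txn` are stand-ins; a real driver raises its own error type, e.g. SQLSTATE 40P01 in PostgreSQL):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver's deadlock error (e.g. SQLSTATE 40P01)."""

def run_with_retry(txn_fn, max_attempts=3, base_delay=0.01):
    # The whole transaction function is retried, never a partial one:
    # the DB rolled the victim back, so starting over is safe.
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Jittered backoff so the two retrying transactions
            # don't immediately collide again.
            time.sleep(base_delay * attempt * (1 + random.random()))

attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 2:
        raise DeadlockError("deadlock detected")
    return "committed"

result = run_with_retry(flaky_txn)
print(result, "after", len(attempts), "attempts")
```

The key constraint is that `txn_fn` must contain the entire transaction, including the BEGIN, so each retry starts from a clean slate.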

4. How do you safely run a schema migration on a production database?

Answer:
1. Test migration on a copy of production data first.
2. Use online DDL tools (pt-online-schema-change, gh-ost for MySQL) to avoid locking.
3. Run during low-traffic window.
4. Have a rollback plan.
5. Monitor replication lag and application errors during migration.
6. Take a backup before starting.

5. EXPLAIN ANALYZE shows your query uses an index scan but runtime has doubled over 3 weeks despite no data growth or query changes. What is the most likely cause in PostgreSQL?

Answer: Table and index bloat from dead tuples. PostgreSQL's MVCC creates new tuple versions for every UPDATE and DELETE. Dead tuples (old versions) accumulate until VACUUM reclaims them. If autovacuum is falling behind or disabled, the table and its indexes grow with dead tuples. Index scans visit more pages because the index points to dead tuples interspersed with live data, and the table's physical size increases (more I/O per scan). Diagnosis: check pg_stat_user_tables for n_dead_tup and last_autovacuum. Fix: run VACUUM ANALYZE on the table. For severe bloat, pg_repack or VACUUM FULL (but VACUUM FULL locks the table). Prevention: tune autovacuum_vacuum_scale_factor and autovacuum_vacuum_cost_delay for tables with high update rates.

6. You need to add a NOT NULL constraint to an existing column that has NULL values. You cannot modify the data and you cannot take downtime. What is the safe migration path?

Answer: Multi-step migration:
1. Add a DEFAULT value for the column (ALTER TABLE ... ALTER COLUMN ... SET DEFAULT 'value' — fast, metadata-only in PostgreSQL 11+).
2. Backfill NULLs in batches: UPDATE table SET col = default WHERE col IS NULL AND id BETWEEN X AND Y (batched to avoid long locks).
3. Add a CHECK constraint as NOT VALID first: ALTER TABLE ... ADD CONSTRAINT ... CHECK (col IS NOT NULL) NOT VALID — this is fast because it doesn't scan existing rows.
4. VALIDATE the constraint: ALTER TABLE ... VALIDATE CONSTRAINT ... — this scans the table but only holds a SHARE UPDATE EXCLUSIVE lock (reads and writes continue). Once validated, ALTER TABLE ... SET NOT NULL is fast in PostgreSQL 12+, because the validated CHECK constraint lets it skip the full-table scan; you can then drop the now-redundant CHECK constraint. This approach avoids the long blocking scan that a bare ALTER TABLE ... SET NOT NULL would otherwise perform while holding an ACCESS EXCLUSIVE lock.
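Step 2's batching is the part most often gotten wrong. A sketch of the loop with SQLite standing in for the production database (the `users` table, `plan` column, and batch size are invented; against PostgreSQL each batch would be its own transaction driven by the migration script):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
conn.executemany("INSERT INTO users (id, plan) VALUES (?, ?)",
                 [(i, None if i % 3 == 0 else "paid") for i in range(1, 1001)])
conn.commit()

BATCH = 100
while True:
    # Each batch is its own short transaction, so row locks are held
    # briefly instead of for one giant table-wide UPDATE.
    with conn:
        cur = conn.execute(
            "UPDATE users SET plan = 'free' WHERE plan IS NULL AND id IN "
            "(SELECT id FROM users WHERE plan IS NULL LIMIT ?)", (BATCH,))
    if cur.rowcount == 0:
        break  # no NULLs left to backfill

remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE plan IS NULL").fetchone()[0]
print(remaining)  # 0: now safe to validate the NOT NULL constraint
```

Running until an empty batch (rather than precomputing ranges) also catches any NULLs written concurrently before the DEFAULT took effect.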

7. Your database supports point-in-time recovery. You take daily full backups at midnight. At 11am, a developer accidentally runs DELETE FROM orders WHERE 1=1. Your RPO is 1 hour. Can you recover to 10:55am?

Answer: Yes, if WAL (Write-Ahead Log) archiving is enabled. PITR works by restoring the most recent full backup (midnight) and replaying WAL segments up to the target time (10:55am). Steps:
1. Restore the midnight base backup to a new instance.
2. Configure recovery_target_time = '2026-03-21 10:55:00' in the recovery configuration.
3. Start PostgreSQL — it replays WAL from midnight to 10:55am, stopping just before the accidental DELETE.
4. Verify the data, then promote the instance. Without WAL archiving, you can only recover to the last full backup (midnight) — 11 hours of data lost, violating the 1-hour RPO. WAL archiving is the critical component that enables PITR granularity down to individual transactions.