Databases

25 cards — 🟢 9 easy | 🟡 9 medium | 🔴 7 hard

🟢 Easy (9)

1. What is database sharding and when is it needed?

Sharding is horizontal partitioning: splitting data across multiple database instances, each holding a subset of rows. It is needed when a single database server cannot handle the write throughput, storage volume, or query load.

Remember: sharding = horizontal partitioning across multiple database instances. Each shard holds a subset of data. Key challenge: choosing the shard key to avoid hotspots.

Gotcha: cross-shard JOINs and transactions become very expensive or impossible. Design your shard key so most queries hit a single shard.

Example: shard by user_id — all of user 42's data lives on one shard. Queries for user 42 hit one database, not all of them.
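
The routing logic can be sketched in a few lines of Python (shard names and count are illustrative, and plain modulo hashing is only one possible scheme):

```python
# Hypothetical shard routing: shard names and count are made up for illustration.
NUM_SHARDS = 4
SHARDS = [f"db-shard-{i}" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    """Route every query for a given user to the same shard."""
    return SHARDS[user_id % NUM_SHARDS]

print(shard_for(42))  # db-shard-2: user 42's data always lands on the same shard
```

Note that modulo hashing remaps most keys when the shard count changes; consistent hashing is the usual mitigation.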

2. What is a connection leak?

A connection leak occurs when a database connection is created but never closed after it is no longer needed.

Debug clue: monitor connection count in pg_stat_activity (PostgreSQL) or SHOW PROCESSLIST (MySQL). A steadily growing number of idle connections is a leak.

Gotcha: connection leaks eventually exhaust max_connections, causing new connections to fail with 'too many connections' errors — often during peak traffic.

Example: in Python, always use context managers: with conn.cursor() as cur: ... or connection pool libraries that auto-reclaim connections.
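
A minimal sketch of the pattern, using Python's built-in sqlite3 as a stand-in for a real driver (a common gotcha: sqlite3's `with conn:` manages transactions, not closing):

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if an exception is raised
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # this inner block commits or rolls back; it does NOT close
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
# conn is closed here: no leak, regardless of errors above
```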

3. What is a connection pool?

A connection pool is a cache of database connections, kept open and reused to avoid the overhead of establishing a new connection for every query sent to the database.

Number anchor: a PostgreSQL connection costs ~10MB of RAM. Without pooling, 100 microservice replicas each opening 10 connections = 1000 connections = 10GB of PostgreSQL memory just for connections.

Example: PgBouncer (PostgreSQL) and ProxySQL (MySQL) are popular external connection poolers that multiplex thousands of app connections over a few dozen database connections.
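
The core idea behind any pooler can be sketched with a thread-safe queue (a toy version, with sqlite3 standing in for a real database driver):

```python
import queue
import sqlite3

class ConnectionPool:
    """Toy pool: pre-opens connections and hands them out for reuse."""

    def __init__(self, size: int, dsn: str = ":memory:"):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self) -> sqlite3.Connection:
        return self._pool.get()   # blocks when every connection is in use

    def release(self, conn: sqlite3.Connection) -> None:
        self._pool.put(conn)      # return for reuse instead of reconnecting

pool = ConnectionPool(size=2)
conn = pool.acquire()
print(conn.execute("SELECT 1").fetchone())  # (1,)
pool.release(conn)
```

Real poolers additionally handle health checks, timeouts, and broken connections.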

Remember: database fundamentals (ACID, indexing, normalization, query optimization) transfer across all database systems. Master the concepts, then learn vendor-specific syntax.

4. What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of an organisation's decision-making process.

5. What is a relational database?

* Data storage: a system that stores data in tables (relations) of rows and columns
* SQL: the standard language used to manage relational databases
* DDL (Data Definition Language): the standard syntax for creating, altering, and deleting tables

Name origin: 'Relational' comes from E.F. Codd's 1970 paper introducing the relational model, where data is organized into 'relations' (tables).

Example: popular relational databases — PostgreSQL (open source, feature-rich), MySQL/MariaDB (web-scale), SQLite (embedded), Oracle (enterprise), SQL Server (Microsoft).

6. What is an index in a database?

A database index is a data structure that improves the speed of operations in a table. Indexes can be created using one or more columns, providing the basis for both rapid random lookups and efficient ordering of access to records.

Remember: an index is a data structure (usually B-tree) that speeds up lookups at the cost of slower writes and extra storage. Like a book's index — you don't read every page to find a topic.
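
The effect is easy to see in a query plan; a quick sketch using SQLite's `EXPLAIN QUERY PLAN` (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(1000)])

query = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
print(conn.execute(query).fetchone()[3])  # reports a full scan of users

conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan = conn.execute(query).fetchone()[3]
print(plan)  # reports a search using idx_users_email instead of scanning every row
```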

7. What type of databases are you familiar with?

* Relational (SQL)
* NoSQL
* Time series

Remember: the major categories — Relational (PostgreSQL, MySQL), Document (MongoDB), Key-Value (Redis, DynamoDB), Column-family (Cassandra), Graph (Neo4j), Time-series (InfluxDB, TimescaleDB).

Interview tip: mention specific databases you have used and why — e.g., 'PostgreSQL for ACID transactions, Redis for caching, TimescaleDB for metrics.'

Remember: choose your database based on access patterns, not brand loyalty. Relational for transactions, document for flexible schemas, graph for relationships, time-series for metrics.

Gotcha: 'NoSQL' does not mean 'no schema' — it means schema is enforced at the application level instead of the database level. This shifts responsibility to developers.

8. What is DDL (Data Definition Language) in SQL?

[Wikipedia](https://en.wikipedia.org/wiki/Data_definition_language): "In the context of SQL, data definition or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users."

9. When would you choose a NoSQL database over a relational database?

Choose NoSQL when:

* Schema is highly variable or evolving rapidly (document stores like MongoDB)
* You need horizontal write scaling (Cassandra, DynamoDB)
* Data is naturally key-value or graph-shaped (Redis, Neo4j)
* You need very low-latency lookups at massive scale
* Eventual consistency is acceptable

Stick with relational when:
* You need strong consistency and complex transactions (ACID)
* Your data has clear relationships and you need JOINs
* You need mature tooling for reporting and analytics

Remember: SQL = structured, ACID, joins, schema. NoSQL = flexible schema, horizontal scaling, eventual consistency. Use SQL for transactions, NoSQL for scale + flexibility.

🟡 Medium (9)

1. What is an ORM? What benefits does it provide when working with relational databases?

[Wikipedia](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping): ORM "is a programming technique for converting data between incompatible type systems using object-oriented programming languages"

For relational databases, an ORM provides:

* Database as code
* Database abstraction
* Encapsulates SQL complexity
* Enables code review process
* Enables usage as a native OOP structure
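
The "native OOP structure" point can be illustrated in miniature with sqlite3's `row_factory` (not a real ORM, just the core mapping idea; the `User` class is hypothetical):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Every result row is converted into a User object instead of a bare tuple
conn.row_factory = lambda cursor, row: User(*row)
user = conn.execute("SELECT id, name FROM users").fetchone()
print(user.name)  # attribute access, not row[1]
```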

2. What types of data tables have you used?

* Primary data table: the main data you care about
* Details table: includes a foreign key and has a one-to-many relationship with the primary table
* Lookup values table: either one table per lookup, or a single table containing all lookups; has a one-to-many relationship
* Multi-reference (junction) table: connects two tables in a many-to-many relationship

3. Explain Normalization

Data that is used multiple times in a database should be stored once and referenced with a foreign key.
This has the clear benefit of ease of maintenance where you need to change a value only in a single place to change it everywhere.

Remember: normalization forms: 1NF (atomic values), 2NF (no partial dependencies), 3NF (no transitive dependencies). Mnemonic: 'The key, the whole key, and nothing but the key.'
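
A small sketch of the "change it in one place" benefit (schema and names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department_id INTEGER REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Engineering');
    INSERT INTO employees VALUES (10, 'alice', 1), (11, 'bob', 1);
""")

# The department name lives in exactly one row, so renaming it is one UPDATE
conn.execute("UPDATE departments SET name = 'Platform' WHERE id = 1")
rows = conn.execute("""
    SELECT e.name, d.name FROM employees e
    JOIN departments d ON d.id = e.department_id ORDER BY e.id
""").fetchall()
print(rows)  # [('alice', 'Platform'), ('bob', 'Platform')]
```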

4. Explain Primary Key and Foreign Key

Primary Key: each row in every table should have a unique identifier that represents the row.
Foreign Key: a reference to another table's primary key. This allows you to join tables together to retrieve all the information you need without duplicating data.
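
A sketch of a foreign key doing its job, using sqlite3 (which enforces foreign keys only when the pragma is enabled; table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Codd');
    INSERT INTO books VALUES (1, 'A Relational Model of Data', 1);
""")

try:
    # author 999 does not exist, so the reference is rejected
    conn.execute("INSERT INTO books VALUES (2, 'Orphan', 999)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```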

5. What is Time Series database?

A database designed specifically for time-series data. Time-series data consists of data points indexed by time, such as metrics, logs, and sensor readings.

Key optimizations:

* **Time-based partitioning**: Data is automatically partitioned by time intervals, enabling efficient queries over specific time ranges
* **Compression**: Specialized compression algorithms that exploit the temporal nature of data (e.g., delta encoding, run-length encoding)

6. Explain database replication and when you would use it in production.

Database replication copies data from one database server (primary/master) to one or more replicas. Use cases:

* Read scaling: route read queries to replicas to reduce primary load
* High availability: promote a replica if the primary fails
* Disaster recovery: maintain replicas in another region/AZ
* Reporting: run heavy analytics queries against a replica

Replication modes:
* Synchronous: primary waits for replica acknowledgment (stronger consistency, higher latency)
* Asynchronous: primary does not wait (lower latency, risk of data loss on failover)
* Semi-synchronous: primary waits for at least one replica

7. What strategies exist for database backup and recovery?

Common strategies:

* Logical backups (pg_dump, mysqldump): SQL-level export, portable but slow for large DBs
* Physical backups (pg_basebackup, xtrabackup): file-level copy, faster restore
* WAL/binlog archiving: continuous archiving enables point-in-time recovery (PITR)
* Snapshots: storage-level snapshots (EBS, ZFS) for near-instant backup

Best practices:
* Test restores regularly
* Store backups off-site or in a different region
* Monitor backup job success/failure
* Document RTO and RPO targets
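
The "test restores regularly" point in miniature, using sqlite3's online backup API as a small stand-in for tools like pg_basebackup:

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (1), (2), (3)")
src.commit()

dst = sqlite3.connect(":memory:")  # in practice: a file in another region/volume
src.backup(dst)                    # consistent copy while src remains usable

# Verify the restore target actually contains the data
print(dst.execute("SELECT COUNT(*) FROM t").fetchone())  # (3,)
```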

8. How should database schema migrations be handled in a DevOps workflow?

Best practices:

* Use a migration tool (Flyway, Alembic, Liquibase) to version schema changes
* Store migrations in version control alongside application code
* Make migrations idempotent and reversible when possible
* Run migrations as part of CI/CD pipeline, before application deployment
* Avoid destructive changes in production (prefer additive: add column, backfill, remove old)
* Test migrations against a copy of production data
* Use transactions for DDL where supported (PostgreSQL does, MySQL mostly does not)
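
The bullets above can be sketched as a toy migration runner (real tools like Flyway or Alembic do this far more robustly; the `schema_version` table name is an assumption):

```python
import sqlite3

# Versioned, additive migrations; kept in version control alongside app code
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            with conn:  # each migration commits atomically with its version record
                conn.execute(sql)
                conn.execute("INSERT INTO schema_version VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: already-applied versions are skipped
```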

9. What database metrics should you monitor in production?

Key metrics by category:

* Connections: active connections, pool utilization, connection wait time
* Performance: query latency (p50/p95/p99), queries per second, slow query count
* Resources: CPU, memory/buffer cache hit ratio, disk I/O (IOPS, latency)
* Replication: replica lag, replication slot status
* Storage: table size growth, WAL/binlog size, disk space remaining
* Locks: lock wait time, deadlock count, long-running transactions

Tools: pg_stat_statements, Performance Schema (MySQL), CloudWatch RDS, postgres_exporter

🔴 Hard (7)

1. You find out your database became a bottleneck and users experience issues accessing data. How can you deal with such a situation?

Not much information is provided about why it became a bottleneck or what the current architecture looks like, so one general approach is
to reduce the load on the database by moving frequently accessed data into an in-memory structure (a cache such as Redis).
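
A minimal sketch of that idea with a process-local cache (in production this would more likely be Redis or memcached; the table and function names are illustrative):

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")

@lru_cache(maxsize=1024)  # repeated reads are served from memory, not the database
def get_setting(key: str) -> str:
    row = conn.execute("SELECT value FROM settings WHERE key = ?", (key,)).fetchone()
    return row[0]

for _ in range(1000):
    get_setting("theme")
print(get_setting.cache_info())  # 1 miss (one real query), 999 cache hits
```

The hard part is invalidation: cached values go stale when the underlying row changes.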

2. What does it mean when a database is ACID compliant?

ACID ensures reliable database transactions:

- **Atomicity**: Transactions succeed or fail as a whole — no partial updates
- **Consistency**: Every transaction moves the DB from one valid state to another, enforced by constraints
- **Isolation**: Concurrent transactions don't see each other's intermediate states
- **Durability**: Committed data survives crashes (written to non-volatile storage)

Relational (SQL) databases are ACID-compliant by design. Some NoSQL databases (e.g., MongoDB 4.0+) support ACID transactions, but most NoSQL systems trade ACID guarantees for performance and scalability.

Remember: ACID = Atomicity (all or nothing), Consistency (valid state), Isolation (concurrent transactions don't interfere), Durability (committed data survives crashes). Mnemonic: 'All Changes Isolated Durably.'
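
Atomicity in particular is easy to demonstrate; a sketch with sqlite3, where a failed transfer leaves both balances untouched (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER CHECK (balance >= 0)
)""")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE id = 2")
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")  # violates CHECK
except sqlite3.IntegrityError:
    pass

# The credit to account 2 was rolled back along with the failed debit
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())  # [(100,), (0,)]
```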

3. Your database performs more slowly than usual. More specifically, your queries are taking a long time. What would you do?

* Query for the currently running queries and cancel the irrelevant ones
* Check for connection leaks (query for running connections and include their IP)
* Check for table locks and kill irrelevant locking sessions

4. What is the CAP theorem and how does it apply to database selection?

The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously:

* Consistency: every read returns the most recent write
* Availability: every request receives a response
* Partition tolerance: the system continues operating despite network partitions

In practice, network partitions are unavoidable, so the trade-off is between consistency and availability during a partition.
CP: etcd, ZooKeeper, HBase
AP: Cassandra, DynamoDB (eventual consistency)
CA: single-node PostgreSQL (no partition tolerance)

Remember: CAP = Consistency, Availability, Partition tolerance. Pick 2 of 3 during a network partition. CP = consistent but may reject requests. AP = available but may return stale data.

5. How do you diagnose and fix a slow query in PostgreSQL?

Diagnostic steps:

1. Enable slow query logging: set log_min_duration_statement (e.g., 500ms)
2. Use EXPLAIN ANALYZE to see the actual execution plan
3. Look for sequential scans on large tables, nested loops, inaccurate row estimates
4. Check pg_stat_statements for aggregate query performance

Common fixes:
* Add missing indexes (check pg_stat_user_tables for seq_scan counts)
* Rewrite queries to avoid unnecessary joins or subqueries
* Run ANALYZE to update planner statistics
* Tune work_mem, shared_buffers, effective_cache_size
* Consider partitioning for very large tables

6. Describe a high-availability PostgreSQL setup and its failure modes.

Typical HA setup:

* Primary + synchronous standby (or Patroni-managed cluster)
* Connection pooler (PgBouncer) in front
* Automated failover via Patroni, repmgr, or cloud-managed (RDS Multi-AZ)

Failure modes:
* Primary crash: standby promoted, clients reconnect (brief downtime)
* Network partition: split-brain risk if fencing misconfigured, use quorum/witness
* Replica lag: synchronous replication prevents data loss but adds latency
* Connection pooler failure: run multiple instances behind a VIP or DNS
* Storage failure: RAID or cloud volume replication protects against single-disk failure

7. Explain transaction isolation levels and when each is appropriate.

The SQL standard defines four levels (weakest to strongest):

* Read Uncommitted: can see uncommitted changes (dirty reads). Rarely used.
* Read Committed: only sees committed data. Default in PostgreSQL. Good for most OLTP.
* Repeatable Read: snapshot at transaction start. Default in MySQL InnoDB. Good for reporting.
* Serializable: transactions behave as if sequential. Highest overhead. Use for financial calculations or inventory where correctness is critical.

Higher isolation = more locking/overhead = lower throughput.
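
A quick sketch of the "no dirty reads" guarantee, using two sqlite3 connections to the same file in WAL mode (the file path is illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")  # readers see the last committed snapshot
writer.execute("CREATE TABLE t (x INTEGER)")
writer.commit()

reader = sqlite3.connect(path)
writer.execute("INSERT INTO t VALUES (1)")  # open, uncommitted transaction

before = reader.execute("SELECT COUNT(*) FROM t").fetchone()
writer.commit()
after = reader.execute("SELECT COUNT(*) FROM t").fetchone()
print(before, after)  # (0,) (1,): uncommitted data was invisible to the reader
```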

Remember: transaction isolation levels (low to high): Read Uncommitted, Read Committed, Repeatable Read, Serializable. Higher isolation = fewer anomalies but more locking/overhead.