Db Read Replicas¶

18 cards — 🟢 4 easy | 🟡 8 medium | 🔴 6 hard

🟢 Easy (4)¶

1. What is the primary purpose of read replicas in a database architecture?

Show answer

To scale read traffic by offloading read queries from the primary to one or more replica nodes, reducing load on the primary and improving overall throughput for read-heavy workloads.

2. How do you check if a PostgreSQL instance is currently running as a replica?

Show answer

SELECT pg_is_in_recovery(); — returns true if the instance is in recovery mode (i.e., running as a replica), false if it is the primary.

3. How do you measure replication delay in seconds on a PostgreSQL replica?

Show answer

SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay; This shows the time since the last replayed transaction from the primary.

4. What are recovery conflicts on a PostgreSQL replica and how do you handle them?

Show answer

Recovery conflicts occur when a query running on a replica blocks WAL replay (e.g., the primary vacuums a row the replica query is reading). PostgreSQL cancels the query after max_standby_streaming_delay (default 30s). Handle by increasing the delay, using hot_standby_feedback = on (tells primary not to vacuum rows replicas need), or accepting occasional query cancellations.

🟡 Medium (8)¶

1. Name three tools used for connection routing between a PostgreSQL primary and read replicas.

Show answer

PgBouncer (connection pooler, does not route by query type), Pgpool-II (query-aware load balancer that routes SELECTs to replicas), and HAProxy (TCP-level load balancing with health checks).

2. What is the read-after-write consistency problem with read replicas?

Show answer

When a user writes data to the primary and then immediately reads from a replica, the replica may not have replayed the write yet due to replication lag. The user sees stale data or gets a 404 for a record they just created. Fix by routing read-your-own-writes to the primary or a synchronous replica.

3. When should reads go to the primary vs an async replica?

Show answer

Reads requiring strong consistency (authentication, payments, inventory checks) should go to the primary or a synchronous replica. Reads tolerating eventual consistency (dashboards, search, analytics, reporting) can safely go to async replicas.

4. What are typical alerting thresholds for replication lag on read replicas?

Show answer

Warning at replication lag > 5 seconds (sustained for 2 minutes), Critical at lag > 30 seconds or replica disconnected (for 1 minute), Page when replica count drops below the minimum healthy count (e.g., fewer than 2 of 3 replicas connected).

5. What are two ways to measure replication lag and what are their limitations?

Show answer

Time-based: SELECT now() - pg_last_xact_replay_timestamp(). Limitation: shows false lag when the primary is idle (no new transactions to replay).
Byte-based: compare pg_current_wal_lsn() on primary with pg_last_wal_replay_lsn() on replica. Limitation: bytes don't directly translate to time — a large transaction may inflate byte lag without meaning the replica is far behind.

6. How does Pgpool-II route queries differently from PgBouncer?

Show answer

Pgpool-II parses SQL and routes SELECT statements to replicas and writes to the primary automatically. PgBouncer is a connection pooler only — it does not inspect queries or route by type. For query-aware routing with PgBouncer, the application must use separate connection strings for read and write pools.

7. What is cascading replication and when is it useful?

Show answer

Cascading replication allows a replica to stream WAL to other replicas instead of all replicas connecting directly to the primary. Configure with primary_conninfo pointing to an upstream replica. Useful when you have many replicas — it reduces network and CPU load on the primary by fanning out through intermediate replicas.

8. What metrics should you monitor on read replicas to detect problems early?

Show answer

Replication lag (seconds and bytes), replay rate (WAL replayed per second), connection count vs max_connections, CPU and I/O utilization, streaming state in pg_stat_replication (should be 'streaming'), and query cancellation rate on the replica due to recovery conflicts (pg_stat_database_conflicts).

🔴 Hard (6)¶

1. How does PostgreSQL's synchronous_commit setting enable per-transaction consistency control with replicas?

Show answer

synchronous_commit can be set per transaction: 'remote_apply' ensures the write is replayed on the sync replica before commit returns (safe for immediate reads from that replica). 'local' only waits for local WAL flush (faster but replica may lag). This lets you choose strong vs eventual consistency on a per-query basis.

2. In a read replica architecture, what happens if all replicas become unreachable?

Show answer

All read traffic either fails (if the application requires replicas) or falls back to the primary (if routing allows it), potentially overloading the primary. A well-designed system has circuit breakers to detect failed replicas, routes reads to the primary as a fallback with rate limiting, and alerts immediately so ops can restore replicas or add capacity.

3. How do you verify from the primary that all expected replicas are connected and healthy?

Show answer

SELECT application_name, client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes, replay_lag FROM pg_stat_replication; Check that the count matches expected replicas, all show state = 'streaming', and replay_lag_bytes is within acceptable thresholds.

4. Name three strategies for achieving read-after-write consistency with async replicas.

Show answer

1. Route read-your-own-writes to the primary for a short window after each write (e.g., 5 seconds).
2. Include the primary's LSN in the write response and only read from a replica once its replay LSN has caught up.
3. Use synchronous_commit = 'remote_apply' for critical writes so the replica is guaranteed to have replayed them before the commit returns.

5. What steps are involved in promoting a replica to primary during failover?

Show answer

1. Verify the replica is the most up-to-date (least lag).
2. Fence or stop the old primary to prevent split-brain.
3. Promote the replica: SELECT pg_promote() or pg_ctl promote.
4. Update DNS or connection routing to point at the new primary.
5. Reconfigure remaining replicas to follow the new primary.
6. Verify replication resumes and applications reconnect.

6. When can adding read replicas make performance worse instead of better?

Show answer

When the workload is write-heavy (replicas must replay all writes but serve few reads), when replication lag causes frequent cache invalidation or application retries, when queries on replicas conflict with WAL replay causing cancellations (max_standby_streaming_delay), or when the connection routing layer adds more latency than the replica offloads.