- devops
- l2
- runbook
- postgresql
- replication
# Runbook: PostgreSQL Replication Lag
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_replication_lag_seconds > 30 or replica behind primary by >100 MB |
| Severity | P2 |
| Est. Resolution Time | 20-45 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access to primary and replica, read access to pg_stat_replication on primary |
## Quick Assessment (30 seconds)
```bash
# Run this first on the PRIMARY — it tells you the scope of the problem
psql -h <PRIMARY_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_replication;"
```

If output shows: growing `sent_lsn` - `replay_lsn` gaps or a `state` other than `streaming` → continue to Step 1.

If output shows: an empty result → the replica may have disconnected entirely; this is more severe — skip to Step 4 and check replica connectivity.
## Step 1: Measure Replication Lag on the Primary
Why: The primary tracks exactly how far behind each replica is. This number tells you whether lag is growing (active problem) or stable (backlog draining normally).
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT client_addr,
       application_name,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
"
```

Healthy output:

```
 client_addr | application_name |   state   | lag_bytes | replay_lag
-------------+------------------+-----------+-----------+------------
 10.0.1.100  | replica1         | streaming |      8192 | 00:00:00.5
```

If this fails: your user may lack `pg_monitor` privileges. Use superuser credentials or ask the DBA team to run this query.
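`pg_wal_lsn_diff` is plain arithmetic on PostgreSQL's `high/low` hexadecimal LSN format, which is useful to know when all you have is raw LSN strings from logs. A minimal Python sketch of the same calculation (the helper names are hypothetical, not part of any PostgreSQL client library):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/15A00000' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Same arithmetic as pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

# Example: replica replayed up to 0/15A00000 while primary sent 0/15A02000
print(lag_bytes("0/15A02000", "0/15A00000"))  # 8192
```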
## Step 2: Check Lag Directly on the Replica
Why: The replica's own view of lag confirms the primary's data and reveals whether the replica is applying WAL or just receiving it.
```bash
# Run this on the REPLICA
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag,
       pg_is_in_recovery() AS is_replica,
       pg_last_wal_receive_lsn() AS last_received,
       pg_last_wal_replay_lsn() AS last_replayed;
"
```

Healthy output:

```
 replication_lag | is_replica | last_received | last_replayed
-----------------+------------+---------------+---------------
 00:00:00.8      | t          | 0/15A00000    | 0/15A00000
```

If `last_received` is far ahead of `last_replayed`, the replica is receiving WAL faster than it can apply it — the bottleneck is replica disk I/O or CPU.

If this fails: You may be connected to the wrong host. Confirm with `SELECT pg_is_in_recovery();` — it must return `t` to be a replica.
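The receive/replay distinction above can be turned into a quick triage rule. A minimal Python sketch (hypothetical helper; the 16 MB threshold is an assumed tolerance, tune it for your workload) comparing the primary's `sent_lsn` with the replica's received and replayed LSNs:

```python
def classify_lag(sent_lsn: str, received_lsn: str, replayed_lsn: str,
                 threshold: int = 16 * 1024 * 1024) -> str:
    """Triage rule of thumb: is the gap in transfer or in replay?"""
    def lsn(s: str) -> int:
        hi, lo = s.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    receive_gap = lsn(sent_lsn) - lsn(received_lsn)    # stuck in transit
    apply_gap = lsn(received_lsn) - lsn(replayed_lsn)  # received but not applied
    if apply_gap > threshold:
        return "replay bottleneck: check replica disk I/O and CPU (Step 5)"
    if receive_gap > threshold:
        return "transfer bottleneck: check the network path (Step 4)"
    return "within threshold"

# Received almost everything but replayed far less: replay is the bottleneck
print(classify_lag("0/80000000", "0/7FF00000", "0/10000000"))
```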
## Step 3: Check Replication Slot Status for WAL Bloat
Why: Unused or inactive replication slots force the primary to retain WAL indefinitely, which can fill the disk and also indicate that a replica has silently stopped.
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
ORDER BY retained_bytes DESC;
"
```

Healthy output:

```
 slot_name | slot_type | active | retained_wal
-----------+-----------+--------+--------------
 replica1  | physical  | t      | 16 MB
```

If `active = f` and `retained_wal` is large (gigabytes), the slot is holding WAL hostage. Do not drop the slot without confirming it is truly unused — a slot whose replica is only temporarily disconnected still needs that WAL, and dropping it will break that replica.

If this fails: You may need superuser access to view all replication slot details.
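The triage rule in this step can be sketched in a few lines of Python (hypothetical helper, with an assumed 1 GB threshold) operating on rows fetched from `pg_replication_slots`:

```python
def risky_slots(slots, retained_limit: int = 1024 ** 3):
    """Flag inactive slots retaining more than retained_limit bytes of WAL.

    slots: (slot_name, active, retained_bytes) tuples, e.g. fetched from
    pg_replication_slots. Returns candidates for *investigation* -- never
    drop a slot automatically, since its replica may only be disconnected.
    """
    return [name for name, active, retained in slots
            if not active and retained is not None and retained > retained_limit]

example = [("replica1", True, 16 * 1024 ** 2),      # active, 16 MB: fine
           ("old_standby", False, 42 * 1024 ** 3)]  # inactive, 42 GB: danger
print(risky_slots(example))  # ['old_standby']
```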
## Step 4: Check Network Connectivity Between Primary and Replica
Why: A degraded network link (packet loss, bandwidth saturation) causes WAL to back up even if both hosts are healthy.
```bash
# Run from the replica host
ping -c 20 <PRIMARY_HOST>

# Check TCP throughput (requires iperf3 installed on both hosts)
iperf3 -c <PRIMARY_HOST> -t 10

# Check if WAL receiver is connected
psql -h <REPLICA_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_wal_receiver;"
```

Healthy output:

```
  status   | receive_start_lsn | last_msg_send_time | last_msg_receipt_time
-----------+-------------------+--------------------+-----------------------
 streaming | 0/1500000         | 2026-03-19 ...     | 2026-03-19 ...
```

If `pg_stat_wal_receiver` is empty, the WAL receiver process has stopped. Check PostgreSQL logs on the replica: `sudo tail -100 /var/log/postgresql/postgresql-*.log`
## Step 5: Check Replica System Load and Disk I/O
Why: The replica may be receiving WAL fast enough but falling behind applying it due to high CPU or slow disk. This is the most common cause of sustained lag on a healthy network.
```bash
# Run on the REPLICA host

# CPU and load
uptime && top -bn1 | head -15

# Disk I/O — are writes saturating the disk?
iostat -x 1 5

# Check if PostgreSQL walreceiver/walwriter processes are running
ps aux | grep -E 'wal(receiver|writer|startup)'
```

Healthy signs:

- `uptime`: load average below the number of CPU cores
- `iostat`: `%util` below 80% on the data volume
- `ps`: walreceiver process running
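The load-average rule of thumb above can also be checked programmatically. A small Python sketch (the below-one-per-core cutoff is this runbook's heuristic, not a PostgreSQL rule):

```python
import os

def load_is_healthy(load1: float, ncpu: int) -> bool:
    """Runbook rule of thumb: 1-minute load average below the core count."""
    return load1 < ncpu

# Live check on the current host (run this on the replica):
load1, _, _ = os.getloadavg()
ncpu = os.cpu_count() or 1
print(f"load1={load1:.2f} cores={ncpu} healthy={load_is_healthy(load1, ncpu)}")
```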
## Step 6: Check for Long-Running Transactions on the Primary Blocking Cleanup
Why: Long transactions on the primary prevent VACUUM from running, which causes table bloat and can indirectly slow replication by generating excess WAL for cleanup operations.
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       now() - xact_start AS txn_duration,
       left(query, 100) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > INTERVAL '5 minutes'
ORDER BY txn_duration DESC;
"
```
## Verification
```bash
# Confirm lag is decreasing — run twice 60 seconds apart and compare
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
"
```
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | DBA on-call | "Replication lag on <REPLICA_HOST> still above threshold after 30 minutes; runbook Steps 1–6 attempted" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: replica <REPLICA_HOST> may have diverged or been promoted while lagging" |
| Scope expanding | Platform team | "All replicas lagging simultaneously — possible primary disk or network failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Verify `wal_keep_size` is set appropriately to buffer lag spikes without filling disk
- Check whether `max_wal_size` is tuned for the write workload
- Confirm all replication slots are actively used — remove unused ones
## Common Mistakes
- Dropping an unused-looking replication slot without confirming it is truly unused: A slot with `active = f` may belong to a replica that is temporarily disconnected. Dropping it will prevent that replica from catching up and may force a full re-sync.
- Not checking if the replica is paused: A replica can be manually paused with `SELECT pg_wal_replay_pause();`. Check with `SELECT pg_is_wal_replay_paused();` — if `t`, run `SELECT pg_wal_replay_resume();`.
- Confusing streaming replication lag with logical replication lag: These are different systems. `pg_stat_replication` covers streaming (physical) replicas. Logical replication lag is visible in `pg_stat_subscription`. The fixes are different.
- Promoting a lagging replica without verifying it has caught up: If you promote a replica that is still 10 minutes behind, you will lose all writes from that window. Always verify lag is near zero before any planned failover.
## Cross-References
- Topic Pack: PostgreSQL Replication (deep background on streaming vs logical replication, WAL, and failover)
- Related Runbook: postgres-conn-exhaustion.md — high connection counts on primary can slow WAL generation
- Related Runbook: postgres-disk-space.md — WAL accumulation from inactive slots fills disk
## Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: Long-Running Query / Lock Contention (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Connection Exhaustion (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations