
Runbook: PostgreSQL Replication Lag

Domain: Databases
Alert: pg_replication_lag_seconds > 30, or replica behind primary by >100 MB
Severity: P2
Est. resolution time: 20-45 minutes
Escalation timeout: 30 minutes (page if not resolved)
Last tested: 2026-03-19
Prerequisites: psql access to primary and replica; read access to pg_stat_replication on the primary

Quick Assessment (30 seconds)

# Run this first on the PRIMARY — it tells you the scope of the problem
psql -h <PRIMARY_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_replication;"
If output shows rows with a large sent_lsn - replay_lsn gap, or state is not streaming → continue to Step 1.
If output is empty → the replica may have disconnected entirely, which is more severe. Skip to Step 4 and check replica connectivity.

Step 1: Measure Replication Lag on the Primary

Why: The primary tracks exactly how far behind each replica is. This number tells you whether lag is growing (active problem) or stable (backlog draining normally).

psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT client_addr,
       application_name,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
"
Expected output (healthy):
  client_addr  | application_name | state     | lag_bytes | replay_lag
---------------+------------------+-----------+-----------+------------
 10.0.1.100    | replica1         | streaming |      8192 | 00:00:00.5
If this fails (or the lag columns show NULL): you lack the pg_monitor role. Use superuser credentials or ask the DBA team to run this query.
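To tell whether lag is actively growing or a backlog is draining, sample the byte lag twice in one psql session and compare. A minimal sketch, assuming the same pg_monitor access as the query above:

```sql
-- Sample lag twice, 30 s apart; if t1 > t0 for a replica, lag is growing
SELECT application_name,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes_t0
FROM pg_stat_replication;

SELECT pg_sleep(30);

SELECT application_name,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes_t1
FROM pg_stat_replication;
```

A growing value means the problem is active; a shrinking one means the replica is catching up and may only need time.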

Step 2: Check Lag Directly on the Replica

Why: The replica's own view of lag confirms the primary's data and reveals whether the replica is applying WAL or just receiving it.

# Run this on the REPLICA
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag,
       pg_is_in_recovery() AS is_replica,
       pg_last_wal_receive_lsn() AS last_received,
       pg_last_wal_replay_lsn() AS last_replayed;
"
Expected output:
 replication_lag | is_replica | last_received | last_replayed
-----------------+------------+---------------+---------------
 00:00:00.8      | t          | 0/15A00000    | 0/15A00000
If last_received is far ahead of last_replayed, the replica is receiving WAL faster than it can apply it; the bottleneck is replica disk I/O or CPU.
If this fails: you may be connected to the wrong host. Confirm with SELECT pg_is_in_recovery(); — it must return t on a replica.
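The receive-vs-apply gap can also be computed directly in bytes rather than eyeballing LSNs. A quick check, run on the replica:

```sql
-- Bytes received from the primary but not yet applied locally
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                         pg_last_wal_replay_lsn())
       ) AS apply_backlog;
```

A large and growing apply_backlog confirms the bottleneck is local WAL apply, not the network; continue to Step 5.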

Step 3: Check Replication Slot Status for WAL Bloat

Why: Unused or inactive replication slots force the primary to retain WAL indefinitely, which can fill the disk and also indicate that a replica has silently stopped.

psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
ORDER BY retained_bytes DESC;
"
Expected output:
  slot_name  | slot_type | active | retained_wal
-------------+-----------+--------+--------------
 replica1    | physical  | t      | 16 MB
If active = f and retained_wal is large (gigabytes), the slot is holding WAL hostage. Do not drop the slot without confirming it is truly unused; a slot with active = f may belong to a temporarily disconnected replica, and dropping it would force that replica into a full re-sync.
If this fails: you need superuser access to view all replication slot details.
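Once the DBA team confirms a slot's consumer is permanently decommissioned, the slot can be dropped to release the retained WAL. A sketch — 'old_replica2' is a hypothetical slot name, not one from this environment:

```sql
-- Irreversible: the old consumer can never resume from this slot afterwards
SELECT pg_drop_replication_slot('old_replica2');
```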

Step 4: Check Network Connectivity Between Primary and Replica

Why: A degraded network link (packet loss, bandwidth saturation) causes WAL to back up even if both hosts are healthy.

# Run from the replica host
ping -c 20 <PRIMARY_HOST>

# Check TCP throughput (requires iperf3 installed on both hosts)
iperf3 -c <PRIMARY_HOST> -t 10

# Check if WAL receiver is connected
psql -h <REPLICA_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_wal_receiver;"
Expected output:
 status | receive_start_lsn | last_msg_send_time | last_msg_receipt_time
--------+-------------------+--------------------+-----------------------
 streaming | 0/1500000      | 2026-03-19 ...    | 2026-03-19 ...
If this fails: an empty pg_stat_wal_receiver means the WAL receiver process has stopped. Check PostgreSQL logs on the replica: sudo tail -100 /var/log/postgresql/postgresql-*.log
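When the receiver view is populated, its message timestamps reveal a stalled link that ping alone can miss. A quick staleness check on the replica:

```sql
-- A msg_age of more than a few seconds suggests a stalled or degraded link
SELECT status,
       now() - last_msg_receipt_time AS msg_age
FROM pg_stat_wal_receiver;
```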

Step 5: Check Replica System Load and Disk I/O

Why: The replica may be receiving WAL fast enough but falling behind applying it due to high CPU or slow disk. This is the most common cause of sustained lag on a healthy network.

# Run on the REPLICA host
# CPU and load
uptime && top -bn1 | head -15

# Disk I/O — are writes saturating the disk?
iostat -x 1 5

# Check that the walreceiver and startup (WAL apply) processes are running
ps aux | grep -E '[w]alreceiver|[s]tartup'
Expected output (healthy):
# uptime: load average below number of CPU cores
# iostat: %util below 80% on the data volume
# ps: walreceiver process running
If load is extremely high: the replica may need vertical scaling, or the WAL apply rate needs tuning. Escalate to the DBA team with this data.
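One replica-side cause worth ruling out here: long-running queries on the standby can stall WAL apply, because apply waits up to max_standby_streaming_delay before cancelling conflicting queries. A sketch of the check, run on the replica:

```sql
-- How long WAL apply will wait behind conflicting replica queries
SHOW max_standby_streaming_delay;

-- Long-running replica queries that may be holding up apply
SELECT pid,
       now() - query_start AS runtime,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
ORDER BY runtime DESC;
```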

Step 6: Check for Long-Running Transactions on the Primary Blocking Cleanup

Why: Long transactions on the primary prevent VACUUM from running, which causes table bloat and can indirectly slow replication by generating excess WAL for cleanup operations.

psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       now() - xact_start AS txn_duration,
       left(query, 100) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > INTERVAL '5 minutes'
ORDER BY txn_duration DESC;
"
Expected output (healthy): no rows, or rows with short durations.
If long-running transactions appear: see long-running-query.md for how to identify and resolve them.
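If a runaway transaction must be ended immediately (coordinate with its owner first), pg_terminate_backend does so. 12345 below is a hypothetical pid taken from the query above, not a real value:

```sql
-- Rolls back the transaction and disconnects the session
SELECT pg_terminate_backend(12345);
```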

Verification

# Confirm lag is decreasing — run twice 60 seconds apart and compare
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
"
Success looks like: the lag value decreases between the two checks and eventually drops under 5 seconds.
If still broken: escalate — see below.
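To watch the trend instead of sampling by hand, a small shell loop works; it reuses the same placeholders as the steps above (stop with Ctrl-C once lag levels off near zero):

```
# Print replica lag once a minute until interrupted
while true; do
  psql -h <REPLICA_HOST> -U <DB_USER> -Atc \
    "SELECT now() - pg_last_xact_replay_timestamp();"
  sleep 60
done
```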

Escalation

Condition → Who to Page → What to Say
Not resolved in 30 min → DBA on-call → "Replication lag on <REPLICA_HOST> is <N> seconds and growing; primary: <PRIMARY_HOST>"
Data loss suspected → DBA / Data Lead → "Possible data loss: replica disconnected; unsure if failover is safe"
Scope expanding → Platform team → "All replicas lagging simultaneously — possible primary disk or network failure"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Verify wal_keep_size is set appropriately to buffer lag spikes without filling disk
  • Check whether max_wal_size is tuned for write workload
  • Confirm all replication slots are actively used — remove unused ones
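The retention-related checklist items above can be reviewed in one psql session; values shown are cluster-specific:

```sql
-- Current WAL retention and sizing settings
SHOW wal_keep_size;
SHOW max_wal_size;

-- Any slot with active = f here deserves investigation
SELECT slot_name, slot_type, active FROM pg_replication_slots;
```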

Common Mistakes

  1. Dropping an unused-looking replication slot without confirming it is truly unused: A slot with active = f may belong to a replica that is temporarily disconnected. Dropping it will prevent that replica from catching up and may force a full re-sync.
  2. Not checking if the replica is paused: A replica can be manually paused with SELECT pg_wal_replay_pause();. Check with SELECT pg_is_wal_replay_paused(); — if t, run SELECT pg_wal_replay_resume();.
  3. Confusing streaming replication lag with logical replication lag: These are different systems. pg_stat_replication covers streaming (physical) replicas. Logical replication lag is visible in pg_stat_subscription. The fixes are different.
  4. Promoting a lagging replica without verifying it has caught up: If you promote a replica that is still 10 minutes behind, you will lose all writes from that window. Always verify lag is near zero before any planned failover.
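For mistake 3, the logical-replication counterpart of the lag check lives on the subscriber side. A sketch (pg_stat_subscription is available on PostgreSQL 10 and later):

```sql
-- On the SUBSCRIBER: per-subscription apply progress
SELECT subname,
       received_lsn,
       latest_end_lsn,
       latest_end_time
FROM pg_stat_subscription;
```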

Cross-References

  • Topic Pack: PostgreSQL Replication (deep background on streaming vs logical replication, WAL, and failover)
  • Related Runbook: postgres-conn-exhaustion.md — high connection counts on primary can slow WAL generation
  • Related Runbook: postgres-disk-space.md — WAL accumulation from inactive slots fills disk
