- devops
- l2
- runbook
- postgresql
- replication
# Runbook: PostgreSQL Replication Lag
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_replication_lag_seconds > 30 or replica behind primary by >100 MB |
| Severity | P2 |
| Est. Resolution Time | 20-45 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access to primary and replica, read access to pg_stat_replication on primary |
## Quick Assessment (30 seconds)
```bash
# Run this first on the PRIMARY — it tells you the scope of the problem
psql -h <PRIMARY_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_replication;"
```

If output shows: growing `sent_lsn` - `replay_lsn` gaps or a `state` other than `streaming` → continue to Step 1.

If output shows: an empty result → the replica may have disconnected entirely; this is more severe — skip to Step 4 and check replica connectivity.
## Step 1: Measure Replication Lag on the Primary
Why: The primary tracks exactly how far behind each replica is. This number tells you whether lag is growing (active problem) or stable (backlog draining normally).
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT client_addr,
       application_name,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
"
```

Healthy output:

```
 client_addr | application_name |   state   | lag_bytes | replay_lag
-------------+------------------+-----------+-----------+------------
 10.0.1.100  | replica1         | streaming |      8192 | 00:00:00.5
```

If this fails: your user may lack `pg_monitor` privileges. Use superuser credentials or ask the DBA team to run this query.
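`pg_wal_lsn_diff` is plain arithmetic on PostgreSQL's `high/low` hexadecimal LSN format, which is useful to know when all you have is raw LSN strings from logs. A minimal Python sketch of the same calculation (the helper names are hypothetical, not part of any PostgreSQL client library):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/15A00000' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Same arithmetic as pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

# Example: replica replayed up to 0/15A00000 while primary sent 0/15A02000
print(lag_bytes("0/15A02000", "0/15A00000"))  # 8192
```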
## Step 2: Check Lag Directly on the Replica
Why: The replica's own view of lag confirms the primary's data and reveals whether the replica is applying WAL or just receiving it.
```bash
# Run this on the REPLICA
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag,
       pg_is_in_recovery() AS is_replica,
       pg_last_wal_receive_lsn() AS last_received,
       pg_last_wal_replay_lsn() AS last_replayed;
"
```

Healthy output:

```
 replication_lag | is_replica | last_received | last_replayed
-----------------+------------+---------------+---------------
 00:00:00.8      | t          | 0/15A00000    | 0/15A00000
```

If `last_received` is far ahead of `last_replayed`, the replica is receiving WAL faster than it can apply it — the bottleneck is replica disk I/O or CPU.

If this fails: You may be connected to the wrong host. Confirm with `SELECT pg_is_in_recovery();` — it must return `t` to be a replica.
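The receive/replay distinction above can be turned into a quick triage rule. A minimal Python sketch (hypothetical helper; the 16 MB threshold is an assumed tolerance, tune it for your workload) comparing the primary's `sent_lsn` with the replica's received and replayed LSNs:

```python
def classify_lag(sent_lsn: str, received_lsn: str, replayed_lsn: str,
                 threshold: int = 16 * 1024 * 1024) -> str:
    """Triage rule of thumb: is the gap in transfer or in replay?"""
    def lsn(s: str) -> int:
        hi, lo = s.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    receive_gap = lsn(sent_lsn) - lsn(received_lsn)    # stuck in transit
    apply_gap = lsn(received_lsn) - lsn(replayed_lsn)  # received but not applied
    if apply_gap > threshold:
        return "replay bottleneck: check replica disk I/O and CPU (Step 5)"
    if receive_gap > threshold:
        return "transfer bottleneck: check the network path (Step 4)"
    return "within threshold"

# Received almost everything but replayed far less: replay is the bottleneck
print(classify_lag("0/80000000", "0/7FF00000", "0/10000000"))
```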
## Step 3: Check Replication Slot Status for WAL Bloat
Why: Unused or inactive replication slots force the primary to retain WAL indefinitely, which can fill the disk and also indicate that a replica has silently stopped.
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
ORDER BY retained_bytes DESC;
"
```

Healthy output:

```
 slot_name | slot_type | active | retained_wal
-----------+-----------+--------+--------------
 replica1  | physical  | t      | 16 MB
```

If `active = f` and `retained_wal` is large (gigabytes), the slot is holding WAL hostage. Do not drop the slot without confirming it is truly unused — a slot whose replica is only temporarily disconnected still needs that WAL, and dropping it will break that replica.

If this fails: You may need superuser access to view all replication slot details.
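The triage rule in this step can be sketched in a few lines of Python (hypothetical helper, with an assumed 1 GB threshold) operating on rows fetched from `pg_replication_slots`:

```python
def risky_slots(slots, retained_limit: int = 1024 ** 3):
    """Flag inactive slots retaining more than retained_limit bytes of WAL.

    slots: (slot_name, active, retained_bytes) tuples, e.g. fetched from
    pg_replication_slots. Returns candidates for *investigation* -- never
    drop a slot automatically, since its replica may only be disconnected.
    """
    return [name for name, active, retained in slots
            if not active and retained is not None and retained > retained_limit]

example = [("replica1", True, 16 * 1024 ** 2),      # active, 16 MB: fine
           ("old_standby", False, 42 * 1024 ** 3)]  # inactive, 42 GB: danger
print(risky_slots(example))  # ['old_standby']
```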
## Step 4: Check Network Connectivity Between Primary and Replica
Why: A degraded network link (packet loss, bandwidth saturation) causes WAL to back up even if both hosts are healthy.
```bash
# Run from the replica host
ping -c 20 <PRIMARY_HOST>

# Check TCP throughput (requires iperf3 installed on both hosts)
iperf3 -c <PRIMARY_HOST> -t 10

# Check if WAL receiver is connected
psql -h <REPLICA_HOST> -U <DB_USER> -c "SELECT * FROM pg_stat_wal_receiver;"
```

Healthy output:

```
  status   | receive_start_lsn | last_msg_send_time | last_msg_receipt_time
-----------+-------------------+--------------------+-----------------------
 streaming | 0/1500000         | 2026-03-19 ...     | 2026-03-19 ...
```

If `pg_stat_wal_receiver` is empty, the WAL receiver process has stopped. Check PostgreSQL logs on the replica: `sudo tail -100 /var/log/postgresql/postgresql-*.log`
## Step 5: Check Replica System Load and Disk I/O
Why: The replica may be receiving WAL fast enough but falling behind applying it due to high CPU or slow disk. This is the most common cause of sustained lag on a healthy network.
```bash
# Run on the REPLICA host

# CPU and load
uptime && top -bn1 | head -15

# Disk I/O — are writes saturating the disk?
iostat -x 1 5

# Check if PostgreSQL walreceiver/walwriter processes are running
ps aux | grep -E 'wal(receiver|writer|startup)'
```

Healthy signs:

- `uptime`: load average below the number of CPU cores
- `iostat`: `%util` below 80% on the data volume
- `ps`: walreceiver process running
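The load-average rule of thumb above can also be checked programmatically. A small Python sketch (the below-one-per-core cutoff is this runbook's heuristic, not a PostgreSQL rule):

```python
import os

def load_is_healthy(load1: float, ncpu: int) -> bool:
    """Runbook rule of thumb: 1-minute load average below the core count."""
    return load1 < ncpu

# Live check on the current host (run this on the replica):
load1, _, _ = os.getloadavg()
ncpu = os.cpu_count() or 1
print(f"load1={load1:.2f} cores={ncpu} healthy={load_is_healthy(load1, ncpu)}")
```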
## Step 6: Check for Long-Running Transactions on the Primary Blocking Cleanup
Why: Long transactions on the primary prevent VACUUM from running, which causes table bloat and can indirectly slow replication by generating excess WAL for cleanup operations.
```bash
psql -h <PRIMARY_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       now() - xact_start AS txn_duration,
       left(query, 100) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > INTERVAL '5 minutes'
ORDER BY txn_duration DESC;
"
```
## Verification
```bash
# Confirm lag is decreasing — run twice 60 seconds apart and compare
psql -h <REPLICA_HOST> -U <DB_USER> -c "
SELECT now() - pg_last_xact_replay_timestamp() AS lag;
"
```
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | DBA on-call | "Replication lag on <REPLICA_HOST> still above threshold after 30 minutes; runbook Steps 1–6 attempted" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: replica <REPLICA_HOST> may have diverged or been promoted while lagging" |
| Scope expanding | Platform team | "All replicas lagging simultaneously — possible primary disk or network failure" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Verify `wal_keep_size` is set appropriately to buffer lag spikes without filling disk
- Check whether `max_wal_size` is tuned for the write workload
- Confirm all replication slots are actively used — remove unused ones
## Common Mistakes
- Dropping an unused-looking replication slot without confirming it is truly unused: A slot with `active = f` may belong to a replica that is temporarily disconnected. Dropping it will prevent that replica from catching up and may force a full re-sync.
- Not checking if the replica is paused: A replica can be manually paused with `SELECT pg_wal_replay_pause();`. Check with `SELECT pg_is_wal_replay_paused();` — if `t`, run `SELECT pg_wal_replay_resume();`.
- Confusing streaming replication lag with logical replication lag: These are different systems. `pg_stat_replication` covers streaming (physical) replicas. Logical replication lag is visible in `pg_stat_subscription`. The fixes are different.
- Promoting a lagging replica without verifying it has caught up: If you promote a replica that is still 10 minutes behind, you will lose all writes from that window. Always verify lag is near zero before any planned failover.
## Cross-References
- Topic Pack: PostgreSQL Replication (deep background on streaming vs logical replication, WAL, and failover)
- Related Runbook: postgres-conn-exhaustion.md — high connection counts on primary can slow WAL generation
- Related Runbook: postgres-disk-space.md — WAL accumulation from inactive slots fills disk
## Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: Long-Running Query / Lock Contention (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Connection Exhaustion (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations