- devops
- l2
- runbook
- postgresql
- locking
---
# Runbook: Long-Running Query / Lock Contention
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_stat_activity_max_tx_duration > 300 (5 minutes) or deadlock errors in application logs |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access, pg_stat_activity visibility, pg_cancel_backend or pg_terminate_backend permission |
## Quick Assessment (30 seconds)
```shell
# Run this first — it tells you the scope of the problem
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       now() - query_start AS duration,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
"
```
## Step 1: Find All Long-Running Queries and Transactions
Why: You need to distinguish between a long-running query (bad plan, missing index) and a long-running idle transaction (application bug, connection leak) — the fix is different for each.
```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS txn_duration,
       now() - query_start AS query_duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY txn_duration DESC
LIMIT 10;
"
```
```
  pid  | usename  | txn_duration | query_duration | state  | wait_event_type | query
-------+----------+--------------+----------------+--------+-----------------+-------
 23456 | app_user | 00:12:34     | 00:12:34       | active | Lock            | UPDATE orders SET ...
 23457 | app_user | 00:01:23    | 00:00:00.1     | idle   |                 | COMMIT
```

`wait_event_type = Lock` means this query is waiting on a lock held by another session — it is blocked, not just slow.

If this fails: check that your user has the `pg_monitor` role or is a superuser. Without it, you can only see your own queries.
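If visibility is the problem, a grant sketch (assumes you have a superuser account available; `<DB_SUPERUSER>` and `<DB_USER>` are placeholders as elsewhere in this runbook):

```shell
# Grant the built-in pg_monitor role so the user can see all sessions' queries
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "GRANT pg_monitor TO <DB_USER>;"
```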
## Step 2: Find Lock Waiters and Their Blockers
Why: In a lock contention scenario, there is always a chain: one session holds a lock, others pile up waiting. You must kill the root blocker — killing a waiter does nothing, another waiter will immediately take its place.
```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       now() - blocked_activity.xact_start AS blocked_duration,
       left(blocked_activity.query, 80) AS blocked_query,
       left(blocking_activity.query, 80) AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.relation = blocked_locks.relation
 AND blocking_locks.granted
 AND NOT blocked_locks.granted
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid;
"
```
```
 blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_duration | blocked_query
-------------+--------------+--------------+---------------+------------------+---------------
       23456 | app_user     |        12345 | app_user      | 00:05:12         | UPDATE orders...
```

`blocking_pid` is your target. Note it down.

If this fails: if no rows are returned but queries are still slow, the bottleneck may be I/O or CPU, not lock contention. Check the `wait_event_type` values: `IO` means disk-bound, `Client` means waiting for the application.
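A quick way to see where active sessions are actually waiting (a sketch; all columns are standard `pg_stat_activity` fields):

```shell
# Group active sessions by what they are waiting on
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
"
```

A large `IO` bucket points at disk, a large `Lock` bucket confirms contention, and mostly-NULL wait events mean the sessions are burning CPU.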
## Step 3: Attempt a Graceful Cancel First

Why: `pg_cancel_backend` sends SIGINT, which lets the query clean up and roll back the transaction cleanly. This is always preferable to a hard terminate because it avoids leaving the session in an inconsistent state.
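A command sketch for this step (`<BLOCKING_PID>` is the PID you noted in Step 2):

```shell
# Graceful cancel: sends SIGINT to the backend; returns 't' if the signal was sent
psql -h <DB_HOST> -U <DB_USER> -c "SELECT pg_cancel_backend(<BLOCKING_PID>);"
```

Note that a cancel only interrupts a currently running query — it does not end an idle-in-transaction session, so if the blocker is idle in transaction and still holding locks, proceed to Step 4.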
## Step 4: Terminate the Blocking Query If Cancel Fails

Why: `pg_terminate_backend` sends SIGTERM, which forces the backend process to exit immediately. The transaction is rolled back, but the mechanism is less clean than a cancel.
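A command sketch (same `<BLOCKING_PID>` as above):

```shell
# Forceful terminate: sends SIGTERM; the backend exits and its transaction rolls back
psql -h <DB_HOST> -U <DB_USER> -c "SELECT pg_terminate_backend(<BLOCKING_PID>);"
```

Afterwards, re-run the Step 2 query to confirm the lock chain has cleared and no new root blocker has appeared.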
## Step 5: Analyze the Problematic Query to Understand Why It Was Slow
Why: Terminating the query is triage, not a fix. If the same query runs again, it will block again. You must understand why it was slow to prevent recurrence.
```shell
# Get the full query text from the slow query log or pg_stat_statements
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT query, calls, mean_exec_time, max_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# Then EXPLAIN the specific query (replace with the actual query)
psql -h <DB_HOST> -U <DB_USER> -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<SLOW_QUERY_TEXT>;
"
```
`EXPLAIN ANALYZE` output shows the query plan. Look for `Seq Scan` on large tables (suggests a missing index) or `Nested Loop` on large result sets (suggests a join strategy issue).

If this fails: `pg_stat_statements` may not be enabled. Check `SHOW shared_preload_libraries;` — it must include `pg_stat_statements`.
## Step 6: Add an Index or Rewrite the Query to Prevent Recurrence
Why: The permanent fix is structural. Either the query needs a better execution plan (via an index) or the application code needs to be changed to avoid the long-running transaction.
```shell
# Create an index concurrently to avoid locking the table during creation
psql -h <DB_HOST> -U <DB_USER> -c "
CREATE INDEX CONCURRENTLY idx_<TABLE>_<COLUMN>
ON <SCHEMA>.<TABLE> (<COLUMN>);
"

# Set a statement timeout to auto-kill queries that exceed a threshold (add to postgresql.conf or per-role)
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
ALTER ROLE <APP_USER> SET statement_timeout = '5min';
"
```
`CREATE INDEX CONCURRENTLY` cannot run inside a transaction. If you see that error, you are inside a `BEGIN` block — run `ROLLBACK` first, then retry.
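A related failure mode worth knowing: if a concurrent index build errors out partway, PostgreSQL leaves an INVALID index behind that must be dropped manually before retrying. A cleanup sketch (`<INDEX_NAME>` comes from the first query's output):

```shell
# Find invalid indexes left behind by a failed CREATE INDEX CONCURRENTLY
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;
"

# Drop the leftover, then retry the concurrent build
psql -h <DB_HOST> -U <DB_USER> -c "DROP INDEX CONCURRENTLY <INDEX_NAME>;"
```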
## Verification
```shell
# Confirm the issue is resolved
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS blocked_queries
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"
```

Expected: `blocked_queries` is 0, and the query that was slow now completes in under a second.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | DBA on-call | "Lock contention on <DB_HOST>, not resolved within 30 minutes" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: forced termination of a write transaction on <DB_HOST>" |
| Scope expanding | Platform team | "Lock chain growing — multiple tables affected on <DB_HOST>" |
## Post-Incident
- Update monitoring if the alert was noisy or missing
- File a postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Set `lock_timeout` and `statement_timeout` for application roles to prevent future runaway transactions
- Review application connection pool settings — idle-in-transaction sessions often come from pool misconfiguration
- Enable `log_lock_waits = on` and `deadlock_timeout = 1s` for better observability of future incidents
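The settings above can be applied roughly like this (a sketch: the timeout values are illustrative, not recommendations, and `ALTER SYSTEM` needs superuser plus a config reload; each statement runs in its own `psql -c` call because `ALTER SYSTEM` cannot run inside a transaction block):

```shell
# Per-role timeouts take effect on new connections for that role
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER ROLE <APP_USER> SET lock_timeout = '10s';"
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER ROLE <APP_USER> SET idle_in_transaction_session_timeout = '60s';"

# Log any lock wait longer than deadlock_timeout (default 1s); a reload suffices, no restart
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER SYSTEM SET log_lock_waits = on;"
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_reload_conf();"
```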
## Common Mistakes
- Terminating the blocked query instead of the blocking one: this does nothing to resolve the lock contention — the blocking session still holds the lock. Always identify and kill the `blocking_pid`, not the `blocked_pid`.
- Not attempting `pg_cancel_backend` before `pg_terminate_backend`: cancel gives the query a chance to roll back cleanly. Terminate is forceful and leaves less room for the application to handle the interruption gracefully.
- Confusing `wait_event_type` values: `Lock` means waiting for a database lock (another session). `IO` means waiting for disk reads/writes. `Client` means waiting for the application to send data. Each requires a different response.
- Running `EXPLAIN ANALYZE` on a destructive query against the live production table: `EXPLAIN ANALYZE` actually executes the query. If the query is a DELETE or UPDATE, wrapping it in a transaction you can roll back is essential: `BEGIN; EXPLAIN ANALYZE <QUERY>; ROLLBACK;`
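The safe pattern for that last mistake, as a sketch (`<SLOW_QUERY_TEXT>` stands in for the destructive statement under investigation):

```shell
# ROLLBACK undoes the row changes, but the statement still runs in full:
# it takes the same locks and does the same work while executing,
# so run this off-peak and keep the Step 2 query handy
psql -h <DB_HOST> -U <DB_USER> -c "
BEGIN;
EXPLAIN (ANALYZE, BUFFERS) <SLOW_QUERY_TEXT>;
ROLLBACK;
"
```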
## Cross-References
- Topic Pack: PostgreSQL Query Performance (deep background on EXPLAIN output, index types, and query planning)
- Related Runbook: `postgres-conn-exhaustion.md` — long transactions hold connections open, causing exhaustion
- Related Runbook: `postgres-disk-space.md` — bloated tables from missed VACUUM are often behind slow sequential scans
## Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Connection Exhaustion (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Replication Lag (Runbook, L2) — PostgreSQL Operations