Skip to content

Runbook: Long-Running Query / Lock Contention

Field Value
Domain Databases
Alert pg_stat_activity_max_tx_duration > 300 (5 minutes) or deadlock errors in application logs
Severity P2
Est. Resolution Time 15-30 minutes
Escalation Timeout 30 minutes — page if not resolved
Last Tested 2026-03-19
Prerequisites psql access, pg_stat_activity visibility, pg_cancel_backend or pg_terminate_backend permission

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       now() - query_start AS duration,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
"
If output shows: One or more queries running for minutes → Continue to Step 1 If output shows: All durations are short (under 30 seconds) → The query finished; the alert may be stale or the problem is intermittent — check application logs for deadlock messages and see postgres-conn-exhaustion.md if connections are still high

Step 1: Find All Long-Running Queries and Transactions

Why: You need to distinguish between a long-running query (bad plan, missing index) and a long-running idle transaction (application bug, connection leak) — the fix is different for each.

psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS txn_duration,
       now() - query_start AS query_duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY txn_duration DESC
LIMIT 10;
"
Expected output:
  pid  | usename  | txn_duration | query_duration | state  | wait_event_type | query
-------+----------+--------------+----------------+--------+-----------------+-------
 23456 | app_user | 00:12:34     | 00:12:34       | active | Lock            | UPDATE orders SET ...
 23457 | app_user | 00:01:23     | 00:00:00.1     | idle   |                 | COMMIT
wait_event_type = Lock means this query is waiting on another lock — it is blocked, not just slow. If this fails: Check that your user has pg_monitor role or is a superuser. Without it, you can only see your own queries.

Step 2: Find Lock Waiters and Their Blockers

Why: In a lock contention scenario, there is always a chain: one session holds a lock, others pile up waiting. You must kill the root blocker — killing a waiter does nothing, another waiter will immediately take its place.

psql -h <DB_HOST> -U <DB_USER> -c "
SELECT blocked_locks.pid          AS blocked_pid,
       blocked_activity.usename   AS blocked_user,
       blocking_locks.pid         AS blocking_pid,
       blocking_activity.usename  AS blocking_user,
       now() - blocked_activity.xact_start AS blocked_duration,
       left(blocked_activity.query, 80)    AS blocked_query,
       left(blocking_activity.query, 80)   AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks
     ON blocking_locks.locktype    = blocked_locks.locktype
    AND blocking_locks.relation    = blocked_locks.relation
    AND blocking_locks.granted
    AND NOT blocked_locks.granted
JOIN pg_catalog.pg_stat_activity blocked_activity
     ON blocked_activity.pid  = blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
     ON blocking_activity.pid = blocking_locks.pid;
"
Expected output:
 blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_duration | blocked_query
-------------+--------------+--------------+---------------+------------------+---------------
 23456       | app_user     | 12345        | app_user      | 00:05:12         | UPDATE orders...
The blocking_pid is your target. Note it down. If this fails: If no rows are returned but queries are still slow, the bottleneck may be I/O or CPU, not lock contention. Check with wait_event_type values: IO means disk bound, Client means waiting for the application.

Step 3: Attempt a Graceful Cancel First

Why: pg_cancel_backend sends SIGINT, which lets the query clean up and roll back the transaction cleanly. This is always preferable to a hard terminate because it avoids leaving the session in an inconsistent state.

psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pg_cancel_backend(<BLOCKING_PID>);
"
Expected output:
 pg_cancel_backend
-------------------
 t
Wait 10-15 seconds, then re-run the query from Step 1 to confirm the blocking query is gone. If this fails: The query may be in a non-cancellable state (e.g., waiting on a lock or executing a built-in function). Proceed to Step 4.

Step 4: Terminate the Blocking Query If Cancel Fails

Why: pg_terminate_backend sends SIGTERM, which forces the backend process to exit immediately. The transaction is rolled back, but the mechanism is less clean than a cancel.

psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pg_terminate_backend(<BLOCKING_PID>);
"
Expected output:
 pg_terminate_backend
----------------------
 t
Confirm the blocking session has ended:
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid FROM pg_stat_activity WHERE pid = <BLOCKING_PID>;
"
Expected: zero rows returned. If this fails: You may need superuser privileges. If you do not have them, escalate to the DBA on-call immediately.

Step 5: Analyze the Problematic Query to Understand Why It Was Slow

Why: Terminating the query is triage, not a fix. If the same query runs again, it will block again. You must understand why it was slow to prevent recurrence.

# Get the full query text from slow query log or pg_stat_statements
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT query, calls, mean_exec_time, max_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# Then EXPLAIN the specific query (replace with the actual query)
psql -h <DB_HOST> -U <DB_USER> -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<SLOW_QUERY_TEXT>;
"
Expected output: The EXPLAIN ANALYZE output shows the query plan. Look for Seq Scan on large tables (suggests a missing index) or Nested Loop on large result sets (suggests a join strategy issue). If this fails: pg_stat_statements may not be enabled. Check SHOW shared_preload_libraries; — it must include pg_stat_statements.

Step 6: Add an Index or Rewrite the Query to Prevent Recurrence

Why: The permanent fix is structural. Either the query needs a better execution plan (via an index) or the application code needs to be changed to avoid the long-running transaction.

# Create an index concurrently to avoid locking the table during creation
psql -h <DB_HOST> -U <DB_USER> -c "
CREATE INDEX CONCURRENTLY idx_<TABLE>_<COLUMN>
ON <SCHEMA>.<TABLE> (<COLUMN>);
"

# Set a statement timeout to auto-kill queries that exceed a threshold (add to postgresql.conf or per-role)
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
ALTER ROLE <APP_USER> SET statement_timeout = '5min';
"
Expected output:
CREATE INDEX
If this fails: CREATE INDEX CONCURRENTLY cannot run inside a transaction. If you see that error, you are inside a BEGIN block — run ROLLBACK first, then retry.

Verification

# Confirm the issue is resolved
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS blocked_queries
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"
Success looks like: blocked_queries = 0, and the query that was slow now completes in under a second. If still broken: Escalate — see below.

Escalation

Condition Who to Page What to Say
Not resolved in 30 min DBA on-call "Lock contention on : pid blocking queries; unable to terminate with available permissions"
Data loss suspected DBA / Data Lead "Possible data loss: forced termination of a write transaction on , transaction may not have committed"
Scope expanding Platform team "Lock chain growing — multiple tables affected on ; application writes are stalled"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • Set lock_timeout and statement_timeout for application roles to prevent future runaway transactions
  • Review application connection pool settings — idle-in-transaction sessions often come from pool misconfiguration
  • Enable log_lock_waits = on and deadlock_timeout = 1s for better observability of future incidents

Common Mistakes

  1. Terminating the blocked query instead of the blocking one: This does nothing to resolve the lock contention — the blocking session still holds the lock. Always identify and kill the blocking_pid, not the blocked_pid.
  2. Not attempting pg_cancel_backend before pg_terminate_backend: Cancel gives the query a chance to roll back cleanly. Terminate is forceful and leaves less room for the application to handle the interruption gracefully.
  3. Confusing wait_event_type values: Lock means waiting for a database lock (another session). IO means waiting for disk reads/writes. Client means waiting for the application to send data. Each requires a different response.
  4. Running EXPLAIN ANALYZE on the live production table for a destructive query: EXPLAIN ANALYZE actually executes the query. If the query is a DELETE or UPDATE, wrapping it in a transaction you can roll back is essential: BEGIN; EXPLAIN ANALYZE <QUERY>; ROLLBACK;

Cross-References

  • Topic Pack: PostgreSQL Query Performance (deep background on EXPLAIN output, index types, and query planning)
  • Related Runbook: postgres-conn-exhaustion.md — long transactions hold connections open, causing exhaustion
  • Related Runbook: postgres-disk-space.md — bloated tables from missed VACUUM are often behind slow sequential scans

Wiki Navigation