- devops
- l2
- runbook
- postgresql
- locking
---
# Runbook: Long-Running Query / Lock Contention
| Field | Value |
|---|---|
| Domain | Databases |
| Alert | pg_stat_activity_max_tx_duration > 300 (5 minutes) or deadlock errors in application logs |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 30 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | psql access, pg_stat_activity visibility, pg_cancel_backend or pg_terminate_backend permission |
## Quick Assessment (30 seconds)
```shell
# Run this first — it tells you the scope of the problem
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       now() - query_start AS duration,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
"
```
## Step 1: Find All Long-Running Queries and Transactions
Why: You need to distinguish between a long-running query (bad plan, missing index) and a long-running idle transaction (application bug, connection leak) — the fix is different for each.
```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS txn_duration,
       now() - query_start AS query_duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY txn_duration DESC
LIMIT 10;
"
```
```
  pid  | usename  | txn_duration | query_duration | state  | wait_event_type | query
-------+----------+--------------+----------------+--------+-----------------+-------
 23456 | app_user | 00:12:34     | 00:12:34       | active | Lock            | UPDATE orders SET ...
 23457 | app_user | 00:01:23    | 00:00:00.1     | idle   |                 | COMMIT
```

`wait_event_type = Lock` means this query is waiting on a lock held by another session — it is blocked, not just slow.

If this fails: check that your user has the `pg_monitor` role or is a superuser. Without it, you can only see your own queries.
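If visibility is the problem, a grant sketch (assumes you have a superuser account available; `<DB_SUPERUSER>` and `<DB_USER>` are placeholders as elsewhere in this runbook):

```shell
# Grant the built-in pg_monitor role so the user can see all sessions' queries
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "GRANT pg_monitor TO <DB_USER>;"
```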
## Step 2: Find Lock Waiters and Their Blockers
Why: In a lock contention scenario, there is always a chain: one session holds a lock, others pile up waiting. You must kill the root blocker — killing a waiter does nothing, another waiter will immediately take its place.
```shell
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       now() - blocked_activity.xact_start AS blocked_duration,
       left(blocked_activity.query, 80) AS blocked_query,
       left(blocking_activity.query, 80) AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.relation = blocked_locks.relation
 AND blocking_locks.granted
 AND NOT blocked_locks.granted
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid;
"
```
```
 blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_duration | blocked_query
-------------+--------------+--------------+---------------+------------------+---------------
       23456 | app_user     |        12345 | app_user      | 00:05:12         | UPDATE orders...
```

`blocking_pid` is your target. Note it down.

If this fails: if no rows are returned but queries are still slow, the bottleneck may be I/O or CPU, not lock contention. Check the `wait_event_type` values: `IO` means disk-bound, `Client` means waiting for the application.
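A quick way to see where active sessions are actually waiting (a sketch; all columns are standard `pg_stat_activity` fields):

```shell
# Group active sessions by what they are waiting on
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
"
```

A large `IO` bucket points at disk, a large `Lock` bucket confirms contention, and mostly-NULL wait events mean the sessions are burning CPU.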
## Step 3: Attempt a Graceful Cancel First

Why: `pg_cancel_backend` sends SIGINT, which lets the query clean up and roll back the transaction cleanly. This is always preferable to a hard terminate because it avoids leaving the session in an inconsistent state.
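A command sketch for this step (`<BLOCKING_PID>` is the PID you noted in Step 2):

```shell
# Graceful cancel: sends SIGINT to the backend; returns 't' if the signal was sent
psql -h <DB_HOST> -U <DB_USER> -c "SELECT pg_cancel_backend(<BLOCKING_PID>);"
```

Note that a cancel only interrupts a currently running query — it does not end an idle-in-transaction session, so if the blocker is idle in transaction and still holding locks, proceed to Step 4.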
## Step 4: Terminate the Blocking Query If Cancel Fails

Why: `pg_terminate_backend` sends SIGTERM, which forces the backend process to exit immediately. The transaction is rolled back, but the mechanism is less clean than a cancel.
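A command sketch (same `<BLOCKING_PID>` as above):

```shell
# Forceful terminate: sends SIGTERM; the backend exits and its transaction rolls back
psql -h <DB_HOST> -U <DB_USER> -c "SELECT pg_terminate_backend(<BLOCKING_PID>);"
```

Afterwards, re-run the Step 2 query to confirm the lock chain has cleared and no new root blocker has appeared.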
## Step 5: Analyze the Problematic Query to Understand Why It Was Slow
Why: Terminating the query is triage, not a fix. If the same query runs again, it will block again. You must understand why it was slow to prevent recurrence.
```shell
# Get the full query text from the slow query log or pg_stat_statements
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT query, calls, mean_exec_time, max_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"

# Then EXPLAIN the specific query (replace with the actual query)
psql -h <DB_HOST> -U <DB_USER> -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<SLOW_QUERY_TEXT>;
"
```
`EXPLAIN ANALYZE` output shows the query plan. Look for `Seq Scan` on large tables (suggests a missing index) or `Nested Loop` on large result sets (suggests a join strategy issue).

If this fails: `pg_stat_statements` may not be enabled. Check `SHOW shared_preload_libraries;` — it must include `pg_stat_statements`.
## Step 6: Add an Index or Rewrite the Query to Prevent Recurrence
Why: The permanent fix is structural. Either the query needs a better execution plan (via an index) or the application code needs to be changed to avoid the long-running transaction.
```shell
# Create an index concurrently to avoid locking the table during creation
psql -h <DB_HOST> -U <DB_USER> -c "
CREATE INDEX CONCURRENTLY idx_<TABLE>_<COLUMN>
ON <SCHEMA>.<TABLE> (<COLUMN>);
"

# Set a statement timeout to auto-kill queries that exceed a threshold (add to postgresql.conf or per-role)
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "
ALTER ROLE <APP_USER> SET statement_timeout = '5min';
"
```
`CREATE INDEX CONCURRENTLY` cannot run inside a transaction. If you see that error, you are inside a `BEGIN` block — run `ROLLBACK` first, then retry.
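A related failure mode worth knowing: if a concurrent index build errors out partway, PostgreSQL leaves an INVALID index behind that must be dropped manually before retrying. A cleanup sketch (`<INDEX_NAME>` comes from the first query's output):

```shell
# Find invalid indexes left behind by a failed CREATE INDEX CONCURRENTLY
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;
"

# Drop the leftover, then retry the concurrent build
psql -h <DB_HOST> -U <DB_USER> -c "DROP INDEX CONCURRENTLY <INDEX_NAME>;"
```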
## Verification
```shell
# Confirm the issue is resolved
psql -h <DB_HOST> -U <DB_USER> -c "
SELECT count(*) AS blocked_queries
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"
```

Expected: `blocked_queries` is 0, and the query that was slow now completes in under a second.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 30 min | DBA on-call | "Lock contention on <DB_HOST>, not resolved within 30 minutes" |
| Data loss suspected | DBA / Data Lead | "Possible data loss: forced termination of a write transaction on <DB_HOST>" |
| Scope expanding | Platform team | "Lock chain growing — multiple tables affected on <DB_HOST>" |
## Post-Incident
- Update monitoring if the alert was noisy or missing
- File a postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Set `lock_timeout` and `statement_timeout` for application roles to prevent future runaway transactions
- Review application connection pool settings — idle-in-transaction sessions often come from pool misconfiguration
- Enable `log_lock_waits = on` and `deadlock_timeout = 1s` for better observability of future incidents
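The settings above can be applied roughly like this (a sketch: the timeout values are illustrative, not recommendations, and `ALTER SYSTEM` needs superuser plus a config reload; each statement runs in its own `psql -c` call because `ALTER SYSTEM` cannot run inside a transaction block):

```shell
# Per-role timeouts take effect on new connections for that role
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER ROLE <APP_USER> SET lock_timeout = '10s';"
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER ROLE <APP_USER> SET idle_in_transaction_session_timeout = '60s';"

# Log any lock wait longer than deadlock_timeout (default 1s); a reload suffices, no restart
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "ALTER SYSTEM SET log_lock_waits = on;"
psql -h <DB_HOST> -U <DB_SUPERUSER> -c "SELECT pg_reload_conf();"
```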
## Common Mistakes
- Terminating the blocked query instead of the blocking one: this does nothing to resolve the lock contention — the blocking session still holds the lock. Always identify and kill the `blocking_pid`, not the `blocked_pid`.
- Not attempting `pg_cancel_backend` before `pg_terminate_backend`: cancel gives the query a chance to roll back cleanly. Terminate is forceful and leaves less room for the application to handle the interruption gracefully.
- Confusing `wait_event_type` values: `Lock` means waiting for a database lock (another session). `IO` means waiting for disk reads/writes. `Client` means waiting for the application to send data. Each requires a different response.
- Running `EXPLAIN ANALYZE` on a destructive query against the live production table: `EXPLAIN ANALYZE` actually executes the query. If the query is a DELETE or UPDATE, wrapping it in a transaction you can roll back is essential: `BEGIN; EXPLAIN ANALYZE <QUERY>; ROLLBACK;`
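The safe pattern for that last mistake, as a sketch (`<SLOW_QUERY_TEXT>` stands in for the destructive statement under investigation):

```shell
# ROLLBACK undoes the row changes, but the statement still runs in full:
# it takes the same locks and does the same work while executing,
# so run this off-peak and keep the Step 2 query handy
psql -h <DB_HOST> -U <DB_USER> -c "
BEGIN;
EXPLAIN (ANALYZE, BUFFERS) <SLOW_QUERY_TEXT>;
ROLLBACK;
"
```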
## Cross-References
- Topic Pack: PostgreSQL Query Performance (deep background on EXPLAIN output, index types, and query planning)
- Related Runbook: `postgres-conn-exhaustion.md` — long transactions hold connections open, causing exhaustion
- Related Runbook: `postgres-disk-space.md` — bloated tables from missed VACUUM are often behind slow sequential scans
## Related Content
- PostgreSQL Operations (Topic Pack, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Connection Exhaustion (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Disk Space Critical (Runbook, L2) — PostgreSQL Operations
- Runbook: PostgreSQL Replication Lag (Runbook, L2) — PostgreSQL Operations