Skip to content

Pattern: Transaction ID Wraparound

ID: FP-030 Family: Silent Corruption Frequency: Uncommon Blast Radius: Single Service Detection Difficulty: Subtle

The Shape

PostgreSQL uses 32-bit transaction IDs (XIDs) to track visibility of rows. After approximately 2 billion transactions, the XID counter wraps around. Without VACUUM to "freeze" old row visibility information, rows become invisible (appear to not exist) or the database enters a safety shutdown. PostgreSQL proactively warns at 40 million transactions before wraparound; but if autovacuum is disabled or too slow, the warning is ignored and the database eventually becomes read-only to prevent data loss.

How You'll See It

In Linux/Infrastructure

FATAL: database is not accepting commands to avoid wraparound data loss
HINT: Stop the postmaster and vacuum that database in single-user mode.
All connections to the affected database fail. The database is alive but refusing all writes (and some reads). This manifests without any recent schema change or deployment. It's triggered by transaction volume over time, not any single event.

In Kubernetes

PostgreSQL StatefulSet: pods running, PVC healthy, but all application pods return database connection errors. kubectl exec into the postgres pod; psql shows the wraparound error. The issue is in the data layer, not the infrastructure.

In CI/CD

A high-transaction-rate test database used for parallel CI testing ages faster than production. Autovacuum can't keep up with 10,000 CI tests/hour. After 2 months, the test database hits wraparound shutdown.

The Tell

PostgreSQL logs: WARNING: database "mydb" must be vacuumed within N transactions or FATAL: database is not accepting commands to avoid wraparound data loss. SELECT age(datfrozenxid), datname FROM pg_database; shows age approaching 2,000,000,000. Autovacuum was disabled or lagging significantly.

Common Misdiagnosis

Looks Like But Actually How to Tell the Difference
Database crash Wraparound safety shutdown DB process running; psql connects but refuses commands with HINT about vacuum
OOM Wraparound Memory fine; specific error message references wraparound
Replication issue Local XID wraparound Replica healthy; only primary affected

The Fix (Generic)

  1. Immediate: Stop the database service; start in single-user mode: postgres --single -D /var/lib/postgresql/data mydb; run VACUUM FREEZE; manually.
  2. Short-term: Re-enable autovacuum; tune autovacuum_vacuum_cost_limit and autovacuum_naptime for high-transaction databases.
  3. Long-term: Monitor age(datfrozenxid) in Prometheus; alert when age exceeds 1.5 billion; never disable autovacuum without a clear alternative freezing strategy.

Real-World Examples

  • Example 1: High-traffic SaaS: autovacuum was disabled "for performance" during a migration and never re-enabled. After 6 months of heavy traffic, the database entered wraparound safety mode on a Saturday morning. Emergency VACUUM FREEZE took 4 hours.
  • Example 2: Analytics database processing 50 million transactions/day. Autovacuum was too slow; age reached 1.9 billion. Got WARNING: database must be vacuumed within 100,000,000 transactions. Tuned autovacuum aggressiveness; ran manual VACUUM FREEZE ANALYZE to buy time.

War Story

Saturday 6am page: "database completely down." Postgres was running; all connections returned FATAL with a hint about vacuum. None of us had seen this before. Stack Overflow confirmed: this is XID wraparound. The database had been processing 20 million rows/day since launch (18 months). Autovacuum had been disabled after it was blamed (incorrectly) for a performance issue. We spent 3 hours in single-user mode running VACUUM FREEZE. During that time: complete production outage. The warning had appeared in the logs for weeks; no one had an alert for it. We added SELECT age(datfrozenxid) FROM pg_database WHERE datname=current_database() to our Prometheus monitoring that same weekend.

Cross-References

  • Topic Packs: database-ops
  • Footguns: database-ops/footguns.md — "Ignoring VACUUM in PostgreSQL"
  • Related Patterns: FP-025 (untested backup — another silent accumulation that explodes in disaster), FP-007 (tmpfs RAM — same "gradual accumulation" pattern shape)