Pattern: Transaction ID Wraparound¶
ID: FP-030 Family: Silent Corruption Frequency: Uncommon Blast Radius: Single Service Detection Difficulty: Subtle
The Shape¶
PostgreSQL uses 32-bit transaction IDs (XIDs) to track visibility of rows. After approximately 2 billion transactions, the XID counter wraps around. Without VACUUM to "freeze" old row visibility information, rows become invisible (appear to not exist) or the database enters a safety shutdown. PostgreSQL proactively warns at 40 million transactions before wraparound; but if autovacuum is disabled or too slow, the warning is ignored and the database eventually becomes read-only to prevent data loss.
How You'll See It¶
In Linux/Infrastructure¶
FATAL: database is not accepting commands to avoid wraparound data loss
HINT: Stop the postmaster and vacuum that database in single-user mode.
In Kubernetes¶
PostgreSQL StatefulSet: pods running, PVC healthy, but all application pods return
database connection errors. kubectl exec into the postgres pod; psql shows the
wraparound error. The issue is in the data layer, not the infrastructure.
In CI/CD¶
A high-transaction-rate test database used for parallel CI testing ages faster than production. Autovacuum can't keep up with 10,000 CI tests/hour. After 2 months, the test database hits wraparound shutdown.
The Tell¶
PostgreSQL logs:
WARNING: database "mydb" must be vacuumed within N transactionsorFATAL: database is not accepting commands to avoid wraparound data loss.SELECT age(datfrozenxid), datname FROM pg_database;shows age approaching 2,000,000,000. Autovacuum was disabled or lagging significantly.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Database crash | Wraparound safety shutdown | DB process running; psql connects but refuses commands with HINT about vacuum |
| OOM | Wraparound | Memory fine; specific error message references wraparound |
| Replication issue | Local XID wraparound | Replica healthy; only primary affected |
The Fix (Generic)¶
- Immediate: Stop the database service; start in single-user mode:
postgres --single -D /var/lib/postgresql/data mydb; runVACUUM FREEZE;manually. - Short-term: Re-enable autovacuum; tune
autovacuum_vacuum_cost_limitandautovacuum_naptimefor high-transaction databases. - Long-term: Monitor
age(datfrozenxid)in Prometheus; alert when age exceeds 1.5 billion; never disable autovacuum without a clear alternative freezing strategy.
Real-World Examples¶
- Example 1: High-traffic SaaS: autovacuum was disabled "for performance" during a migration and never re-enabled. After 6 months of heavy traffic, the database entered wraparound safety mode on a Saturday morning. Emergency VACUUM FREEZE took 4 hours.
- Example 2: Analytics database processing 50 million transactions/day. Autovacuum was too slow; age reached 1.9 billion. Got
WARNING: database must be vacuumed within 100,000,000 transactions. Tuned autovacuum aggressiveness; ran manualVACUUM FREEZE ANALYZEto buy time.
War Story¶
Saturday 6am page: "database completely down." Postgres was running; all connections returned FATAL with a hint about vacuum. None of us had seen this before. Stack Overflow confirmed: this is XID wraparound. The database had been processing 20 million rows/day since launch (18 months). Autovacuum had been disabled after it was blamed (incorrectly) for a performance issue. We spent 3 hours in single-user mode running VACUUM FREEZE. During that time: complete production outage. The warning had appeared in the logs for weeks; no one had an alert for it. We added
SELECT age(datfrozenxid) FROM pg_database WHERE datname=current_database()to our Prometheus monitoring that same weekend.
Cross-References¶
- Topic Packs: database-ops
- Footguns: database-ops/footguns.md — "Ignoring VACUUM in PostgreSQL"
- Related Patterns: FP-025 (untested backup — another silent accumulation that explodes in disaster), FP-007 (tmpfs RAM — same "gradual accumulation" pattern shape)