
Database Internals Footguns

Mistakes that cause outages, data loss, or silent corruption in database operations.


1. Running ALTER TABLE on a Large Table Without Planning

You add a column to a 500M row table during business hours. The operation takes a table-level lock for 45 minutes. Every query to that table queues behind it. Your application throws connection pool exhaustion errors. Users see timeouts.

Fix: Use pt-online-schema-change (MySQL). In PostgreSQL, set a lock_timeout so DDL fails fast instead of queueing everything behind it, and use CREATE INDEX CONCURRENTLY for index builds. Schedule schema changes during maintenance windows. Test on a staging replica first.

Gotcha: In PostgreSQL, every ALTER TABLE takes an ACCESS EXCLUSIVE lock — what matters is how long it's held. Operations that rewrite the table (changing a column's type, adding a column with a volatile default) hold it for the entire rewrite, blocking all reads and writes. Adding a nullable column with no default — and, since PostgreSQL 11, adding a column with a constant default — is a metadata-only change that holds the lock only briefly. In MySQL, pt-online-schema-change works by creating a shadow table, syncing with triggers, and swapping — avoiding long locks entirely.
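A minimal PostgreSQL sketch of the safe pattern (table and index names are illustrative):

```sql
-- Fail fast instead of queueing every query behind the ACCESS EXCLUSIVE lock:
SET lock_timeout = '5s';

-- Metadata-only on PostgreSQL 11+ (constant default): no table rewrite.
ALTER TABLE orders ADD COLUMN notes text;

-- Index builds can avoid the long lock entirely
-- (note: CONCURRENTLY cannot run inside a transaction block):
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```

If the ALTER can't grab the lock within 5 seconds it aborts, and you retry later instead of stalling production traffic.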


2. No Connection Pooling

Each application instance opens its own connections. With 20 pods × 10 connections each, you hit PostgreSQL's max_connections limit. New connections are refused. The database isn't overloaded — it just can't accept more clients.

Fix: Use PgBouncer or ProxySQL in front of the database. Set max_connections appropriately. Monitor pg_stat_activity for connection counts.

Default trap: PostgreSQL's default max_connections is 100, and each connection is a backend process consuming roughly 5-10MB of RAM (process overhead, temp buffers, plus whatever work_mem its queries allocate). At 200 pods × 10 connections, you need 2,000 connections — 10-20GB just for connection overhead. PgBouncer in transaction mode lets those 2,000 application connections share 50 actual database connections.
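A minimal PgBouncer config fragment for transaction pooling (database name, host, and sizes are illustrative — tune them to your workload):

```ini
; pgbouncer.ini — transaction-pooling sketch
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
pool_mode = transaction
max_client_conn = 2000    ; application-side connections PgBouncer accepts
default_pool_size = 50    ; actual server connections per database/user pair
```

Transaction mode hands a server connection to a client only for the duration of a transaction, which is what makes the 2,000-to-50 ratio work — but it breaks session-level features like prepared statements and SET, so verify your driver is compatible.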


3. Missing Indexes on Foreign Keys

You run DELETE FROM customers WHERE id = 123. To enforce the foreign key, the database must check orders for referencing rows — and without an index on orders.customer_id, that check is a sequential scan of a 100M row table. ON DELETE CASCADE amplifies it: every cascaded delete repeats the full scan against any grandchild table that's also missing its FK index.

Fix: Always index foreign key columns. Run EXPLAIN ANALYZE on DELETE/UPDATE queries that reference parent tables.
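A PostgreSQL sketch of the fix (schema names are hypothetical):

```sql
-- Index the referencing (child) column so parent deletes/updates
-- don't sequentially scan the child table:
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- EXPLAIN ANALYZE actually executes the statement, so test the DELETE
-- inside a transaction you roll back:
BEGIN;
EXPLAIN ANALYZE DELETE FROM customers WHERE id = 123;
ROLLBACK;
```

In the plan output, look for a sequential scan on orders inside the FK trigger time — after the index exists it should become an index scan.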


4. Trusting Replication Lag is "Close Enough"

Your read replica is 30 seconds behind. A user creates an order, then immediately views their orders list (routed to replica). They see nothing. They create it again. Now they have duplicate orders.

Fix: Route writes AND subsequent reads to the primary for a configurable window. Monitor pg_stat_replication lag. Alert on lag > N seconds.

Remember: This is "read-your-writes consistency." Common pattern: after a write, set a cookie or session flag with a timestamp. For subsequent reads within N seconds, route to the primary; after N seconds, route to the replica. Libraries such as makara (Ruby) implement this routing automatically.
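The routing logic above can be sketched in a few lines of Python. This is a minimal in-memory version — real apps would keep the last-write timestamp in a cookie or session store, and the class and method names here are illustrative, not from any particular library:

```python
import time

class ReplicaRouter:
    """Route a user's reads to the primary for a window after their writes,
    so they always see their own changes despite replication lag."""

    def __init__(self, window_seconds=30, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock          # injectable for testing
        self._last_write = {}       # user_id -> timestamp of last write

    def record_write(self, user_id):
        # Call this after every write on behalf of this user.
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        # Within the window the replica may not have the write yet.
        wrote_at = self._last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.window:
            return "primary"
        return "replica"
```

Size the window from your observed replication lag plus a safety margin; if lag regularly exceeds the window, fix the lag rather than widening the window forever.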


5. VACUUM Never Runs (PostgreSQL)

Autovacuum is disabled or can't keep up. Dead tuples accumulate. Table bloat grows from 10GB to 200GB. Query performance degrades as scans wade through dead tuples and the planner works from stale statistics. Eventually transaction ID wraparound forces an emergency shutdown to prevent data loss.

Fix: Never disable autovacuum. Tune autovacuum_vacuum_scale_factor for large tables. Monitor n_dead_tup in pg_stat_user_tables. Manual VACUUM VERBOSE if bloat is severe.

War story: Mandrill (Mailchimp's transactional email service) suffered a multi-day outage in February 2019 when autovacuum fell behind on a busy shard, triggering PostgreSQL's transaction ID wraparound protection. The database went read-only to prevent data corruption. Recovery required killing long-running transactions and running aggressive manual VACUUM FREEZE operations. PostgreSQL stops accepting writes when it gets within roughly one million transactions of the ~2-billion XID limit — a hard safety stop, not a soft warning.
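Two monitoring queries worth alerting on — both use standard PostgreSQL catalog views:

```sql
-- Worst bloat candidates: dead-tuple counts per table.
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Distance to wraparound per database. Alert long before this
-- approaches the ~2-billion XID limit.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

A common practice is to page when xid_age crosses a few hundred million, leaving ample runway for a manual VACUUM FREEZE.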


6. Optimistic Locking Without Retry Logic

Two users edit the same record. Both read version 1. First saves (version → 2). Second gets "version conflict" error. Your app shows a generic 500 error instead of asking the user to retry.

Fix: Implement retry-with-merge in the application. Show the user what changed and let them resolve conflicts. Don't swallow optimistic lock exceptions.
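The retry loop can be sketched with a simple in-memory compare-and-set; the `Record` class and function names are illustrative, standing in for an ORM's versioned save:

```python
class VersionConflict(Exception):
    pass

class Record:
    def __init__(self, data):
        self.version = 1
        self.data = data

def save(record, expected_version, new_data):
    # Compare-and-set: refuse to overwrite a newer version silently.
    if record.version != expected_version:
        raise VersionConflict(record.version)
    record.data = new_data
    record.version += 1

def update_with_retry(record, mutate, max_attempts=3):
    """Re-read the current state, re-apply the user's change via
    mutate(), and retry if someone else saved in between."""
    for _ in range(max_attempts):
        seen_version = record.version
        seen_data = dict(record.data)        # fresh read
        try:
            save(record, seen_version, mutate(seen_data))
            return record.version
        except VersionConflict:
            continue  # another writer won the race; re-read and retry
    raise VersionConflict(record.version)    # surface after exhausting retries
```

The key point the section makes: `mutate` is re-applied on top of the *latest* data on each attempt, so the retry merges rather than clobbers — and after max_attempts the conflict is surfaced to the user instead of swallowed.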


7. Long-Running Transactions Blocking Everything

A developer opens psql, runs BEGIN, does a SELECT, then goes to lunch. That open transaction pins the oldest-snapshot horizon: VACUUM can't remove any dead tuple created since it began — in any table in the database, not just the ones it touched — so bloat grows everywhere until the session ends.

Fix: Set idle_in_transaction_session_timeout. Monitor pg_stat_activity for transactions open > 5 minutes. Kill idle transactions automatically.

Debug clue: Run SELECT pid, now() - xact_start AS duration, state, query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY duration DESC; to find the offenders. An idle in transaction session holding a snapshot prevents VACUUM from cleaning dead tuples in any table in the database, causing bloat to grow without bound.
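The timeout fix from above, as a PostgreSQL configuration sketch (the 60-second value is illustrative — pick one that matches your workload):

```sql
-- Automatically terminate sessions idle inside a transaction > 60s:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();

-- One-off: terminate a specific offender found via pg_stat_activity.
-- SELECT pg_terminate_backend(<pid>);
```

Note that terminating an idle-in-transaction session rolls back its open transaction, which is exactly what you want for a forgotten psql window but worth communicating to the team first.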


8. Restoring a Backup Without Testing It

Your backup cron runs nightly. It hasn't been tested in 6 months. When you need it, the restore fails — the backup was silently corrupt, or the restore procedure doesn't match the current schema version.

Fix: Automate weekly restore tests to a staging environment. Verify row counts and checksums after restore. Document the exact restore procedure.


9. SELECT * in Application Queries

Your ORM generates SELECT * for every query. When someone adds a 10MB BLOB column, every list query now fetches 10MB per row. Memory usage spikes. Response times go from 50ms to 5 seconds.

Fix: Explicitly select only needed columns. Use EXPLAIN ANALYZE to catch queries pulling unnecessary data. Audit ORM-generated SQL regularly.
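A before/after sketch in SQL (table and columns are hypothetical):

```sql
-- Before: fetches every column, including the 10MB attachment blob,
-- for all 50 rows of a list view.
SELECT * FROM orders ORDER BY created_at DESC LIMIT 50;

-- After: name only what the list view renders; the blob stays on disk.
SELECT id, customer_id, status, created_at
FROM orders
ORDER BY created_at DESC
LIMIT 50;
```

Most ORMs support this — look for a "select only these columns" or deferred-column-loading option and make it the default for list endpoints.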


10. Failing Over to a Lagging Replica

Primary dies. You promote the replica that's 2 minutes behind. Those 2 minutes of transactions are lost — including payments, signups, and state changes. Customers were charged but have no orders.

Fix: Use synchronous replication for critical data (accept the latency cost). Check replay_lsn vs sent_lsn before promotion. Have a documented recovery procedure for the gap.
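A PostgreSQL query for measuring each replica's gap, run on the primary — standard pg_stat_replication columns, so it belongs in your routine monitoring well before you ever need to promote:

```sql
-- How far behind is each replica, in bytes of WAL not yet replayed?
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_gap_bytes
FROM pg_stat_replication;
```

A nonzero gap at promotion time is exactly the window of transactions the section describes losing; if the primary is already dead, this view is gone with it — which is the argument for alerting on the gap continuously rather than checking it during the incident.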