Database Ops Footguns¶

Mistakes that lose data, corrupt databases, or turn a minor issue into a major outage.

1. Backup that's never been tested¶

You have automated backups running every night. They've been running for 2 years. One day you need to restore. The backup is corrupted. Or the restore process doesn't work. Or the backup only captures the schema, not the data.

Fix: Test restores monthly. Restore to a test environment and verify data integrity. Alert on backup job failures. Verify backup size — a suddenly small backup means something is wrong.

2. Running ALTER TABLE on a large table during peak traffic¶

You run ALTER TABLE users ADD COLUMN phone VARCHAR(20) on a 50-million-row table. PostgreSQL takes a heavy lock. All queries to that table block for 30 minutes. Your application times out, connection pool exhausts, cascading failures.

Fix: Use pt-online-schema-change (MySQL) or CREATE INDEX CONCURRENTLY (PostgreSQL). For column additions in PostgreSQL, ALTER TABLE ... ADD COLUMN with a default is instant in PG 11+. Schedule heavy migrations during maintenance windows.

3. StatefulSet with wrong `podManagementPolicy`¶

Your StatefulSet uses OrderedReady (default). To scale from 1 to 5, pods launch one at a time, each waiting for the previous to be Ready. Pod 2 needs pod 1 as a replication source. But pod 1 isn't ready because it's waiting for the initial data load. Deadlock.

Fix: Use podManagementPolicy: Parallel when pods don't have strict ordering requirements. Or ensure each pod can start independently and join the cluster asynchronously.

4. Connection pool exhaustion¶

Your application creates a new database connection for every request. Under load, you hit the database's max_connections limit (default 100 in PostgreSQL). New requests can't get a connection. The application returns 500 errors while the database is barely utilized.

Fix: Use connection pooling (PgBouncer, ProxySQL). Set pool size based on max_connections / number_of_app_instances. Monitor idle_in_transaction connections — they hold resources without doing work.

5. Deleting a PVC thinking you can get the data back¶

You delete a PVC with reclaimPolicy: Delete (the default for most cloud StorageClasses). The underlying EBS volume or persistent disk is immediately deleted. Your data is gone. There is no undo.

Fix: Set reclaimPolicy: Retain on StorageClasses used for databases. Use volumeClaimTemplates in StatefulSets (these PVCs survive pod deletion). Back up before any PVC operations.

6. Running `DROP DATABASE` on the wrong connection¶

You have two terminal tabs open — one connected to prod, one to staging. You type DROP DATABASE app in the wrong one. Production database is gone.

Fix: Use different prompts for different environments. Set \set PROMPT1 '%[%033[1;31m%]PRODUCTION %/%R%# %[%033[0m%]' in psql for production. Use read-only users for most access.

7. Failover without checking replication lag¶

Your primary database crashes. The operator promotes the replica. The replica was 5 minutes behind. You just lost 5 minutes of transactions. Users see their recent orders disappear.

Fix: Monitor replication lag continuously. Alert when lag exceeds your RPO (Recovery Point Objective). Use synchronous replication for zero data loss (at the cost of latency). Understand your RPO and communicate it.

8. No PITR, only full backups¶

You take full backups daily. A developer runs a bad migration at 3pm that deletes half the data. Your last full backup is from midnight. You lose 15 hours of data.

Fix: Enable continuous WAL archiving (PostgreSQL) or binlog shipping (MySQL) for Point-in-Time Recovery. You can restore to any second between backups.

9. Scaling read replicas without checking query patterns¶

You add read replicas to handle more load. But your application sends all queries to the primary because the ORM doesn't support read/write splitting. Or writes go to replicas because the application doesn't distinguish read and write connections.

Fix: Configure your ORM or connection library for read/write splitting. Use a proxy like PgBouncer with routing rules. Verify queries are actually hitting replicas with pg_stat_replication.

10. Ignoring `VACUUM` in PostgreSQL¶

You disable autovacuum because "it causes I/O spikes." Transaction ID wraparound approaches. PostgreSQL enters safety shutdown mode and refuses all writes to prevent data corruption. Your production database is now read-only.

Fix: Never disable autovacuum. Tune it instead — adjust autovacuum_vacuum_cost_limit for less aggressive I/O. Monitor pg_stat_user_tables.n_dead_tup and age(datfrozenxid).