Anti-Primer: Database Ops¶
Everything that can go wrong, will — and in this story, it does.
The Setup¶
A DBA is performing maintenance on a production Database Ops database serving a high-traffic application. The maintenance window is 2 hours, and the team has not rehearsed the procedure. The DBA is working alone at midnight.
The Timeline¶
Hour 0: No Backup Before Migration¶
Starts a schema migration without taking a backup first. The deadline was looming, and this seemed like the fastest path forward. But the result is migration fails halfway; data is in an inconsistent state; no clean restore point.
Footgun #1: No Backup Before Migration — starts a schema migration without taking a backup first, leading to migration fails halfway; data is in an inconsistent state; no clean restore point.
Nobody notices yet. The engineer moves on to the next task.
Hour 1: Long-Running Lock¶
Runs an ALTER TABLE on a large table without understanding the locking implications. Under time pressure, the team chose speed over caution. But the result is table is locked for 30 minutes; all application queries queue; users see timeouts and errors.
Footgun #2: Long-Running Lock — runs an ALTER TABLE on a large table without understanding the locking implications, leading to table is locked for 30 minutes; all application queries queue; users see timeouts and errors.
The first mistake is still invisible, making the next shortcut feel justified.
Hour 2: Missing Index on New Query¶
Deploys application code with a new query that hits an unindexed column. Nobody pushed back because the shortcut looked harmless in the moment. But the result is full table scan on every request; database CPU hits 100%; all queries slow to a crawl.
Footgun #3: Missing Index on New Query — deploys application code with a new query that hits an unindexed column, leading to full table scan on every request; database CPU hits 100%; all queries slow to a crawl.
Pressure is mounting. The team is behind schedule and cutting more corners.
Hour 3: Connection Pool Exhaustion¶
Does not configure connection pool limits; each microservice opens unlimited connections. The team had gotten away with similar shortcuts before, so nobody raised a flag. But the result is database hits max_connections; new requests are rejected; cascading failures across all services.
Footgun #4: Connection Pool Exhaustion — does not configure connection pool limits; each microservice opens unlimited connections, leading to database hits max_connections; new requests are rejected; cascading failures across all services.
By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.
The Postmortem¶
Root Cause Chain¶
| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---|---|---|
| 1 | No Backup Before Migration | Migration fails halfway; data is in an inconsistent state; no clean restore point | Primer: Always take and verify a backup before any schema change or migration |
| 2 | Long-Running Lock | Table is locked for 30 minutes; all application queries queue; users see timeouts and errors | Primer: Use online DDL tools (pt-online-schema-change, gh-ost) for large tables |
| 3 | Missing Index on New Query | Full table scan on every request; database CPU hits 100%; all queries slow to a crawl | Primer: Review query plans for new queries; add indexes before deploying code that needs them |
| 4 | Connection Pool Exhaustion | Database hits max_connections; new requests are rejected; cascading failures across all services | Primer: Configure connection pool size per service; use connection pooling proxies (PgBouncer, ProxySQL) |
Damage Report¶
- Downtime: 2-8 hours of database unavailability or degraded performance
- Data loss: High risk: data corruption, partial writes, or permanent loss possible
- Customer impact: Transaction failures, stale reads, or complete service outage for all users
- Engineering time to remediate: 16-32 engineer-hours for recovery, verification, and data reconciliation
- Reputation cost: Severe: customer data integrity concerns; possible regulatory notification
What the Primer Teaches¶
- Footgun #1: If the engineer had read the primer, section on no backup before migration, they would have learned: Always take and verify a backup before any schema change or migration.
- Footgun #2: If the engineer had read the primer, section on long-running lock, they would have learned: Use online DDL tools (pt-online-schema-change, gh-ost) for large tables.
- Footgun #3: If the engineer had read the primer, section on missing index on new query, they would have learned: Review query plans for new queries; add indexes before deploying code that needs them.
- Footgun #4: If the engineer had read the primer, section on connection pool exhaustion, they would have learned: Configure connection pool size per service; use connection pooling proxies (PgBouncer, ProxySQL).
Cross-References¶
- Primer — The right way
- Footguns — The mistakes catalogued
- Street Ops — How to do it in practice