Database Operations — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about database operations.
The largest database migrations can take years to complete¶
GitHub's migration from MySQL's unsigned int to bigint for primary keys took over a year of engineering work, completing in 2023. The migration had to be done online with zero downtime on tables containing billions of rows, processing roughly 300 million rows per day. They built a custom tool called gh-ost specifically for online schema migrations.
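The core trick behind such zero-downtime migrations is copying rows in small primary-key chunks, committing between chunks so no long lock is ever held. A greatly simplified sketch of that chunked-copy idea (table and column names are hypothetical, and SQLite stands in for MySQL so the example runs; gh-ost additionally tails the binlog to catch writes that land mid-copy):

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Copy rows from `events` into `events_new` in primary-key
    ranges, so no single transaction touches the table for long."""
    last_id = 0
    copied = 0
    while True:
        cur = conn.execute(
            "INSERT INTO events_new (id, payload) "
            "SELECT id, payload FROM events "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )
        if cur.rowcount <= 0:     # nothing left to copy
            break
        copied += cur.rowcount
        last_id = conn.execute(
            "SELECT MAX(id) FROM events_new").fetchone()[0]
        conn.commit()             # release locks between chunks
    return copied
```

At 300 million rows per day, billions of rows take weeks of this loop alone, before accounting for verification and cutover.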
Amazon's DynamoDB was born from a Christmas Day outage¶
Amazon's 2004 holiday season saw database-related outages that cost millions in lost sales. This led to the development of Dynamo (described in the famous 2007 paper) and eventually DynamoDB (launched 2012). Werner Vogels, Amazon's CTO, has said the company's database operational pain directly motivated the design of a "never go down" key-value store.
Google runs database operations across 5 continents simultaneously¶
Google's Spanner database, described in a 2012 paper, uses GPS receivers and atomic clocks in every data center to synchronize time across globally distributed nodes. This allows Spanner to provide externally consistent transactions across continents, with clock uncertainty bounded to single-digit milliseconds. The TrueTime API it depends on was considered so unusual that many researchers initially doubted it was practical.
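The key idea is that TrueTime returns an interval guaranteed to contain true wall-clock time, and Spanner "commit-waits" until its chosen timestamp is definitely in the past. A minimal sketch, assuming a fixed uncertainty bound (the real TrueTime derives its bound dynamically from the GPS and atomic-clock references):

```python
import time

# Hypothetical fixed uncertainty bound; real TrueTime computes this
# continuously, typically a few milliseconds.
EPSILON = 0.004  # seconds

def tt_now():
    """Return an (earliest, latest) interval guaranteed to contain
    true wall-clock time, TrueTime-style."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    """Spanner's commit-wait rule: pick the commit timestamp s at the
    top of the uncertainty interval, then wait until s is provably in
    the past before making the write visible. This is what makes
    commit timestamps externally consistent."""
    s = tt_now()[1]            # commit timestamp
    while tt_now()[0] <= s:    # wait out the uncertainty window
        time.sleep(EPSILON / 4)
    return s
```

The wait costs roughly two uncertainty intervals per commit, which is why keeping clock uncertainty small matters so much to Spanner's throughput.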
A missing index once cost a company $1 million per day in cloud compute¶
A well-documented case study describes a SaaS company whose primary query suddenly started full-table scanning after a statistics update changed the query plan. The resulting CPU spike required scaling from 4 to 64 database instances. The root cause was a missing composite index that would have cost zero dollars to create.
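The before-and-after effect of such an index is easy to see in a query plan. A runnable sketch using SQLite as a stand-in (table, columns, and the query shape are hypothetical; the principle is the same in any relational database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER,
    created_at TEXT,
    total REAL)""")

query = ("SELECT total FROM orders "
         "WHERE tenant_id = ? AND created_at >= ?")

def plan(sql, params):
    # EXPLAIN QUERY PLAN shows how SQLite intends to execute the query
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(r[-1] for r in rows)

before = plan(query, (42, "2024-01-01"))  # full table scan

# Composite index matching the equality predicate (tenant_id) followed
# by the range predicate (created_at).
conn.execute("CREATE INDEX idx_orders_tenant_created "
             "ON orders (tenant_id, created_at)")

after = plan(query, (42, "2024-01-01"))   # index search
```

On a billion-row table, the difference between these two plans is the difference between 4 instances and 64.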
Database connection pooling was invented because TCP handshakes were too slow¶
PgBouncer, one of the most popular PostgreSQL connection poolers, was created in 2007 because establishing a new PostgreSQL connection is expensive: beyond the TCP handshake, the server must fork a new backend process, which takes roughly 10-20ms in total. A busy application opening 1,000 connections per second would spend nearly all of its time on connection setup. PgBouncer keeps a pool of already-established server connections and hands them out to clients, cutting the effective connection time to under 1ms.
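The pooling idea itself fits in a few lines: pay the setup cost once, then check connections in and out of a queue. A minimal illustrative sketch (sqlite3 stands in for PostgreSQL so the example is self-contained; PgBouncer does the same thing at the network protocol level):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal connection pool: connections are created once up front,
    then reused, so callers never pay the setup cost again."""

    def __init__(self, make_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(make_conn())  # expensive setup, done once

    def acquire(self):
        return self._pool.get()          # cheap: reuse, don't create

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=3)
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```

Real poolers add what this sketch omits: health checks on idle connections, transaction-level vs. session-level pooling, and blocking or erroring when the pool is exhausted.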
The "VACUUM" command in PostgreSQL was considered a design flaw for years¶
PostgreSQL's need for periodic VACUUM operations to reclaim dead tuple space was criticized as an operational burden throughout the 2000s. Autovacuum was added in version 8.1 (2005), but it was disabled by default. It was not enabled by default until version 8.3 (2008). Even today, tuning autovacuum parameters remains one of the most common PostgreSQL operational tasks.
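A typical starting point for that tuning looks like the fragment below. The values are illustrative examples, not recommendations; the right settings depend on table sizes and write rates:

```ini
# postgresql.conf — illustrative autovacuum settings
autovacuum = on                         # the default since 8.3
autovacuum_vacuum_scale_factor = 0.05   # vacuum once ~5% of rows are dead;
                                        # the default 0.2 is often too lazy
                                        # for very large tables
autovacuum_vacuum_cost_limit = 1000     # let autovacuum do more work per
                                        # cycle (defaults to vacuum_cost_limit,
                                        # which is 200)
autovacuum_naptime = 30s                # check for vacuum work more often
                                        # (default 1min)
```

Scale-factor settings can also be overridden per table with `ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = ...)`, which is common for a handful of hot tables.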
Point-in-time recovery has saved companies from extinction¶
In 2017, GitLab suffered a database incident in which an engineer accidentally ran rm -rf on a production PostgreSQL data directory. Five backup methods failed, but one worked: an LVM snapshot taken roughly six hours earlier. Restoring from it still cost about six hours of production data, a loss that working point-in-time recovery would have prevented. The recovery effort was livestreamed on YouTube and watched by thousands of engineers.
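Point-in-time recovery in PostgreSQL works by restoring a base backup and replaying archived write-ahead log up to a chosen moment. A hedged sketch of the relevant settings (PostgreSQL 12+ style; the paths and timestamp are hypothetical):

```ini
# postgresql.conf — illustrative point-in-time recovery settings;
# also requires a restored base backup and an empty recovery.signal
# file in the data directory
restore_command = 'cp /archive/%f %p'           # fetch archived WAL segments
recovery_target_time = '2017-01-31 17:20:00'    # replay up to just before
                                                # the destructive command
recovery_target_action = 'promote'              # then start accepting writes
```

The prerequisite, of course, is continuous WAL archiving (`archive_mode` and `archive_command`) having been enabled before the incident.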
Database failover is the most rehearsed and least trusted operation in SRE¶
Despite automation tools like Patroni, Orchestrator, and RDS Multi-AZ, database failover remains the operation that SRE teams rehearse most frequently and trust least. A 2022 Percona survey found that 40% of organizations had experienced at least one failed database failover in the past year, with split-brain scenarios being the most feared outcome.
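One standard defense against split-brain is fencing: each newly promoted primary receives a higher epoch (fencing token), and the storage layer rejects writes from stale epochs, so a deposed primary that still believes it leads cannot corrupt data. A hypothetical in-memory sketch of the idea (Patroni achieves a similar effect with leader leases in a consensus store):

```python
class FencedStore:
    """Storage that accepts writes only from the highest epoch seen,
    rejecting any node still writing with a stale fencing token."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            raise PermissionError(
                f"stale epoch {epoch}, current is {self.highest_epoch}")
        self.highest_epoch = epoch
        self.data[key] = value

store = FencedStore()
store.write(epoch=1, key="balance", value=100)  # original primary
store.write(epoch=2, key="balance", value=250)  # new primary after failover
```

A write from the old primary with `epoch=1` now raises `PermissionError` instead of silently diverging the dataset, which is exactly the outcome split-brain scenarios threaten.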
Read replicas were popularized by a 2004 LiveJournal blog post¶
Brad Fitzpatrick of LiveJournal wrote extensively in 2003-2004 about using MySQL read replicas to scale their social networking site. His posts and presentations at conferences like OSCON popularized the master-replica pattern that became the default scaling architecture for MySQL-backed web applications throughout the Web 2.0 era.
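The routing rule at the heart of that pattern is simple: all writes go to the single master, reads fan out across replicas. A hypothetical in-memory sketch (real deployments route at the driver or proxy layer, and must also cope with replication lag, which this toy ignores by replicating instantly):

```python
import random

class ReplicatedDB:
    """Toy master-replica setup: one writable master, N read replicas."""

    def __init__(self, n_replicas=2):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        self.master[key] = value
        for r in self.replicas:   # "replication" — instantaneous here,
            r[key] = value        # asynchronous in real MySQL

    def read(self, key):
        # spread read load across replicas
        return random.choice(self.replicas).get(key)

db = ReplicatedDB()
db.write("user:1", "brad")
assert db.read("user:1") == "brad"
```

Adding replicas scales read throughput almost linearly, which is why the pattern fit read-heavy Web 2.0 workloads so well; writes remain bottlenecked on the single master.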