Database Internals — Trivia & Interesting Facts

Surprising, historical, and little-known facts about database internals.


The B-tree was invented in 1970 and still dominates database indexing

Rudolf Bayer and Edward McCreight invented the B-tree at Boeing Scientific Research Labs in 1970. Over 50 years later, B-trees (and their B+ tree variants) remain the primary index structure in virtually every relational database. The "B" in B-tree has never been officially defined — theories include "Bayer," "Boeing," "balanced," or "broad."


PostgreSQL's MVCC design means deleted rows are not actually deleted

In PostgreSQL's Multi-Version Concurrency Control implementation, a DELETE operation does not remove the row from disk. Instead, it marks the row's "xmax" field with the deleting transaction's ID. The dead row remains on disk until VACUUM reclaims the space. This is why unvacuumed PostgreSQL tables can grow to many times their logical size — a phenomenon called "table bloat."
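The xmin/xmax mechanics can be sketched in a few lines. This is a deliberately simplified model (hypothetical names; it ignores commit status, snapshots of in-progress transactions, and transaction-ID wraparound), but it shows why a DELETE leaves the tuple in place:

```python
INVALID_XID = 0  # stands in for PostgreSQL's "not set" transaction ID

class HeapTuple:
    """A simplified row version: xmin created it, xmax deleted it (0 = live)."""
    def __init__(self, data, xmin, xmax=INVALID_XID):
        self.data = data
        self.xmin = xmin   # ID of the inserting transaction
        self.xmax = xmax   # ID of the deleting transaction, or 0 if not deleted

def visible(tup, snapshot_xid):
    """A tuple is visible if it was inserted before the snapshot
    and not yet deleted as of that snapshot (commit status ignored)."""
    inserted_before = tup.xmin <= snapshot_xid
    deleted_before = tup.xmax != INVALID_XID and tup.xmax <= snapshot_xid
    return inserted_before and not deleted_before

def delete(tup, xid):
    """DELETE does not remove the tuple; it only stamps xmax."""
    tup.xmax = xid

row = HeapTuple("alice", xmin=100)
delete(row, xid=200)   # "deleted" by transaction 200

# An older snapshot still sees the row; a newer one does not --
# yet the tuple itself is still sitting on "disk" until VACUUM.
assert visible(row, snapshot_xid=150)
assert not visible(row, snapshot_xid=250)
```

Until something like VACUUM walks the table and reclaims tuples whose xmax is older than every live snapshot, the dead version keeps occupying space.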


Write-Ahead Logging was formalized in a 1992 ARIES paper

The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) paper by C. Mohan et al. at IBM Research in 1992 formalized Write-Ahead Logging (WAL) recovery. At 68 pages, it is one of the most cited papers in database research. Nearly every modern database — from PostgreSQL to SQLite to MySQL's InnoDB — implements some variant of ARIES-style WAL recovery.
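The core write-ahead rule is small enough to sketch, even though the full ARIES protocol (analysis/redo/undo passes, CLRs, fuzzy checkpoints) is far richer. In this toy model (all names illustrative), a change is appended to the log before its data page may be flushed, so a redo pass can replay work lost in a crash:

```python
# Minimal sketch of the write-ahead rule, not the full ARIES algorithm.
log = []      # stands in for the durable, sequential WAL file
pages = {}    # stands in for data pages on disk: page_id -> (value, lsn)

def wal_write(page_id, value):
    lsn = len(log)                      # log sequence number
    log.append((lsn, page_id, value))   # 1) append the log record FIRST
    return lsn

def flush_page(page_id, value, lsn):
    # WAL rule: the log up to `lsn` must already be durable before this flush.
    pages[page_id] = (value, lsn)

def recover():
    # Redo pass: replay every logged change whose effect is missing on disk.
    for lsn, page_id, value in log:
        on_disk = pages.get(page_id)
        if on_disk is None or on_disk[1] < lsn:
            pages[page_id] = (value, lsn)

# Simulate a crash: the change was logged but its page was never flushed.
wal_write("page_1", "balance=500")
# ... crash before flush_page() ran ...
recover()
assert pages["page_1"] == ("balance=500", 0)
```

Because the log record hit stable storage first, the update survives even though the data page never did.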


LSM-trees were designed for write-heavy workloads on slow disks

The Log-Structured Merge-tree was described by Patrick O'Neil et al. in 1996. LSM-trees buffer writes in memory and periodically flush sorted runs to disk, converting random writes into sequential I/O. Google's LevelDB (2011) and Facebook's RocksDB (2012) popularized LSM-trees, and they now power Cassandra, CockroachDB, TiKV, and dozens of other systems.
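The write path described above can be sketched in miniature (illustrative names; real systems add WALs, bloom filters, and background compaction of the runs):

```python
import bisect

class TinyLSM:
    """Toy LSM: writes land in a memtable; a full memtable is flushed as
    an immutable sorted run. Reads check memtable, then runs newest-first."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []               # list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # random writes absorbed in memory
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Sort once, then write sequentially: the core LSM trade.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)        # memtable full -> flushed as one sorted run
db.put("b", 3)        # newer version of "b" lives in the memtable
assert db.get("a") == 2
assert db.get("b") == 3
```

Compaction, omitted here, is what keeps the number of runs (and therefore read amplification) bounded.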


Buffer pool hit ratios below 99% often indicate serious performance problems

Database buffer pools (in-memory page caches) in well-tuned OLTP systems typically achieve hit ratios of 99% or higher. A buffer pool hit ratio below 99% often means the working set does not fit in memory, causing frequent disk reads. In production InnoDB systems, a drop from 99.9% to 99% can double query latency because the 10x increase in cache misses translates directly to disk I/O.
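The "can double query latency" claim falls out of simple arithmetic. The timing constants below are illustrative assumptions, not measurements:

```python
# Back-of-envelope model for why a 99.9% -> 99% hit-ratio drop roughly
# doubles average page-access latency (assumed timings, not benchmarks).
T_MEM = 0.01   # ms to serve a page from the buffer pool (assumption)
T_DISK = 1.0   # ms to serve a page from disk on a miss (assumption)

def effective_latency(hit_ratio):
    return hit_ratio * T_MEM + (1 - hit_ratio) * T_DISK

fast = effective_latency(0.999)   # ~0.011 ms per page access
slow = effective_latency(0.990)   # ~0.020 ms per page access

# 10x more misses -> close to 2x the average access time with these numbers.
assert 1.7 < slow / fast < 2.0
```

Because the disk term dominates, small changes in the miss rate move the average far more than intuition suggests.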


The query optimizer in a modern database evaluates millions of plans per query

A complex JOIN query across 10 tables has over 17 billion possible join orderings. Modern query optimizers use dynamic programming, cost estimation, and heuristic pruning to search this space efficiently. PostgreSQL's optimizer, for example, switches from exhaustive search to a genetic algorithm (GEQO) when queries involve more than 12 tables because the search space becomes intractable.
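The "over 17 billion" figure is straightforward to verify: the number of binary join trees over n relations (all shapes, all leaf orders) is (2n-2)! / (n-1)!:

```python
from math import factorial

def join_orderings(n):
    """Count of binary join trees over n relations: (2n-2)! / (n-1)!."""
    return factorial(2 * n - 2) // factorial(n - 1)

assert join_orderings(2) == 2                  # A JOIN B, B JOIN A
assert join_orderings(3) == 12
assert join_orderings(10) == 17_643_225_600    # > 17 billion, as stated
assert join_orderings(12) > 10**13             # past PostgreSQL's GEQO cutoff
```

The super-exponential growth is exactly why optimizers fall back to heuristics or, in PostgreSQL's case, a genetic search once the table count passes a threshold.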


Database page sizes have been 4-16 KB since the 1970s

Most databases use 4 KB (SQLite), 8 KB (PostgreSQL, SQL Server), or 16 KB (InnoDB) page sizes — values rooted in the physical characteristics of hard drives from the 1970s-1980s. These sizes persist even though modern SSDs have fundamentally different performance characteristics. Changing the page size in a production database is extremely difficult and rarely done.
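You can inspect the page size directly. The snippet below uses SQLite's real `PRAGMA page_size` interface from Python's standard library; the exact value returned depends on the SQLite build (modern versions default to 4096 bytes):

```python
import sqlite3

# Query SQLite's page size for a fresh in-memory database.
conn = sqlite3.connect(":memory:")
(page_size,) = conn.execute("PRAGMA page_size").fetchone()
conn.close()

# SQLite page sizes are always a power of two between 512 and 65536 bytes.
assert page_size & (page_size - 1) == 0
assert 512 <= page_size <= 65536
```

The same information is exposed as `SHOW block_size` in PostgreSQL and `innodb_page_size` in MySQL.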


Phantom reads were considered acceptable for 30 years

The SQL standard defined four isolation levels in SQL-92: Read Uncommitted, Read Committed, Repeatable Read, and Serializable. For decades, most applications ran at Read Committed (PostgreSQL's default) or Repeatable Read (MySQL/InnoDB's default), levels at which the standard permits phantom reads. It was not until the 2010s that serializable isolation became practical at scale, with innovations like Serializable Snapshot Isolation (SSI) in PostgreSQL 9.1 (2011).
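What a phantom read looks like can be simulated in memory. In this sketch (illustrative data, no real database), each statement in a Read Committed transaction sees the latest committed state, so repeating a range predicate can surface a row inserted by a concurrent transaction:

```python
# In-memory simulation of a phantom read at Read Committed.
committed = [{"id": 1, "amount": 50}, {"id": 2, "amount": 150}]

def range_query(min_amount):
    # Under Read Committed, every statement reads current committed data.
    return [r["id"] for r in committed if r["amount"] >= min_amount]

first_read = range_query(100)               # T1: sees only row 2
committed.append({"id": 3, "amount": 200})  # T2 inserts a match and commits
second_read = range_query(100)              # T1 repeats the same predicate

assert first_read == [2]
assert second_read == [2, 3]                # row 3 is the phantom
```

Serializable isolation forbids this outcome: the two reads must behave as if no concurrent transaction interleaved between them.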


A single index can make a query 10,000x faster

On a 10-million-row table, a query without an appropriate index might perform a full table scan reading all 10 million rows, which at roughly 250 rows per page means about 40,000 page reads. With a B-tree index, the same query touches only 3-4 index levels plus the target row — roughly 4 page reads instead of 40,000. This 10,000x speedup is not an exaggeration; it is basic B-tree mathematics and the primary reason indexes exist.
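The arithmetic behind that claim, with an assumed fanout of 250 entries per page:

```python
from math import ceil, log

ROWS = 10_000_000
FANOUT = 250    # assumed index entries (or rows) per page

tree_height = ceil(log(ROWS, FANOUT))   # levels in the B-tree
indexed_reads = tree_height + 1         # root-to-leaf descent plus the heap row
scan_reads = ROWS // FANOUT             # every page of the table

assert tree_height == 3                 # 250^3 = 15.6M > 10M rows
assert indexed_reads == 4
assert scan_reads == 40_000
assert scan_reads // indexed_reads == 10_000
```

A larger fanout only strengthens the effect: the tree height grows with the logarithm of the row count, while the scan cost grows linearly.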


Checksums in database pages catch corruption that would otherwise go undetected for years

InnoDB stores a checksum in every 16 KB page. When Facebook analyzed their MySQL fleet, they found that approximately 1 in 10 billion page reads encountered a checksum mismatch, indicating silent data corruption from hardware, firmware, or cosmic rays. Without page checksums, this corruption would propagate silently through backups and replicas.
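The mechanism is easy to sketch. InnoDB uses its own checksum algorithms (including a CRC-32C variant); the stand-in below uses `zlib.crc32` to show the write-verify cycle and how a single flipped bit is caught on read:

```python
import zlib

PAGE_SIZE = 16 * 1024   # InnoDB's default page size

def write_page(body: bytes) -> bytes:
    """Prepend a CRC-32 of the page body (stand-in for InnoDB's checksum)."""
    assert len(body) == PAGE_SIZE
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_page(page: bytes) -> bytes:
    """Verify the stored checksum against the body before returning it."""
    stored, body = int.from_bytes(page[:4], "big"), page[4:]
    if zlib.crc32(body) != stored:
        raise IOError("page checksum mismatch: silent corruption detected")
    return body

page = write_page(bytes(PAGE_SIZE))
corrupt = bytearray(page)
corrupt[100] ^= 0x01                  # simulate a one-bit flip in storage

assert read_page(page) == bytes(PAGE_SIZE)   # clean page passes
try:
    read_page(bytes(corrupt))
    detected = False
except IOError:
    detected = True
assert detected                       # the bit flip is caught on read
```

The check costs a few microseconds per page read, which is why every major engine now enables page checksums by default.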