
Database Internals for Ops - Primer

Why This Matters

DevOps owns the database infrastructure — provisioning, backups, failover, monitoring, and scaling. You do not need to write SQL all day, but you absolutely need to understand what is happening inside the engine when a page takes 12 seconds to load at 2 AM and the on-call Slack channel is on fire. Misunderstanding replication lag has caused more split-brain incidents than any network partition. Ignoring index bloat has brought more production databases to their knees than any traffic spike. This primer gives you the internals knowledge to prevent outages, diagnose them fast, and talk to DBAs as a peer.

PostgreSQL is the primary example throughout — it is the most common database in modern DevOps stacks. MySQL differences are noted where they matter.


1. Replication

How It Works

Name origin: The term WAL (Write-Ahead Log) means exactly what it says: writes go to the log ahead of (before) being applied to the actual data files. This guarantees that even if the server crashes mid-write, the WAL can replay the changes on recovery. MySQL calls its equivalent the "binary log" (binlog), but the principle is identical.

Replication copies data from a primary (read-write) to one or more replicas (read-only). In PostgreSQL, this is built on WAL (Write-Ahead Log) shipping. Every change is first written to WAL, then streamed to replicas.

-- Check replication status on the primary
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
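The lag arithmetic can be sketched in Python. This is an illustrative reimplementation of what pg_wal_lsn_diff() computes server-side, assuming the standard "HI/LO" hexadecimal pg_lsn text format:

```python
# Hypothetical helper: compute replication lag in bytes from two
# pg_lsn strings ("HI/LO", both hex), mirroring pg_wal_lsn_diff().

def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)

# Example: replica is 0x1000 bytes behind
print(replication_lag_bytes("0/B3751848", "0/B3750848"))  # 4096
```

Note the high and low halves are separate 32-bit hex words, so "1/0" is one byte past "0/FFFFFFFF".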

Synchronous vs Asynchronous

| Mode | Behavior | Trade-off |
|---|---|---|
| Asynchronous | Primary does not wait for replica acknowledgment | Fast writes, risk of data loss on failover |
| Synchronous | Primary waits for at least one replica to confirm | Slower writes, zero data loss on failover |

-- PostgreSQL: set synchronous replication
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
SELECT pg_reload_conf();

In MySQL, the equivalent is semi-synchronous replication (rpl_semi_sync_master_enabled), which waits for at least one replica to acknowledge receipt of the binary log event before committing.

Replication Lag

Replication lag is the delay between a write on the primary and its availability on a replica. Monitor it constantly:

# PostgreSQL: check lag in seconds (on replica)
psql -c "SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
         THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
         END AS lag_seconds;"

Common causes of lag: replica under-provisioned (CPU/IO), long-running queries on replica, network saturation, large transactions.

Failover

When the primary goes down, a replica must be promoted:

# PostgreSQL 12+: promote a standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Or via SQL
SELECT pg_promote();

Automated failover tools: Patroni (PostgreSQL), Orchestrator (MySQL), pg_auto_failover.

Split-Brain

Split-brain occurs when two nodes both believe they are the primary and accept writes. This causes data divergence that is extremely difficult to resolve.

Fun fact: STONITH stands for "Shoot The Other Node In The Head." Despite the violent name, it is a standard term in high-availability clustering. The idea is that when you cannot confirm a node is down, the safest action is to forcibly power it off (via IPMI, iLO, or cloud API) before promoting a new primary. Better to kill a node that might be alive than risk two primaries writing simultaneously.

Prevention strategies:

- Consensus mechanism: Patroni uses etcd/ZooKeeper/Consul to elect a single leader
- Fencing: STONITH (Shoot The Other Node In The Head) — power off the old primary
- Quorum-based decisions: require majority agreement before promotion
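The quorum rule itself is simple arithmetic: promotion requires a strict majority of the whole cluster, so two partitions can never both win. A minimal sketch with illustrative names:

```python
# Sketch of a quorum decision: promote a replica only if a strict
# majority of all cluster nodes agree the old primary is gone.
# Function and variable names are illustrative, not from any tool.

def has_quorum(votes_for_promotion: int, cluster_size: int) -> bool:
    """Strict majority: more than half of ALL nodes must agree."""
    return votes_for_promotion > cluster_size // 2

# 3-node cluster: 2 votes suffice, 1 does not.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
# 4-node cluster: 2 of 4 is NOT a majority -- this is why even-sized
# clusters are discouraged.
print(has_quorum(2, 4))  # False
```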


2. Transactions

Remember: ACID mnemonic: All or nothing (Atomicity), Consistent state transitions (Consistency), Independent transactions (Isolation), Durable after commit (Durability). If you can remember "ACID = reliable transactions," the individual properties follow logically.

ACID Properties

| Property | Meaning | Ops Impact |
|---|---|---|
| Atomicity | All or nothing — partial commits do not exist | Failed migrations roll back cleanly |
| Consistency | Data moves from one valid state to another | Constraint violations are caught, not silently corrupted |
| Isolation | Concurrent transactions do not interfere | Determines which read anomalies you might see |
| Durability | Committed data survives crashes | WAL + fsync guarantees; disabling fsync is dangerous |
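Atomicity is easy to demonstrate with any transactional engine. A minimal sketch using SQLite as a stand-in for PostgreSQL, where a constraint violation rolls back the entire transaction:

```python
import sqlite3

# Atomicity demo: if any statement in a transaction fails, the whole
# transaction rolls back and no partial state is ever visible.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # the context manager wraps one transaction
        # This update succeeds inside the transaction...
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
        # ...but this one violates the CHECK constraint and fails.
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the context manager rolled back automatically

# Neither update took effect: all or nothing.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100, ), (50, )] -- wait, exact formatting: [(100,), (50,)]
```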

Isolation Levels

From weakest to strongest:

| Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | PostgreSQL Default |
|---|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes | No (treated as Read Committed) |
| Read Committed | No | Yes | Yes | Yes (default) |
| Repeatable Read | No | No | Yes (PG: No) | No |
| Serializable | No | No | No | No |

-- Set isolation level for a transaction
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... queries ...
COMMIT;

-- Check current isolation level
SHOW transaction_isolation;

MySQL note: MySQL's default is REPEATABLE READ, and its implementation uses gap locks to prevent phantom reads — different from PostgreSQL's MVCC-based approach.

Deadlocks

A deadlock occurs when two transactions each hold a lock the other needs. PostgreSQL automatically detects deadlocks and kills one transaction:

# Deadlock events are logged; enable detail in postgresql.conf:
#   log_lock_waits = on
#   deadlock_timeout = 1s
# Then search the PostgreSQL logs:
grep -i "deadlock detected" /var/log/postgresql/postgresql-*.log

Long-Running Transactions

Long transactions hold locks, prevent VACUUM from cleaning up dead tuples (causing MVCC bloat), and block DDL operations. Find and kill them:

-- Find transactions running longer than 5 minutes
SELECT pid, now() - xact_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start < now() - interval '5 minutes'
ORDER BY duration DESC;

-- Terminate a specific backend
SELECT pg_terminate_backend(12345);

Set a statement timeout to prevent runaway queries:

-- Per-session timeout
SET statement_timeout = '30s';

-- Global default (postgresql.conf)
-- statement_timeout = 30000   # milliseconds

3. Indexing

B-Tree Indexes

Under the hood: A B-tree index in PostgreSQL is a balanced tree where each internal node contains keys and pointers to child pages. A typical B-tree with 8KB pages and 8-byte keys can hold ~500 keys per page. A three-level B-tree can index ~125 million rows, meaning most lookups require only 3 disk reads (often cached). This is why index lookups are O(log N) and feel instantaneous even on large tables.
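The arithmetic above is worth internalizing. A quick sketch using the approximate fanout from the text:

```python
# Back-of-envelope B-tree capacity, using the figures from the text:
# ~500 keys per 8KB page. Actual fanout varies with key size.

fanout = 500  # approximate keys per page (assumed value)

# A tree of depth d can address roughly fanout**d leaf entries,
# and a lookup touches one page per level.
for depth in (1, 2, 3):
    print(f"depth {depth}: ~{fanout ** depth:,} indexed rows, {depth} page reads")
```

Depth 3 gives ~125,000,000 rows, which is why even huge tables need only a handful of (usually cached) page reads per lookup.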

B-tree is the default and most common index type. It supports equality and range queries on sortable data:

-- Create a basic index
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Create a composite index (column order matters)
CREATE INDEX idx_orders_cust_date ON orders (customer_id, created_at);

Other index types in PostgreSQL: GIN (full-text, JSONB), GiST (geometric, range), BRIN (large sequential tables), Hash (equality only).

Query Plans with EXPLAIN

EXPLAIN ANALYZE is your most important diagnostic tool. It shows the actual execution plan and timing:

-- Always use ANALYZE for real timings (runs the query)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2025-01-01';

Key things to look for:

- Seq Scan on large tables — missing index
- Nested Loop with high row counts — consider join strategy
- Buffers: shared hit vs read — cache efficiency
- Actual rows vs planned rows — stale statistics (run ANALYZE)

-- Update table statistics for the query planner
ANALYZE orders;

-- Full database statistics refresh
ANALYZE;

Missing Indexes

Find slow queries caused by missing indexes:

-- PostgreSQL: find sequential scans on large tables
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan,
       CASE WHEN seq_scan > 0 THEN seq_tup_read / seq_scan ELSE 0 END AS avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;

Enable pg_stat_statements to find the most expensive queries:

-- Top 10 queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Index Bloat

Indexes accumulate dead tuples over time. Bloated indexes waste disk and slow queries:

-- Check index sizes vs table sizes
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

Fix bloat with REINDEX (locks the table) or CREATE INDEX CONCURRENTLY + swap:

-- Online reindex (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer;

Covering Indexes

A covering index includes all columns needed by a query, avoiding a table lookup entirely (index-only scan):

-- Include non-key columns for covering
CREATE INDEX idx_orders_cover ON orders (customer_id) INCLUDE (total, status);

-- This query can be satisfied from the index alone
SELECT total, status FROM orders WHERE customer_id = 42;

MySQL note: MySQL InnoDB has implicit covering via the clustered index (primary key). Use EXPLAIN and look for Using index in the Extra column.


4. Locking

Lock Types

PostgreSQL uses a multi-level locking system:

| Lock Level | Conflicts With | Typical Cause |
|---|---|---|
| RowShareLock | Exclusive, AccessExclusive | SELECT FOR UPDATE |
| RowExclusiveLock | Share, Exclusive, AccessExclusive | INSERT, UPDATE, DELETE |
| ShareLock | RowExclusive, Exclusive, AccessExclusive | CREATE INDEX (non-concurrent) |
| AccessExclusiveLock | Everything | ALTER TABLE, DROP TABLE, TRUNCATE |

Viewing Current Locks

-- Active locks with blocking information
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Lock Wait Timeout

Prevent queries from waiting indefinitely for locks:

-- Set lock timeout (fail fast instead of hanging)
SET lock_timeout = '5s';

-- Useful for migrations: fail immediately if table is locked
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN new_col TEXT;

Deadlock Detection

PostgreSQL checks for a deadlock after a backend has waited on a lock for deadlock_timeout (default 1s). When a cycle is found, it aborts one of the transactions involved (the one that ran the detection, not necessarily the youngest):

-- Monitor lock waits in real time
SELECT * FROM pg_locks WHERE NOT granted;

-- pg_stat_activity combined with lock info
SELECT a.pid, a.query, l.mode, l.granted, a.wait_event_type, a.wait_event
FROM pg_stat_activity a
JOIN pg_locks l ON a.pid = l.pid
WHERE a.state = 'active'
ORDER BY a.pid;
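The detection itself boils down to finding a cycle in the wait-for graph. A toy sketch of the idea (illustrative only, not PostgreSQL's actual algorithm):

```python
# Deadlock detection sketch: build a wait-for graph (blocked pid ->
# blocking pid) and look for a cycle. A cycle means every member is
# waiting on another member, so none can ever proceed.

def has_deadlock(waits_for: dict[int, int]) -> bool:
    """waits_for maps a blocked pid to the pid holding the lock."""
    for start in waits_for:
        seen = set()
        pid = start
        while pid in waits_for:      # follow the chain of waiters
            if pid in seen:
                return True          # revisited a pid: cycle found
            seen.add(pid)
            pid = waits_for[pid]
    return False

print(has_deadlock({101: 102, 102: 101}))  # True: mutual wait
print(has_deadlock({101: 102, 102: 103}))  # False: a chain, not a cycle
```

Real engines track one waited-on lock per backend, which is why the graph here maps each pid to a single blocker.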

Optimistic vs Pessimistic Locking

| Strategy | How It Works | Best For |
|---|---|---|
| Pessimistic | Lock rows before modifying (SELECT FOR UPDATE) | High contention, short transactions |
| Optimistic | Check version/timestamp at commit; retry on conflict | Low contention, longer transactions |

-- Pessimistic: lock the row
BEGIN;
SELECT * FROM inventory WHERE product_id = 1 FOR UPDATE;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1;
COMMIT;

-- Optimistic: version-based (application pattern)
-- UPDATE inventory SET quantity = quantity - 1, version = version + 1
-- WHERE product_id = 1 AND version = 5;
-- If 0 rows affected → conflict, retry
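The optimistic pattern can be exercised end to end with SQLite as a stand-in engine. The helper below is illustrative: it uses the UPDATE's rowcount to detect a lost race and retries.

```python
import sqlite3

# Optimistic locking sketch: the UPDATE only succeeds if the version
# we read is still current; zero rows affected means another writer
# won the race and we retry with a fresh read.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inventory (
    product_id INTEGER PRIMARY KEY, quantity INTEGER, version INTEGER)""")
conn.execute("INSERT INTO inventory VALUES (1, 10, 5)")
conn.commit()

def decrement(conn, product_id: int, retries: int = 3) -> bool:
    for _ in range(retries):
        qty, ver = conn.execute(
            "SELECT quantity, version FROM inventory WHERE product_id = ?",
            (product_id,)).fetchone()
        cur = conn.execute(
            "UPDATE inventory SET quantity = ?, version = ? "
            "WHERE product_id = ? AND version = ?",
            (qty - 1, ver + 1, product_id, ver))
        conn.commit()
        if cur.rowcount == 1:
            return True   # our version matched: the write won
    return False          # lost the race on every attempt

print(decrement(conn, 1))  # True
print(conn.execute("SELECT quantity, version FROM inventory").fetchone())  # (9, 6)
```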

MySQL note: MySQL InnoDB uses innodb_lock_wait_timeout (default 50s) instead of PostgreSQL's lock_timeout. Both engines support SELECT ... FOR UPDATE with NOWAIT and SKIP LOCKED (MySQL 8.0+, PostgreSQL 9.5+).


5. Read Replicas

Read Scaling Architecture

Read replicas serve read traffic to offload the primary. A typical setup:

                  ┌─────────┐
  Writes ────────→│ Primary │
                  └────┬────┘
                       │ WAL streaming
              ┌────────┼────────┐
              ▼        ▼        ▼
         ┌────────┐┌────────┐┌────────┐
         │Replica1││Replica2││Replica3│
         └────────┘└────────┘└────────┘
              ▲        ▲        ▲
              └────────┼────────┘
                  Reads (load balanced)

Connection Routing

Application-level routing directs writes to primary and reads to replicas:

# Example: SQLAlchemy routing (simplified)
# Writes go to the primary, reads go to the replica pool
from sqlalchemy import create_engine

engine_primary = create_engine("postgresql://primary:5432/mydb")
engine_replica = create_engine("postgresql://replica:5432/mydb")

PostgreSQL tools for connection routing:

- PgBouncer: connection pooler (does not route by query type)
- Pgpool-II: query-aware load balancer (routes SELECT to replicas)
- HAProxy: TCP-level balancing with health checks
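A deliberately naive sketch of what a query-aware router like Pgpool-II does; a real router must also consider transaction state, writable CTEs, and functions with side effects:

```python
# Toy query-type router: send read-only statements to a replica,
# everything else to the primary. Deliberately naive -- real routers
# also track transactions and locking clauses far more carefully.

def route(sql: str) -> str:
    stmt = sql.lstrip().split(None, 1)[0].upper()
    # SELECT ... FOR UPDATE takes row locks, so it must hit the primary.
    if stmt == "SELECT" and "FOR UPDATE" not in sql.upper():
        return "replica"
    return "primary"

print(route("SELECT * FROM orders"))                  # replica
print(route("UPDATE orders SET status = 'shipped'"))  # primary
print(route("SELECT * FROM inventory FOR UPDATE"))    # primary
```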

# HAProxy example health check for PostgreSQL
# backend pg_replicas
#   option httpchk GET /health
#   server replica1 10.0.1.2:5432 check port 8008
#   server replica2 10.0.1.3:5432 check port 8008

Replica Lag Impact

Reads from replicas may return stale data. This is eventual consistency:

-- On the primary: check lag per replica
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

Scenarios where lag matters:

- Read-after-write: user creates a record, then immediately reads it from a replica — gets a 404
- Reporting: stale data is usually acceptable (seconds-old is fine)
- Authentication: session/token lookups must read from the primary or use synchronous replicas
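A common mitigation for the read-after-write case is session pinning: after a write, route that session's reads to the primary for a window longer than the worst expected lag. A minimal sketch, where the 2-second window is an assumed value, not a recommendation:

```python
import time

# Read-your-own-writes sketch: remember when each session last wrote,
# and pin its reads to the primary until the pin window expires.
PIN_SECONDS = 2.0                       # assumed worst-case replica lag
last_write_at: dict[str, float] = {}    # session id -> last write time

def record_write(session: str) -> None:
    last_write_at[session] = time.monotonic()

def choose_target(session: str) -> str:
    wrote = last_write_at.get(session)
    if wrote is not None and time.monotonic() - wrote < PIN_SECONDS:
        return "primary"   # recent write: a replica might not have it yet
    return "replica"

record_write("alice")
print(choose_target("alice"))  # primary (just wrote)
print(choose_target("bob"))    # replica (no recent write)
```

A production version would pin on the write's LSN and compare it against each replica's replay position rather than using wall-clock time.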

Consistency Trade-offs

| Pattern | Consistency | Latency | Use Case |
|---|---|---|---|
| All reads from primary | Strong | Higher | Auth, payments, inventory |
| Reads from sync replica | Strong | Medium | Critical reads needing HA |
| Reads from async replica | Eventual | Lower | Dashboards, search, reports |
| Read-your-own-writes routing | Session-level | Medium | User-facing CRUD apps |

PostgreSQL supports synchronous_commit levels per-transaction:

-- This transaction is safe to read from sync replica immediately
SET synchronous_commit = 'remote_apply';
INSERT INTO orders (...) VALUES (...);

-- This transaction may have replica lag (but faster)
SET synchronous_commit = 'local';
INSERT INTO analytics_events (...) VALUES (...);

Monitoring Read Replicas

Essential metrics to track:

-- Replica: am I in recovery mode?
SELECT pg_is_in_recovery();

-- Replica: how far behind am I?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

-- Primary: are all replicas connected?
SELECT count(*) FROM pg_stat_replication;

Alerting thresholds (typical):

- Warning: replication lag > 5 seconds
- Critical: replication lag > 30 seconds or replica disconnected
- Page: replica count drops below minimum (e.g., < 2 of 3)

# Prometheus query for replication lag alerting
# pg_replication_lag_seconds > 5 for 2m → warning
# pg_replication_lag_seconds > 30 for 1m → critical
# pg_stat_replication_count < 2 → page
