---
tags:
  - devops
  - l1
  - topic-pack
  - database-internals
---

Portal | Level: L1: Foundations | Topics: Database Internals | Domain: DevOps & Tooling
# Database Internals for Ops - Primer

## Why This Matters
DevOps owns the database infrastructure — provisioning, backups, failover, monitoring, and scaling. You do not need to write SQL all day, but you absolutely need to understand what is happening inside the engine when a page takes 12 seconds to load at 2 AM and the on-call Slack channel is on fire. Misunderstanding replication lag has caused more split-brain incidents than any network partition. Ignoring index bloat has brought more production databases to their knees than any traffic spike. This primer gives you the internals knowledge to prevent outages, diagnose them fast, and talk to DBAs as a peer.
PostgreSQL is the primary example throughout — it is the most common database in modern DevOps stacks. MySQL differences are noted where they matter.
## 1. Replication

### How It Works
Name origin: The term WAL (Write-Ahead Log) means exactly what it says: writes go to the log ahead of (before) being applied to the actual data files. This guarantees that even if the server crashes mid-write, the WAL can replay the changes on recovery. MySQL calls its equivalent the "binary log" (binlog), but the principle is identical.
Replication copies data from a primary (read-write) to one or more replicas (read-only). In PostgreSQL, this is built on WAL (Write-Ahead Log) shipping. Every change is first written to WAL, then streamed to replicas.
```sql
-- Check replication status on the primary
-- (direct pg_lsn subtraction needs PostgreSQL 14+;
--  on older versions use pg_wal_lsn_diff(sent_lsn, replay_lsn))
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
```
### Synchronous vs Asynchronous
| Mode | Behavior | Trade-off |
|---|---|---|
| Asynchronous | Primary does not wait for replica acknowledgment | Fast writes, risk of data loss on failover |
| Synchronous | Primary waits for at least one replica to confirm | Slower writes, no loss of acknowledged commits on failover |
```sql
-- PostgreSQL: set synchronous replication
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
SELECT pg_reload_conf();
```
In MySQL, the equivalent is semi-synchronous replication (rpl_semi_sync_master_enabled), which waits for at least one replica to acknowledge receipt of the binary log event before committing.
### Replication Lag
Replication lag is the delay between a write on the primary and its availability on a replica. Monitor it constantly:
```bash
# PostgreSQL: check lag in seconds (on replica)
psql -c "SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
         THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
         END AS lag_seconds;"
```
Common causes of lag: replica under-provisioned (CPU/IO), long-running queries on replica, network saturation, large transactions.
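The replica-side check above boils down to a two-case rule: caught up means zero lag, otherwise lag is the time since the last replayed transaction. A minimal Python mirror of that logic (an illustrative sketch, not a PostgreSQL API):

```python
from datetime import datetime

def lag_seconds(receive_lsn: str, replay_lsn: str,
                last_replay_ts: datetime, now: datetime) -> float:
    """Mirror of the replica-side lag query: zero when everything
    received has been replayed, otherwise time since the last
    replayed transaction."""
    if receive_lsn == replay_lsn:
        return 0.0
    return (now - last_replay_ts).total_seconds()

# Caught up: received and replayed LSNs match
print(lag_seconds("0/3000060", "0/3000060",
                  datetime(2025, 1, 1, 12, 0, 0),
                  datetime(2025, 1, 1, 12, 0, 9)))  # → 0.0

# Behind: last replayed transaction was 9 seconds ago
print(lag_seconds("0/3000060", "0/3000028",
                  datetime(2025, 1, 1, 12, 0, 0),
                  datetime(2025, 1, 1, 12, 0, 9)))  # → 9.0
```

Note the edge case this exposes: on an idle primary, pg_last_xact_replay_timestamp() stops advancing, which is why the SQL short-circuits to 0 when the LSNs match.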
### Failover
When the primary goes down, a replica must be promoted:
```bash
# PostgreSQL 12+: promote a standby to primary
pg_ctl promote -D /var/lib/postgresql/data
```

```sql
-- Or via SQL
SELECT pg_promote();
```
Automated failover tools: Patroni (PostgreSQL), Orchestrator (MySQL), pg_auto_failover.
### Split-Brain
Split-brain occurs when two nodes both believe they are the primary and accept writes. This causes data divergence that is extremely difficult to resolve.
Fun fact: STONITH stands for "Shoot The Other Node In The Head." Despite the violent name, it is a standard term in high-availability clustering. The idea is that when you cannot confirm a node is down, the safest action is to forcibly power it off (via IPMI, iLO, or cloud API) before promoting a new primary. Better to kill a node that might be alive than risk two primaries writing simultaneously.
Prevention strategies:

- Use a consensus mechanism (Patroni uses etcd/ZooKeeper/Consul)
- Fencing: STONITH (Shoot The Other Node In The Head) — power off the old primary
- Quorum-based decisions: require majority agreement before promotion
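The quorum rule is plain strict-majority arithmetic. A toy sketch (not Patroni's actual implementation) showing why a 2-of-4 split cannot promote:

```python
def has_quorum(acks: int, cluster_size: int) -> bool:
    """Promotion is safe only with a strict majority: strictly more
    than half of all nodes, counting the candidate itself."""
    return acks > cluster_size // 2

# 3-node cluster: 2 votes is a majority, 1 is not
print(has_quorum(2, 3))  # → True
print(has_quorum(1, 3))  # → False
# Even-sized clusters need MORE than half, so a 2-2 split cannot
# promote on either side — this is what prevents split-brain
print(has_quorum(2, 4))  # → False
```

This is also why production clusters use odd node counts: a 4-node cluster tolerates no more failures than a 3-node one.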
## 2. Transactions
Remember: ACID mnemonic: All or nothing (Atomicity), Consistent state transitions (Consistency), Independent transactions (Isolation), Durable after commit (Durability). If you can remember "ACID = reliable transactions," the individual properties follow logically.
### ACID Properties
| Property | Meaning | Ops Impact |
|---|---|---|
| Atomicity | All or nothing — partial commits do not exist | Failed migrations roll back cleanly |
| Consistency | Data moves from one valid state to another | Constraint violations are caught, not silently corrupted |
| Isolation | Concurrent transactions do not interfere | Determines which read anomalies are possible |
| Durability | Committed data survives crashes | WAL + fsync guarantees; disabling fsync is dangerous |
### Isolation Levels
From weakest to strongest:
| Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | PostgreSQL Default |
|---|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes | No (treated as Read Committed) |
| Read Committed | No | Yes | Yes | Yes (default) |
| Repeatable Read | No | No | Yes (PG: No) | No |
| Serializable | No | No | No | No |
```sql
-- Set isolation level for a transaction
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... queries ...
COMMIT;

-- Check current isolation level
SHOW transaction_isolation;
```
MySQL note: MySQL's default is REPEATABLE READ, and its implementation uses gap locks to prevent phantom reads — different from PostgreSQL's MVCC-based approach.
### Deadlocks

A deadlock occurs when two transactions each hold a lock the other needs. PostgreSQL detects deadlocks automatically and aborts one of the transactions so the other can proceed:
```bash
# Search PostgreSQL logs for deadlocks
grep -i "deadlock detected" /var/log/postgresql/postgresql-*.log
```
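Because the engine resolves a deadlock by aborting one victim, applications should treat the error as retryable. A hedged sketch of the retry pattern (DeadlockDetected is a stand-in for your driver's error, e.g. SQLSTATE 40P01 in psycopg):

```python
import random
import time

class DeadlockDetected(Exception):
    """Stand-in for the driver exception raised when PostgreSQL
    aborts a deadlock victim (SQLSTATE 40P01 in psycopg)."""

def run_with_retry(txn, max_attempts=3, base_delay=0.01):
    """Re-run a transaction function when it is chosen as the
    deadlock victim, with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Simulate a transaction that is a deadlock victim once, then succeeds
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise DeadlockDetected()
    return "committed"

print(run_with_retry(flaky_txn))  # → committed
```

The jitter matters: if both retrying transactions back off by the same fixed delay, they can deadlock again in lockstep.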
### Long-Running Transactions

Long transactions hold locks, prevent vacuum from removing dead tuples (causing MVCC bloat), and block DDL operations. Find and kill them:
```sql
-- Find transactions running longer than 5 minutes
SELECT pid, now() - xact_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start < now() - interval '5 minutes'
ORDER BY duration DESC;

-- Terminate a specific backend
SELECT pg_terminate_backend(12345);
```
Set a statement timeout to prevent runaway queries:
```sql
-- Per-session timeout
SET statement_timeout = '30s';

-- Global default (postgresql.conf)
-- statement_timeout = 30000   # milliseconds
```
## 3. Indexing

### B-Tree Indexes
Under the hood: A B-tree index in PostgreSQL is a balanced tree where each internal node contains keys and pointers to child pages. A typical B-tree with 8KB pages and 8-byte keys can hold ~500 keys per page. A three-level B-tree can index ~125 million rows, meaning most lookups require only 3 disk reads (often cached). This is why index lookups are O(log N) and feel instantaneous even on large tables.
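The capacity arithmetic in the callout can be checked directly: with ~500 keys per page, the number of page reads grows with the log of the row count. A small sketch (a fanout of 500 is the callout's assumption, not a fixed constant):

```python
def btree_levels(rows: int, fanout: int = 500) -> int:
    """Number of levels (page reads) needed to index `rows` entries
    when each page holds ~`fanout` keys. Uses exact integer
    arithmetic instead of floating-point logs."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels

# ~500 keys/page: three levels cover 500^3 = 125 million rows
print(500 ** 3)                     # → 125000000
print(btree_levels(125_000_000))    # → 3
# Even a billion rows needs only one more level
print(btree_levels(1_000_000_000))  # → 4
```

This logarithmic growth is why adding rows barely slows down index lookups: each extra level multiplies capacity by the fanout.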
B-tree is the default and most common index type. It supports equality and range queries on sortable data:
```sql
-- Create a basic index
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Create a composite index (column order matters)
CREATE INDEX idx_orders_cust_date ON orders (customer_id, created_at);
```
Other index types in PostgreSQL: GIN (full-text, JSONB), GiST (geometric, range), BRIN (large sequential tables), Hash (equality only).
### Query Plans with EXPLAIN
EXPLAIN ANALYZE is your most important diagnostic tool. It shows the actual execution plan and timing:
```sql
-- Always use ANALYZE for real timings (runs the query)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2025-01-01';
```
Key things to look for:
- Seq Scan on large tables — missing index
- Nested Loop with high row counts — consider join strategy
- Buffers: shared hit vs read — cache efficiency
- Actual rows vs planned rows — stale statistics (run ANALYZE)
```sql
-- Update table statistics for the query planner
ANALYZE orders;

-- Full database statistics refresh
ANALYZE;
```
### Missing Indexes
Find slow queries caused by missing indexes:
```sql
-- PostgreSQL: find sequential scans on large tables
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan,
       CASE WHEN seq_scan > 0 THEN seq_tup_read / seq_scan ELSE 0 END AS avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;
```
Enable pg_stat_statements to find the most expensive queries:
```sql
-- Top 10 queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```
### Index Bloat
Indexes accumulate dead tuples over time. Bloated indexes waste disk and slow queries:
```sql
-- Check index sizes vs table sizes
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
```
Fix bloat with REINDEX (which blocks writes while it runs), or on PostgreSQL 12+ with REINDEX INDEX CONCURRENTLY; on older versions, build a replacement with CREATE INDEX CONCURRENTLY and swap it in.
### Covering Indexes
A covering index includes all columns needed by a query, avoiding a table lookup entirely (index-only scan):
```sql
-- Include non-key columns for covering
CREATE INDEX idx_orders_cover ON orders (customer_id) INCLUDE (total, status);

-- This query can be satisfied from the index alone
SELECT total, status FROM orders WHERE customer_id = 42;
```
MySQL note: InnoDB secondary indexes implicitly include the primary-key columns (the clustered index key), so queries touching only the indexed columns plus the primary key are covered automatically. Use EXPLAIN and look for Using index in the Extra column.
## 4. Locking

### Lock Types
PostgreSQL uses a multi-level locking system:
| Lock Level | Conflicts With | Typical Cause |
|---|---|---|
| RowShareLock | Exclusive | SELECT FOR UPDATE |
| RowExclusiveLock | Share, Exclusive | INSERT, UPDATE, DELETE |
| ShareLock | RowExclusive, Exclusive | CREATE INDEX (non-concurrent) |
| AccessExclusiveLock | Everything | ALTER TABLE, DROP TABLE, TRUNCATE |
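The conflict column can be expressed as a lookup table. A simplified Python sketch matching the four rows above (not the full eight-level PostgreSQL conflict matrix):

```python
# Simplified conflict map matching the table above; the real
# PostgreSQL matrix has eight lock levels. AccessExclusive
# conflicts with everything.
CONFLICTS = {
    "RowShare": {"Exclusive", "AccessExclusive"},
    "RowExclusive": {"Share", "Exclusive", "AccessExclusive"},
    "Share": {"RowExclusive", "Exclusive", "AccessExclusive"},
    "AccessExclusive": {"RowShare", "RowExclusive", "Share",
                        "Exclusive", "AccessExclusive"},
}

def conflicts(held: str, requested: str) -> bool:
    """True if a requested lock must wait behind a held lock."""
    return requested in CONFLICTS.get(held, set())

# An ALTER TABLE (AccessExclusive) waits behind any UPDATE (RowExclusive)
print(conflicts("RowExclusive", "AccessExclusive"))  # → True
# Two concurrent writers coexist at the table level
print(conflicts("RowExclusive", "RowExclusive"))     # → False
```

This is why a single slow UPDATE can stall a migration: the ALTER TABLE queues behind it, and everything else queues behind the ALTER.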
### Viewing Current Locks
```sql
-- Active locks with blocking information
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
### Lock Wait Timeout
Prevent queries from waiting indefinitely for locks:
```sql
-- Set lock timeout (fail fast instead of hanging)
SET lock_timeout = '5s';

-- Useful for migrations: fail immediately if the table is locked
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN new_col TEXT;
```
### Deadlock Detection

PostgreSQL checks for deadlocks after a lock wait exceeds deadlock_timeout (default 1s). When it finds a cycle, it aborts one of the waiting transactions (the one whose wait triggered the check) with a "deadlock detected" error:
```sql
-- Monitor lock waits in real time
SELECT * FROM pg_locks WHERE NOT granted;

-- pg_stat_activity combined with lock info
SELECT a.pid, a.query, l.mode, l.granted, a.wait_event_type, a.wait_event
FROM pg_stat_activity a
JOIN pg_locks l ON a.pid = l.pid
WHERE a.state = 'active'
ORDER BY a.pid;
```
### Optimistic vs Pessimistic Locking
| Strategy | How It Works | Best For |
|---|---|---|
| Pessimistic | Lock rows before modifying (SELECT FOR UPDATE) | High contention, short transactions |
| Optimistic | Check version/timestamp at commit; retry on conflict | Low contention, longer transactions |
```sql
-- Pessimistic: lock the row
BEGIN;
SELECT * FROM inventory WHERE product_id = 1 FOR UPDATE;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1;
COMMIT;

-- Optimistic: version-based (application pattern)
-- UPDATE inventory SET quantity = quantity - 1, version = version + 1
-- WHERE product_id = 1 AND version = 5;
-- If 0 rows affected → conflict, retry
```
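The commented optimistic UPDATE can be modeled in a few lines: read the version, then write only if it has not moved, retrying otherwise. A toy in-memory sketch (the real pattern issues the conditional write as a single UPDATE statement and checks the affected-row count):

```python
# Toy in-memory model of the version-based UPDATE above:
# "UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?"
inventory = {1: {"quantity": 10, "version": 5}}

def optimistic_decrement(product_id: int, max_retries: int = 3) -> bool:
    """Read, then conditionally write; retry if the version moved."""
    for _ in range(max_retries):
        row = inventory[product_id]
        seen_version = row["version"]
        # ... time passes; another writer may commit in between ...
        if row["version"] == seen_version:  # plays the role of the WHERE clause
            row["quantity"] -= 1
            row["version"] += 1
            return True
        # version moved: someone else committed first, loop and retry
    return False

print(optimistic_decrement(1))  # → True
print(inventory[1])             # → {'quantity': 9, 'version': 6}
```

No lock is ever held between read and write, which is exactly why this suits longer-running work: a conflict costs a retry, not a blocked peer.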
MySQL note: MySQL InnoDB uses innodb_lock_wait_timeout (default 50s) where PostgreSQL uses lock_timeout. Both databases support SELECT ... FOR UPDATE NOWAIT and SKIP LOCKED (MySQL 8.0+, PostgreSQL 9.5+).
## 5. Read Replicas

### Read Scaling Architecture
Read replicas serve read traffic to offload the primary. A typical setup:
```text
              ┌─────────┐
 Writes ─────→│ Primary │
              └────┬────┘
                   │ WAL streaming
          ┌────────┼────────┐
          ▼        ▼        ▼
     ┌────────┐┌────────┐┌────────┐
     │Replica1││Replica2││Replica3│
     └────────┘└────────┘└────────┘
          ▲        ▲        ▲
          └────────┼────────┘
           Reads (load balanced)
```
### Connection Routing
Application-level routing directs writes to primary and reads to replicas:
```python
# Example: SQLAlchemy routing (simplified)
# Writes go to the primary engine, reads to the replica pool
from sqlalchemy import create_engine

engine_primary = create_engine("postgresql://primary:5432/mydb")
engine_replica = create_engine("postgresql://replica:5432/mydb")
```
PostgreSQL tools for connection routing:

- PgBouncer: connection pooler (does not route by query type)
- Pgpool-II: query-aware load balancer (routes SELECT to replicas)
- HAProxy: TCP-level balancing with health checks
```text
# HAProxy example health check for PostgreSQL
backend pg_replicas
    option httpchk GET /health
    server replica1 10.0.1.2:5432 check port 8008
    server replica2 10.0.1.3:5432 check port 8008
```
### Replica Lag Impact
Reads from replicas may return stale data. This is eventual consistency:
```sql
-- On the primary: check lag per replica
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```
Scenarios where lag matters:

- Read-after-write: user creates a record, then immediately reads it from a replica — gets 404
- Reporting: stale data is usually acceptable (seconds-old is fine)
- Authentication: session/token lookups must read from primary or use synchronous replicas
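One common fix for the read-after-write case is session stickiness: route a session's reads to the primary for a short window after any write. An illustrative sketch of the pattern (not a specific library's API; the 5-second window is an assumption to tune against your typical lag):

```python
import time

class ReadRouter:
    """Route a session's reads to the primary for a short window
    after any write, so users see their own changes despite
    replica lag. Illustrative pattern, not a real library."""

    def __init__(self, stickiness_seconds=5.0, clock=time.monotonic):
        self.stickiness = stickiness_seconds
        self.clock = clock
        self.last_write = {}  # session_id -> timestamp of last write

    def record_write(self, session_id):
        self.last_write[session_id] = self.clock()

    def target_for_read(self, session_id):
        wrote_at = self.last_write.get(session_id)
        if wrote_at is not None and self.clock() - wrote_at < self.stickiness:
            return "primary"
        return "replica"

# Fake clock so the example is deterministic
t = [0.0]
router = ReadRouter(stickiness_seconds=5.0, clock=lambda: t[0])
router.record_write("alice")
print(router.target_for_read("alice"))  # → primary (just wrote)
t[0] = 10.0
print(router.target_for_read("alice"))  # → replica (window expired)
print(router.target_for_read("bob"))    # → replica (never wrote)
```

The trade-off: heavy writers get routed to the primary often, so stickiness only scales reads for mostly-read sessions.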
### Consistency Trade-offs
| Pattern | Consistency | Latency | Use Case |
|---|---|---|---|
| All reads from primary | Strong | Higher | Auth, payments, inventory |
| Reads from sync replica | Strong | Medium | Critical reads needing HA |
| Reads from async replica | Eventual | Lower | Dashboards, search, reports |
| Read-your-own-writes routing | Session-level | Medium | User-facing CRUD apps |
PostgreSQL lets you set synchronous_commit levels per transaction:
```sql
-- This transaction is safe to read from a sync replica immediately
SET synchronous_commit = 'remote_apply';
INSERT INTO orders (...) VALUES (...);

-- This transaction may have replica lag (but commits faster)
SET synchronous_commit = 'local';
INSERT INTO analytics_events (...) VALUES (...);
```
### Monitoring Read Replicas
Essential metrics to track:
```sql
-- Replica: am I in recovery mode?
SELECT pg_is_in_recovery();

-- Replica: how far behind am I?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

-- Primary: are all replicas connected?
SELECT count(*) FROM pg_stat_replication;
```
Alerting thresholds (typical):

- Warning: replication lag > 5 seconds
- Critical: replication lag > 30 seconds or replica disconnected
- Page: replica count drops below minimum (e.g., < 2 of 3)
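The thresholds above map directly to a severity function, which is handy for testing alert rules before wiring them into Prometheus. A sketch using the text's example numbers (tune them for your SLOs):

```python
def lag_severity(lag_seconds, replica_connected=True):
    """Map replication lag to an alert level using the typical
    thresholds from the text: >5s warning, >30s or disconnected
    critical. Numbers are examples, not universal defaults."""
    if not replica_connected or lag_seconds > 30:
        return "critical"
    if lag_seconds > 5:
        return "warning"
    return "ok"

print(lag_severity(2))                           # → ok
print(lag_severity(12))                          # → warning
print(lag_severity(45))                          # → critical
print(lag_severity(0, replica_connected=False))  # → critical
```

Note that a disconnected replica reports no lag metric at all, which is why the connected check must come first rather than relying on lag alone.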
```text
# Prometheus queries for replication lag alerting
# pg_replication_lag_seconds > 5 for 2m  → warning
# pg_replication_lag_seconds > 30 for 1m → critical
# pg_stat_replication_count < 2          → page
```
## Wiki Navigation

### Related Content

- Database Internals Flashcards (CLI) (flashcard_deck, L1) — Database Internals