
The Database That Wouldn't Start

  • lesson
  • postgresql
  • mysql
  • wal
  • lock-files
  • data-directory-permissions
  • backup
  • recovery
  • l2

Topics: PostgreSQL, MySQL, WAL, lock files, data directory permissions, backup, recovery
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic SQL familiarity helpful but not required


The Mission

$ systemctl start postgresql
Job for postgresql.service failed because the service did not take the steps required by its unit configuration.

The database won't start. It was fine yesterday. Nobody (allegedly) changed anything. The application is returning 500 errors, and the on-call Slack channel is getting noisy.

Database startup failures are uniquely stressful because data might be at risk. Every minute of downtime is also a minute of potential data loss if you don't handle recovery correctly. This lesson covers the most common reasons databases refuse to start and how to fix each one without making things worse.


Rule Zero: Don't Panic, Don't Guess

Before you touch anything:

# 1. Read the actual error
journalctl -u postgresql -n 100
# or
tail -100 /var/log/postgresql/postgresql-16-main.log

# 2. Check the data directory
ls -la /var/lib/postgresql/16/main/

# 3. Check disk space
df -h /var/lib/postgresql/
df -i /var/lib/postgresql/

# 4. Check if something else is using the port
ss -tlnp | grep 5432

Mental Model: A database that won't start is a locked safe. You need to figure out which lock is jammed before you force anything. Forcing the wrong lock destroys data.


Cause 1: Stale Lock File (PID File)

The most common cause. PostgreSQL writes its PID to postmaster.pid on startup and removes it on clean shutdown. If the process crashes (power loss, OOM kill, kill -9), the file stays behind.

cat /var/lib/postgresql/16/main/postmaster.pid
# → 12345          ← PID from the old process
# → /var/lib/postgresql/16/main
# → 1711108800     ← epoch timestamp of start
# → 5432           ← port

# Is that PID still alive?
ps -p 12345
# → (nothing) — the old process is dead, but the PID file remains

Fix:

# Verify the process is truly dead
ps aux | grep postgres
# If nothing — safe to remove

rm /var/lib/postgresql/16/main/postmaster.pid
systemctl start postgresql

MySQL has a similar problem with mysqld.pid and the socket file:

# MySQL: stale socket
ls -la /var/run/mysqld/mysqld.sock
# → srwxrwxrwx 1 mysql mysql 0 Mar 22 ... /var/run/mysqld/mysqld.sock

rm /var/run/mysqld/mysqld.sock
systemctl start mysql
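
The PID-file half of the MySQL check mirrors the PostgreSQL procedure. A sketch assuming the common Debian/Ubuntu path:

# MySQL: stale PID file
ps -p "$(cat /var/run/mysqld/mysqld.pid)"
# → (nothing) means the recorded process is dead; safe to remove
rm /var/run/mysqld/mysqld.pid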

Gotcha: Don't remove the PID file if the process IS still running. If PostgreSQL is actually alive (maybe you started it manually and forgot), removing the PID file lets a second instance start. Two PostgreSQL instances writing to the same data directory = data corruption.
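
To make the check-then-remove sequence harder to get wrong, script it. A minimal sketch, assuming the Debian-style data directory used above:

# Pre-flight check: remove postmaster.pid only if its PID is dead
PGDATA=/var/lib/postgresql/16/main
PID=$(head -1 "$PGDATA/postmaster.pid")    # first line of the file is the PID
if ps -p "$PID" > /dev/null 2>&1; then
    echo "PID $PID is still running; do NOT remove postmaster.pid"
else
    echo "PID $PID is dead; removing stale postmaster.pid"
    rm "$PGDATA/postmaster.pid"
fi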


Cause 2: Data Directory Permissions

Databases are paranoid about file permissions — for good reason.

# PostgreSQL requires the data directory to be owned by postgres, mode 700
ls -ld /var/lib/postgresql/16/main/
# → drwx------ 19 postgres postgres 4096 Mar 22 ... /var/lib/postgresql/16/main/

# If someone ran chmod or chown on the wrong directory:
# → drwxr-xr-x 19 root root ...   ← WRONG owner, WRONG permissions

PostgreSQL refuses to start if:

  • The data directory isn't owned by the postgres user
  • The data directory permissions are more open than 0700
  • Files inside have the wrong ownership

# Fix
sudo chown -R postgres:postgres /var/lib/postgresql/16/main/
sudo chmod 700 /var/lib/postgresql/16/main/
systemctl start postgresql
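
To verify before restarting, GNU stat can print the owner and octal mode in one line:

stat -c '%U:%G %a %n' /var/lib/postgresql/16/main/
# → postgres:postgres 700 /var/lib/postgresql/16/main/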

MySQL is similarly strict:

sudo chown -R mysql:mysql /var/lib/mysql/
sudo chmod 750 /var/lib/mysql/

War Story: An engineer ran chown -R deploy:deploy /var/lib/ to fix a permission issue on their app's data directory. The -R was recursive. It changed ownership of /var/lib/postgresql/, /var/lib/mysql/, /var/lib/docker/, and everything else under /var/lib/. PostgreSQL, MySQL, and Docker all refused to start. Recovery required chown -R for each service's data directory, checking file permissions, and restarting in the correct order. The lesson: never run chown -R on a parent directory when you mean to target a specific child.
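
After an accident like that, a quick audit shows which directories under /var/lib ended up with unexpected owners. A sketch using GNU find:

# Print owner:group for every directory directly under /var/lib
find /var/lib -maxdepth 1 -type d -printf '%u:%g\t%p\n' | sort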


Cause 3: Disk Full

Databases need disk space for:

  • WAL files (Write-Ahead Log) — the transaction journal
  • Temp files — query execution, sorting
  • Logs — server log files
  • Data files — the actual tables and indexes

When the disk fills, the database can't write WAL files and shuts down or goes read-only:

PostgreSQL: PANIC: could not write to file "pg_wal/xlog": No space left on device
MySQL: InnoDB: Write to file ./ib_logfile0 failed at offset ... os_file_write_func
# Check disk space
df -h /var/lib/postgresql/
# → /dev/sda1  50G  50G  0  100% /

# Find what's eating space
du -sh /var/lib/postgresql/16/main/pg_wal/
# → 12G  pg_wal/     ← WAL files accumulated

du -sh /var/lib/postgresql/16/main/base/
# → 35G  base/       ← data files

Fix (PostgreSQL WAL buildup):

WAL files accumulate when:

  • Replication is lagging (WAL is kept until the replica catches up)
  • archive_command is failing (WAL is kept until successfully archived)
  • wal_keep_size is set too high

# Check archive status
psql -c "SELECT * FROM pg_stat_archiver;"
# If last_failed_time is recent → archive_command is broken

# Check replication lag
psql -c "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS lag_bytes FROM pg_stat_replication;"

Gotcha: Don't manually delete WAL files from pg_wal/. PostgreSQL tracks which files are needed for recovery and replication. If you delete the wrong one, you break crash recovery. Use pg_archivecleanup against an archive directory (see the sketch below) or fix the underlying cause (stuck archiver, lagging replica).
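
When the overflow is in a WAL archive directory rather than pg_wal/ itself, pg_archivecleanup removes segments older than one you name. A sketch where the archive path and segment name are placeholders; on Debian-style installs the binary lives under /usr/lib/postgresql/16/bin/:

# Dry run: -n only prints what would be removed
pg_archivecleanup -n /mnt/server/archive 000000010000000000000010
# Remove for real once the output looks right
pg_archivecleanup /mnt/server/archive 000000010000000000000010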


Cause 4: WAL Corruption / Crash Recovery

After an unclean shutdown (power loss, OOM kill, kernel panic), the database needs to replay WAL files to recover uncommitted transactions. Usually this is automatic:

PostgreSQL: LOG: database system was not properly shut down; automatic recovery in progress
PostgreSQL: LOG: redo starts at 0/3000028
PostgreSQL: LOG: redo done at 0/30000C0
PostgreSQL: LOG: database system is ready to accept connections

But if WAL files are corrupted:

PostgreSQL: PANIC: could not locate a valid checkpoint record

This is serious. The database can't find a consistent state to recover to.

# Last resort — reset the WAL (DATA LOSS POSSIBLE)
# The server must be stopped and the data directory backed up first
# This rewrites WAL control information and discards the corrupt WAL
sudo -u postgres pg_resetwal -f /var/lib/postgresql/16/main/

# Then start and check for corruption
systemctl start postgresql
psql -c "SELECT count(*) FROM pg_catalog.pg_tables;"

Gotcha: pg_resetwal is a data loss operation. It resets the transaction log, which means any transactions that were committed but not yet flushed to the data files are lost. It's the last resort — try everything else first. Always take a filesystem backup of the data directory before running it, as in the sketch below.
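
A minimal backup-first sequence under the same Debian-style paths; the backup destination is illustrative:

sudo systemctl stop postgresql             # ensure nothing is writing
sudo mkdir -p /var/backup
sudo cp -a /var/lib/postgresql/16/main/ "/var/backup/pg-main-$(date +%F)"
# Only proceed to pg_resetwal once the copy has succeeded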


Cause 5: Connection Exhaustion

The database is running but won't accept new connections:

FATAL: too many connections for role "appuser"
FATAL: sorry, too many clients already
# How many connections exist?
psql -c "SELECT count(*) FROM pg_stat_activity;"
# → 100 (the default max_connections)

# What are they doing?
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# → idle          85    ← these are wasting connection slots
# → active         5
# → idle in transaction 10  ← these are holding locks

Fix (immediate): Kill idle connections:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

Fix (permanent): Use a connection pooler:

# PgBouncer in transaction mode
# App connects to PgBouncer (port 6432)
# PgBouncer maintains a small pool to PostgreSQL (port 5432)
# 200 app connections → 20 database connections
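
If you go the PgBouncer route, the pool limits live in pgbouncer.ini. A minimal sketch of that layout; the database name, auth file, and pool sizes here are illustrative, not defaults:

[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; return server connections at transaction end
max_client_conn = 200        ; app-side limit
default_pool_size = 20       ; real PostgreSQL connections per db/user pair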

Remember: The PostgreSQL connection sizing rule is (cores × 2) + effective spindles. A 4-core server with SSDs needs about 10–20 actual database connections, not 200. More connections mean more context switching, more memory, and worse performance. Use a pooler.


The Diagnostic Ladder

Database won't start
├── Read the logs first!
│   journalctl -u postgresql -n 100
├── Stale PID file?
│   cat postmaster.pid → is that PID alive?
│   └── Dead PID → rm postmaster.pid → start
├── Wrong permissions?
│   ls -la data-directory/
│   └── Wrong owner/mode → chown + chmod → start
├── Disk full?
│   df -h data-directory/
│   └── Full → find what's eating space → clean up → start
├── WAL corruption?
│   Logs say "could not locate a valid checkpoint record"
│   └── BACKUP FIRST → pg_resetwal (last resort) → start → check data
├── Port in use?
│   ss -tlnp | grep 5432
│   └── Another process → kill it or change port → start
└── Connections exhausted? (running but won't accept new)
    psql → SELECT count(*) FROM pg_stat_activity
    └── Too many idle → terminate + add pooler

Flashcard Check

Q1: PostgreSQL won't start, postmaster.pid exists. First check?

Check whether the PID in the file still belongs to a running process (ps -p PID). Only remove the file if the process is truly dead. Two instances on one data directory = corruption.

Q2: chown -R user /var/lib/ — why is this catastrophic?

It changes ownership of every service's data directory under /var/lib/. PostgreSQL, MySQL, Docker, and others all refuse to start with wrong ownership.

Q3: WAL files are 12GB and growing. Can you delete them?

Never manually delete from pg_wal/. PostgreSQL tracks which files are needed for recovery and replication. Fix the root cause (stuck archiver, lagging replica) instead.

Q4: What is pg_resetwal and when do you use it?

It resets the transaction log. Last resort when PostgreSQL can't find a valid checkpoint. It causes data loss for any transactions not yet flushed to data files. Always backup the data directory first.

Q5: PostgreSQL has 100 connections, 85 are idle. What's the fix?

Immediate: terminate idle connections. Permanent: add PgBouncer in transaction mode. The sizing rule is cores × 2 + effective spindles ≈ 10-20 real connections.


Exercises

Exercise 1: Read the PostgreSQL logs (investigation)

If you have PostgreSQL installed:

# Find the log file
sudo ls /var/log/postgresql/

# Check recent startup messages
sudo tail -50 /var/log/postgresql/postgresql-*-main.log

# What version is running? What port?
psql -c "SHOW server_version;"
psql -c "SHOW port;"

Exercise 2: The decision (think)

For each error message, identify the cause and first command to run:

  1. FATAL: data directory "/var/lib/postgresql/16/main" has wrong ownership
  2. FATAL: lock file "postmaster.pid" already exists
  3. PANIC: could not write to file "pg_wal/xlog": No space left on device
  4. FATAL: sorry, too many clients already
Answers:

  1. Wrong permissions. ls -la /var/lib/postgresql/16/main/ to see the current state, then chown -R postgres:postgres to fix.
  2. Stale PID file. cat postmaster.pid to get the PID, then ps -p PID to check whether it's alive. If dead: rm postmaster.pid.
  3. Disk full. df -h /var/lib/postgresql/ to confirm, then du -sh pg_wal/ to check for WAL buildup. Fix the archiver or replication lag.
  4. Connection exhaustion. psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;" to see what's using connections. Kill idle connections and add a pooler.

Cheat Sheet

PostgreSQL Quick Checks

Task                   Command
Service status         systemctl status postgresql
Recent logs            journalctl -u postgresql -n 50
Check data dir         ls -la /var/lib/postgresql/16/main/
Check PID file         cat /var/lib/postgresql/16/main/postmaster.pid
Connection count       psql -c "SELECT count(*) FROM pg_stat_activity;"
WAL size               du -sh /var/lib/postgresql/16/main/pg_wal/
Replication lag        psql -c "SELECT * FROM pg_stat_replication;"
Kill idle connections  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '10 min';

MySQL Quick Checks

Task              Command
Service status    systemctl status mysql
Error log         tail -50 /var/log/mysql/error.log
Check PID/socket  ls -la /var/run/mysqld/
Connection count  mysql -e "SHOW STATUS LIKE 'Threads_connected';"
InnoDB status     mysql -e "SHOW ENGINE INNODB STATUS\G"

Takeaways

  1. Read the logs before touching anything. The error message tells you exactly what's wrong 90% of the time. Don't guess — read.

  2. Stale PID files are the #1 startup failure. After any unclean shutdown (OOM, crash, power loss), check and remove if the process is dead.

  3. Permissions must be exact. PostgreSQL requires postgres:postgres ownership and 0700 on the data directory. Any deviation and it refuses to start.

  4. Never manually delete WAL files. Fix the root cause (archiver, replication lag) instead. Deleting the wrong WAL file breaks crash recovery.

  5. Use a connection pooler. Direct connections are expensive. PgBouncer in transaction mode lets 200 app connections share 20 database connections.


Related Lessons

  • The Disk That Filled Up — when the database's disk is full
  • Permission Denied — file permissions across layers
  • The Hanging Deploy — when the app can't connect to the database