
The Database That Wouldn't Start

  • lesson
  • postgresql
  • mysql
  • wal
  • lock-files
  • data-directory-permissions
  • backup
  • recovery
  • l2

Topics: PostgreSQL, MySQL, WAL, lock files, data directory permissions, backup, recovery
Level: L2 (Operations)
Time: 60–75 minutes
Prerequisites: Basic SQL familiarity helpful but not required


The Mission

$ systemctl start postgresql
Job for postgresql.service failed because the service did not take the steps required by its unit configuration.

The database won't start. It was fine yesterday. Nobody (allegedly) changed anything. The application is returning 500 errors, and the on-call Slack channel is getting noisy.

Database startup failures are uniquely stressful because data might be at risk. Every minute of downtime is also a minute of potential data loss if you don't handle recovery correctly. This lesson covers the most common reasons databases refuse to start and how to fix each one without making things worse.


Rule Zero: Don't Panic, Don't Guess

Before you touch anything:

# 1. Read the actual error
journalctl -u postgresql -n 100
# or
tail -100 /var/log/postgresql/postgresql-16-main.log

# 2. Check the data directory
ls -la /var/lib/postgresql/16/main/

# 3. Check disk space
df -h /var/lib/postgresql/
df -i /var/lib/postgresql/

# 4. Check if something else is using the port
ss -tlnp | grep 5432

Mental Model: A database that won't start is a locked safe. You need to figure out which lock is jammed before you force anything. Forcing the wrong lock destroys data.


Cause 1: Stale Lock File (PID File)

The most common cause. PostgreSQL writes its PID to postmaster.pid on startup and removes it on clean shutdown. If the process crashes (power loss, OOM kill, kill -9), the file stays behind.

cat /var/lib/postgresql/16/main/postmaster.pid
# → 12345          ← PID from the old process
# → /var/lib/postgresql/16/main
# → 1711108800     ← epoch timestamp of start
# → 5432           ← port

# Is that PID still alive?
ps -p 12345
# → (nothing) — the old process is dead, but the PID file remains

Fix:

# Verify the process is truly dead
ps aux | grep postgres
# If nothing — safe to remove

rm /var/lib/postgresql/16/main/postmaster.pid
systemctl start postgresql

MySQL has a similar problem with mysqld.pid and the socket file:

# MySQL: stale socket
ls -la /var/run/mysqld/mysqld.sock
# → srwxrwxrwx 1 mysql mysql 0 Mar 22 ... /var/run/mysqld/mysqld.sock

rm /var/run/mysqld/mysqld.sock
systemctl start mysql
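
The PID-file half of the MySQL check mirrors the PostgreSQL procedure. A sketch assuming the common Debian/Ubuntu path:

# MySQL: stale PID file
ps -p "$(cat /var/run/mysqld/mysqld.pid)"
# → (nothing) means the recorded process is dead; safe to remove
rm /var/run/mysqld/mysqld.pid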

Gotcha: Don't remove the PID file if the process IS still running. If PostgreSQL is actually alive (maybe you started it manually and forgot), removing the PID file lets a second instance start. Two PostgreSQL instances writing to the same data directory = data corruption.
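
To make the check-then-remove sequence harder to get wrong, script it. A minimal sketch, assuming the Debian-style data directory used above:

# Pre-flight check: remove postmaster.pid only if its PID is dead
PGDATA=/var/lib/postgresql/16/main
PID=$(head -1 "$PGDATA/postmaster.pid")    # first line of the file is the PID
if ps -p "$PID" > /dev/null 2>&1; then
    echo "PID $PID is still running; do NOT remove postmaster.pid"
else
    echo "PID $PID is dead; removing stale postmaster.pid"
    rm "$PGDATA/postmaster.pid"
fi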


Cause 2: Data Directory Permissions

Databases are paranoid about file permissions — for good reason.

# PostgreSQL requires the data directory to be owned by postgres, mode 700
ls -ld /var/lib/postgresql/16/main/
# → drwx------ 19 postgres postgres 4096 Mar 22 ... /var/lib/postgresql/16/main/

# If someone ran chmod or chown on the wrong directory:
# → drwxr-xr-x 19 root root ...   ← WRONG owner, WRONG permissions

PostgreSQL refuses to start if:

  • The data directory isn't owned by the postgres user
  • The data directory permissions are more open than 0700
  • Files inside have the wrong ownership

# Fix
sudo chown -R postgres:postgres /var/lib/postgresql/16/main/
sudo chmod 700 /var/lib/postgresql/16/main/
systemctl start postgresql
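
To verify before restarting, GNU stat can print the owner and octal mode in one line:

stat -c '%U:%G %a %n' /var/lib/postgresql/16/main/
# → postgres:postgres 700 /var/lib/postgresql/16/main/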

MySQL is similarly strict:

sudo chown -R mysql:mysql /var/lib/mysql/
sudo chmod 750 /var/lib/mysql/

War Story: An engineer ran chown -R deploy:deploy /var/lib/ to fix a permission issue on their app's data directory. The -R was recursive. It changed ownership of /var/lib/postgresql/, /var/lib/mysql/, /var/lib/docker/, and everything else under /var/lib/. PostgreSQL, MySQL, and Docker all refused to start. Recovery required chown -R for each service's data directory, checking file permissions, and restarting in the correct order. The lesson: never run chown -R on a parent directory when you mean to target a specific child.
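
After an accident like that, a quick audit shows which directories under /var/lib ended up with unexpected owners. A sketch using GNU find:

# Print owner:group for every directory directly under /var/lib
find /var/lib -maxdepth 1 -type d -printf '%u:%g\t%p\n' | sort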


Cause 3: Disk Full

Databases need disk space for:

  • WAL files (Write-Ahead Log) — the transaction journal
  • Temp files — query execution, sorting
  • Logs — server log files
  • Data files — the actual tables and indexes

When the disk fills, the database can't write WAL files and shuts down or goes read-only:

PostgreSQL: PANIC: could not write to file "pg_wal/xlog": No space left on device
MySQL: InnoDB: Write to file ./ib_logfile0 failed at offset ... os_file_write_func
# Check disk space
df -h /var/lib/postgresql/
# → /dev/sda1  50G  50G  0  100% /

# Find what's eating space
du -sh /var/lib/postgresql/16/main/pg_wal/
# → 12G  pg_wal/     ← WAL files accumulated

du -sh /var/lib/postgresql/16/main/base/
# → 35G  base/       ← data files

Fix (PostgreSQL WAL buildup):

WAL files accumulate when:

  • Replication is lagging (WAL is kept until the replica catches up)
  • archive_command is failing (WAL is kept until successfully archived)
  • wal_keep_size is set too high

# Check archive status
psql -c "SELECT * FROM pg_stat_archiver;"
# If last_failed_time is recent → archive_command is broken

# Check replication lag
psql -c "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS lag_bytes FROM pg_stat_replication;"

Gotcha: Don't manually delete WAL files from pg_wal/. PostgreSQL tracks which files are needed for recovery and replication. If you delete the wrong one, you break crash recovery. Use pg_archivecleanup against an archive directory (see the sketch below) or fix the underlying cause (stuck archiver, lagging replica).
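
When the overflow is in a WAL archive directory rather than pg_wal/ itself, pg_archivecleanup removes segments older than one you name. A sketch where the archive path and segment name are placeholders; on Debian-style installs the binary lives under /usr/lib/postgresql/16/bin/:

# Dry run: -n only prints what would be removed
pg_archivecleanup -n /mnt/server/archive 000000010000000000000010
# Remove for real once the output looks right
pg_archivecleanup /mnt/server/archive 000000010000000000000010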


Cause 4: WAL Corruption / Crash Recovery

After an unclean shutdown (power loss, OOM kill, kernel panic), the database needs to replay WAL files to recover uncommitted transactions. Usually this is automatic:

PostgreSQL: LOG: database system was not properly shut down; automatic recovery in progress
PostgreSQL: LOG: redo starts at 0/3000028
PostgreSQL: LOG: redo done at 0/30000C0
PostgreSQL: LOG: database system is ready to accept connections

But if WAL files are corrupted:

PostgreSQL: PANIC: could not locate a valid checkpoint record

This is serious. The database can't find a consistent state to recover to.

# Last resort — reset the WAL (DATA LOSS POSSIBLE)
# The server must be stopped and the data directory backed up first
# This rewrites WAL control information and discards the corrupt WAL
sudo -u postgres pg_resetwal -f /var/lib/postgresql/16/main/

# Then start and check for corruption
systemctl start postgresql
psql -c "SELECT count(*) FROM pg_catalog.pg_tables;"

Gotcha: pg_resetwal is a data loss operation. It resets the transaction log, which means any transactions that were committed but not yet flushed to the data files are lost. It's the last resort — try everything else first. Always take a filesystem backup of the data directory before running it, as in the sketch below.
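
A minimal backup-first sequence under the same Debian-style paths; the backup destination is illustrative:

sudo systemctl stop postgresql             # ensure nothing is writing
sudo mkdir -p /var/backup
sudo cp -a /var/lib/postgresql/16/main/ "/var/backup/pg-main-$(date +%F)"
# Only proceed to pg_resetwal once the copy has succeeded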


Cause 5: Connection Exhaustion

The database is running but won't accept new connections:

FATAL: too many connections for role "appuser"
FATAL: sorry, too many clients already
# How many connections exist?
psql -c "SELECT count(*) FROM pg_stat_activity;"
# → 100 (the default max_connections)

# What are they doing?
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# → idle          85    ← these are wasting connection slots
# → active         5
# → idle in transaction 10  ← these are holding locks

Fix (immediate): Kill idle connections:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

Fix (permanent): Use a connection pooler:

# PgBouncer in transaction mode
# App connects to PgBouncer (port 6432)
# PgBouncer maintains a small pool to PostgreSQL (port 5432)
# 200 app connections → 20 database connections
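
If you go the PgBouncer route, the pool limits live in pgbouncer.ini. A minimal sketch of that layout; the database name, auth file, and pool sizes here are illustrative, not defaults:

[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; return server connections at transaction end
max_client_conn = 200        ; app-side limit
default_pool_size = 20       ; real PostgreSQL connections per db/user pair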

Remember: The PostgreSQL connection sizing rule is (cores × 2) + effective spindles. A 4-core server with SSDs needs about 10–20 actual database connections, not 200. More connections mean more context switching, more memory, and worse performance. Use a pooler.


The Diagnostic Ladder

Database won't start
├── Read the logs first!
│   journalctl -u postgresql -n 100
├── Stale PID file?
│   cat postmaster.pid → is that PID alive?
│   └── Dead PID → rm postmaster.pid → start
├── Wrong permissions?
│   ls -la data-directory/
│   └── Wrong owner/mode → chown + chmod → start
├── Disk full?
│   df -h data-directory/
│   └── Full → find what's eating space → clean up → start
├── WAL corruption?
│   Logs say "could not locate a valid checkpoint record"
│   └── BACKUP FIRST → pg_resetwal (last resort) → start → check data
├── Port in use?
│   ss -tlnp | grep 5432
│   └── Another process → kill it or change port → start
└── Connections exhausted? (running but won't accept new)
    psql → SELECT count(*) FROM pg_stat_activity
    └── Too many idle → terminate + add pooler

Flashcard Check

Q1: PostgreSQL won't start, postmaster.pid exists. First check?

Check whether the PID in the file still belongs to a running process (ps -p PID). Only remove the file if the process is truly dead. Two instances on one data directory = corruption.

Q2: chown -R user /var/lib/ — why is this catastrophic?

It changes ownership of every service's data directory under /var/lib/. PostgreSQL, MySQL, Docker, and others all refuse to start with wrong ownership.

Q3: WAL files are 12GB and growing. Can you delete them?

Never manually delete from pg_wal/. PostgreSQL tracks which files are needed for recovery and replication. Fix the root cause (stuck archiver, lagging replica) instead.

Q4: What is pg_resetwal and when do you use it?

It resets the transaction log. Last resort when PostgreSQL can't find a valid checkpoint. It causes data loss for any transactions not yet flushed to data files. Always backup the data directory first.

Q5: PostgreSQL has 100 connections, 85 are idle. What's the fix?

Immediate: terminate idle connections. Permanent: add PgBouncer in transaction mode. The sizing rule is cores × 2 + effective spindles ≈ 10-20 real connections.


Exercises

Exercise 1: Read the PostgreSQL logs (investigation)

If you have PostgreSQL installed:

# Find the log file
sudo ls /var/log/postgresql/

# Check recent startup messages
sudo tail -50 /var/log/postgresql/postgresql-*-main.log

# What version is running? What port?
psql -c "SHOW server_version;"
psql -c "SHOW port;"

Exercise 2: The decision (think)

For each error message, identify the cause and first command to run:

  1. FATAL: data directory "/var/lib/postgresql/16/main" has wrong ownership
  2. FATAL: lock file "postmaster.pid" already exists
  3. PANIC: could not write to file "pg_wal/xlog": No space left on device
  4. FATAL: sorry, too many clients already
Answers:

  1. Wrong permissions. ls -la /var/lib/postgresql/16/main/ to see the current state, then chown -R postgres:postgres to fix.
  2. Stale PID file. cat postmaster.pid to get the PID, then ps -p PID to check whether it's alive. If dead: rm postmaster.pid.
  3. Disk full. df -h /var/lib/postgresql/ to confirm, then du -sh pg_wal/ to check for WAL buildup. Fix the archiver or replication lag.
  4. Connection exhaustion. psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;" to see what's using connections. Kill idle connections and add a pooler.

Cheat Sheet

PostgreSQL Quick Checks

Task                   Command
Service status         systemctl status postgresql
Recent logs            journalctl -u postgresql -n 50
Check data dir         ls -la /var/lib/postgresql/16/main/
Check PID file         cat /var/lib/postgresql/16/main/postmaster.pid
Connection count       psql -c "SELECT count(*) FROM pg_stat_activity;"
WAL size               du -sh /var/lib/postgresql/16/main/pg_wal/
Replication lag        psql -c "SELECT * FROM pg_stat_replication;"
Kill idle connections  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '10 min';

MySQL Quick Checks

Task              Command
Service status    systemctl status mysql
Error log         tail -50 /var/log/mysql/error.log
Check PID/socket  ls -la /var/run/mysqld/
Connection count  mysql -e "SHOW STATUS LIKE 'Threads_connected';"
InnoDB status     mysql -e "SHOW ENGINE INNODB STATUS\G"

Takeaways

  1. Read the logs before touching anything. The error message tells you exactly what's wrong 90% of the time. Don't guess — read.

  2. Stale PID files are the #1 startup failure. After any unclean shutdown (OOM, crash, power loss), check and remove if the process is dead.

  3. Permissions must be exact. PostgreSQL requires postgres:postgres ownership and 0700 on the data directory. Any deviation and it refuses to start.

  4. Never manually delete WAL files. Fix the root cause (archiver, replication lag) instead. Deleting the wrong WAL file breaks crash recovery.

  5. Use a connection pooler. Direct connections are expensive. PgBouncer in transaction mode lets 200 app connections share 20 database connections.


Related Lessons

  • The Disk That Filled Up — when the database's disk is full
  • Permission Denied — file permissions across layers
  • The Hanging Deploy — when the app can't connect to the database