
MySQL / MariaDB Operations — Street-Level Ops

Quick Diagnosis Commands

# Is MySQL alive?
mysqladmin -u root -p ping
mysqladmin -u root -p status

# What's running right now?
mysql -e "SHOW FULL PROCESSLIST\G" | less

# Key health metrics at a glance
mysqladmin -u root -p extended-status | grep -E \
  "Threads_connected|Slow_queries|Questions|Com_select|Com_insert|Com_update|Com_delete|Innodb_buffer_pool_read_requests|Innodb_buffer_pool_reads"

# Check replication status (replica)
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Running|Seconds|Error|Master_Log"

# Disk space (MySQL data dir)
du -sh /var/lib/mysql/
df -h /var/lib/mysql

One-liner: SHOW FULL PROCESSLIST is the MySQL equivalent of ps aux. When the database is slow, run it first. Look for queries sitting in the "Sending data" state with high Time values; those are usually your blockers.
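The triage step above can be scripted. A sketch (the long_senders helper and the 10-second threshold are illustrative, not standard tooling) that filters the tab-separated output of mysql -B down to long-running "Sending data" threads:

```shell
# Filter a tab-separated SHOW FULL PROCESSLIST dump down to threads that have
# been in "Sending data" longer than a threshold (seconds).
# Expected columns: Id User Host db Command Time State Info
long_senders() {
  threshold="${1:-10}"
  awk -F'\t' -v t="$threshold" '$7 == "Sending data" && $6 + 0 > t {
    print $1, $6, $8    # thread id, seconds running, query text
  }'
}

# Usage: mysql -B -e "SHOW FULL PROCESSLIST" | long_senders 10
```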

Gotcha: Replica Lag Grows Monotonically

Seconds_Behind_Master keeps growing. The SQL thread is falling behind.

Rule: The replica SQL thread runs single-threaded by default on everything before MySQL 8.0.27 (slave_parallel_workers defaults to 0); MySQL 5.5 and earlier cannot parallelize at all. Parallel replication requires configuration.

-- Check if SQL thread is running
SHOW SLAVE STATUS\G
-- Slave_SQL_Running: Yes = running but slow
-- Slave_SQL_Running: No = stopped (check Last_SQL_Error)

-- Enable parallel replication (MySQL 5.7+)
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 4;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';  -- better than DATABASE
START SLAVE SQL_THREAD;

-- Monitor slave thread activity
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- If lag is from a huge transaction, wait it out
-- Check what the SQL thread is currently executing:
SELECT * FROM performance_schema.replication_applier_status_by_coordinator\G
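To tell whether lag is actually shrinking after enabling parallel workers, sample it over time. A sketch; lag_from_status is a hypothetical helper that just parses Seconds_Behind_Master out of SHOW SLAVE STATUS\G text:

```shell
# Extract Seconds_Behind_Master from "SHOW SLAVE STATUS\G" output on stdin.
lag_from_status() {
  awk -F': ' '/Seconds_Behind_Master/ { print $2 }'
}

# Poll every 10 seconds and watch the trend (Ctrl-C to stop):
# while true; do mysql -e "SHOW SLAVE STATUS\G" | lag_from_status; sleep 10; done
```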

Gotcha: "Too Many Connections" Error

Application gets ERROR 1040 (HY000): Too many connections.

Rule: MySQL has a hard ceiling (max_connections). When it is reached, new clients get this error too, but MySQL reserves one extra connection above the limit for a SUPER-privileged account, so root can usually still connect to diagnose.

-- Check current state (connect as root from the server itself)
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL VARIABLES LIKE 'max_connections';

-- Immediate fix: increase limit (runtime, no restart needed)
SET GLOBAL max_connections = 500;

-- Persistent fix in my.cnf:
-- max_connections = 500

-- Find who is holding connections
SELECT user, host, db, command, time, state
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

-- Kill sleeping connections older than 60 seconds (generate KILL statements)
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 60 AND user != 'replication';
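The three checks at the top of this section can be collapsed into one headroom alert. A sketch; conn_headroom and the 90% threshold are illustrative, and the usage lines assume credentials come from ~/.my.cnf:

```shell
# Report Threads_connected against max_connections and flag >90% usage.
# stdin: current connection count on line 1, max_connections on line 2.
conn_headroom() {
  awk 'NR == 1 { cur = $1 } NR == 2 { max = $1 } END {
    pct = 100 * cur / max
    printf "%d/%d (%.0f%%)%s\n", cur, max, pct, (pct > 90 ? " WARNING" : "")
  }'
}

# Usage:
# { mysql -BNe "SHOW GLOBAL STATUS LIKE 'Threads_connected'"   | awk '{print $2}'
#   mysql -BNe "SHOW GLOBAL VARIABLES LIKE 'max_connections'"  | awk '{print $2}'
# } | conn_headroom
```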

With ProxySQL: check if the connection pool is exhausted:

-- On ProxySQL admin port 6032
SELECT hostgroup, srv_host, status, ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool;


Gotcha: Replication Broke — Duplicate Key Error

Last_SQL_Error: Could not execute Write_rows event on table mydb.users;
Duplicate entry '12345' for key 'PRIMARY'

The replica has a row the primary doesn't (out-of-band write to replica, or replica was used for a write that also went to primary).

-- Option 1: Skip one event (quick fix, may mask data divergence)
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;

-- With GTID: you can't use skip_counter; inject an empty transaction instead
STOP SLAVE;
SET GTID_NEXT = 'uuid:transaction_id';   -- the GTID shown in Last_SQL_Error
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;

-- Option 2: Verify replica is actually in sync (then fix root cause)
pt-table-checksum --host=primary --databases=mydb --tables=users
-- If drifted, resync: pt-table-sync --execute h=primary,D=mydb,t=users h=replica

-- Option 3: If drift is large, rebuild the replica from a fresh backup

Default trap: sql_slave_skip_counter counts events, not transactions, and a multi-statement transaction spans multiple events. The server does keep skipping until the end of the current event group, but with non-transactional tables each statement is its own group, so you can easily skip more than you intended, and the counter cannot be used at all with GTID replication. The empty-transaction injection method is safer because it operates on exactly one transaction.
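Finding the exact GTID to inject is the fiddly part. One heuristic is to take the top of Executed_Gtid_Set and add one; the sketch below assumes a single source UUID, a single contiguous interval (e.g. uuid:1-1000), and an unwrapped status line, so always cross-check against Last_SQL_Error before injecting:

```shell
# Guess the next GTID to apply from Executed_Gtid_Set ("uuid:1-1000" -> "uuid:1001").
# Heuristic only: breaks on multiple source UUIDs, gaps, or wrapped output lines.
next_gtid() {
  awk -F': ' '/Executed_Gtid_Set/ {
    split($2, a, ":")        # a[1] = source UUID, a[2] = interval like "1-1000"
    m = split(a[2], r, "-")  # r[m] = highest applied transaction number
    printf "%s:%d\n", a[1], r[m] + 1
  }'
}

# Usage: mysql -e "SHOW SLAVE STATUS\G" | next_gtid
```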


Pattern: Emergency Schema Change on a Live Table

A schema change (ALTER TABLE) must take an exclusive metadata lock. If any open transaction is still touching the table, the ALTER queues behind it, and every query arriving after the ALTER queues behind that, blocking all reads and writes on the table. On a 100 GB table a full rebuild can take 30+ minutes.

# Use pt-online-schema-change instead
pt-online-schema-change \
  --alter "ADD COLUMN metadata JSON NULL AFTER updated_at" \
  --host=primary.db.example.com \
  --user=root --password=secret \
  --chunk-size=1000 \
  --max-load "Threads_running=25" \
  --critical-load "Threads_running=50" \
  --execute \
  D=mydb,t=users

# MySQL 8.0+ Online DDL (for some operations — check docs for which ones are instant)
ALTER TABLE users ADD COLUMN metadata JSON NULL, ALGORITHM=INSTANT;
ALTER TABLE users ADD INDEX idx_email (email), ALGORITHM=INPLACE, LOCK=NONE;
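Even with LOCK=NONE, the ALTER still needs a brief metadata lock and will queue behind any open transaction. A common safeguard is to cap how long it may wait; bounded_ddl below is a hypothetical wrapper that just prepends a short session lock_wait_timeout to the statement:

```shell
# Emit a DDL statement preceded by a 5-second metadata-lock timeout, so a
# blocked ALTER gives up instead of stalling every query behind it.
bounded_ddl() {
  printf 'SET SESSION lock_wait_timeout = 5;\n%s\n' "$1"
}

# Usage:
# bounded_ddl "ALTER TABLE users ADD INDEX idx_email (email), ALGORITHM=INPLACE, LOCK=NONE;" | mysql mydb
```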

Pattern: Identifying and Killing Blocking Transactions

-- Who is blocking writes? (5.7 syntax; information_schema.innodb_lock_waits
-- was removed in 8.0, use sys.innodb_lock_waits instead)
SELECT
  r.trx_mysql_thread_id AS waiting_thread,
  SUBSTRING(r.trx_query, 1, 80) AS waiting_query,
  b.trx_mysql_thread_id AS blocking_thread,
  SUBSTRING(b.trx_query, 1, 80) AS blocking_query,
  b.trx_started AS blocking_started,
  TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_seconds
FROM information_schema.innodb_lock_waits w
  JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
  JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
ORDER BY blocking_seconds DESC;

-- Kill the blocker
KILL 54321;  -- blocking_thread value

-- In MySQL 8.0: use performance_schema for more detail
SELECT * FROM performance_schema.data_lock_waits\G

Scenario: Database Won't Start After Crash

# Step 1: Read the error log
tail -100 /var/log/mysql/error.log
# Common errors:
# "InnoDB: Corruption in the InnoDB tablespace" -> force recovery
# "Table './mydb/users' is marked as crashed" -> MyISAM crash
# "Can't create test file" -> permission issue on data dir

# Step 2: Permission issue fix
ls -la /var/lib/mysql/
chown -R mysql:mysql /var/lib/mysql/
chmod 750 /var/lib/mysql/
systemctl start mysql

# Step 3: InnoDB crash recovery
# Edit /etc/mysql/mysql.conf.d/mysqld.cnf
# Add: innodb_force_recovery = 1   # raise one level at a time (max 6); 4+ disables writes
systemctl start mysql
# If it starts: dump immediately
mysqldump --all-databases > /backup/emergency_$(date +%Y%m%d_%H%M%S).sql
# Then rebuild: remove innodb_force_recovery, restore from dump

# Step 4: MyISAM table repair
mysqlcheck -u root -p --repair --all-databases
# Or specific table:
mysql -e "REPAIR TABLE mydb.legacy_table;"

Scenario: Slow Application — Database CPU at 100%

-- Step 1: See what's running
SHOW FULL PROCESSLIST;
-- Look for long-running queries in the State column

-- Step 2: Find the top offenders via performance_schema
SELECT digest_text,
       count_star AS exec_count,
       ROUND(sum_timer_wait / 1e12, 2) AS total_sec,
       ROUND(avg_timer_wait / 1e12, 4) AS avg_sec,
       ROUND(sum_rows_examined / count_star) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE last_seen > NOW() - INTERVAL 10 MINUTE
ORDER BY sum_timer_wait DESC
LIMIT 10;

-- Step 3: EXPLAIN the bad queries
EXPLAIN FORMAT=JSON SELECT ...;

-- Step 4: Generate KILL QUERY statements for the worst offenders (review, then run them)
SELECT CONCAT('KILL QUERY ', id, ';')
FROM information_schema.processlist
WHERE time > 30 AND user != 'replication'
ORDER BY time DESC;
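To actually execute the kills rather than eyeballing them, generate the statements outside SQL and pipe them back in. kill_stmts is an illustrative helper operating on id<TAB>time rows from mysql -NB output:

```shell
# Turn "id<TAB>time" rows into KILL QUERY statements for threads over a
# threshold (default 30 s). Review the output before piping it to mysql.
kill_stmts() {
  threshold="${1:-30}"
  awk -F'\t' -v t="$threshold" '$2 + 0 > t { printf "KILL QUERY %d;\n", $1 }'
}

# Usage:
# mysql -NBe "SELECT id, time FROM information_schema.processlist WHERE user <> 'replication'" \
#   | kill_stmts 30 | mysql
```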

Emergency: Disk Full — MySQL Stopped Writing

# MySQL logs: "ERROR 28 from storage engine" or "ERROR 1030 (HY000): Got error 28"

# Step 1: Identify what's consuming space
du -sh /var/lib/mysql/*/   # per-database directories
ls -lSh /var/lib/mysql/*/*.ibd | head -20  # largest InnoDB tablespace files

# Step 2: Free space immediately
# Purge old binary logs (only if replicas are caught up)
mysql -e "SHOW SLAVE HOSTS;"   # check replicas exist (only lists replicas started with report_host)
mysql -e "PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;"
# Or purge to a specific file
mysql -e "PURGE BINARY LOGS TO 'mysql-bin.000080';"

# Step 3: If a table has grown unexpectedly (e.g., a log table)
mysql -e "TRUNCATE TABLE mydb.application_logs;"
# Or: DELETE in batches (TRUNCATE is faster, but it is DDL: it cannot be rolled back)

# Step 4: Reclaim space from deleted rows (InnoDB doesn't shrink files automatically)
mysql -e "OPTIMIZE TABLE mydb.large_table;"
# Warning: OPTIMIZE rebuilds the table and needs free space roughly equal to its size
# Or use pt-online-schema-change with a no-op alter to rebuild the file

# Step 5: Increase disk or add volume and symlink

Useful One-Liners

# Quick replica health check
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Running|Seconds|Error"

# Find tables with most rows
mysql -e "SELECT table_schema, table_name, table_rows, ROUND(data_length/1024/1024,1) AS data_mb
          FROM information_schema.tables ORDER BY data_length DESC LIMIT 20;"

# Show current variables (buffer pool, connections)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer%';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads%';"

# List indexes on a table
mysql -e "SHOW INDEX FROM mydb.orders\G"

# Show running queries older than 5 seconds
mysql -e "SELECT id,user,host,db,time,state,left(info,80) FROM information_schema.processlist WHERE time > 5 ORDER BY time DESC;"

# Replica: inject an empty transaction for one failing GTID
# (substitute the exact GTID by hand; Retrieved_Gtid_Set is a range, not a single
# GTID, so don't try to extract it with grep)
mysql -e "STOP SLAVE; SET GTID_NEXT='uuid:transaction_id'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC'; START SLAVE;"

# Check table sizes
mysql -e "SELECT CONCAT(table_schema,'.',table_name) AS tbl,
  ROUND((data_length+index_length)/1024/1024,1) AS total_mb
  FROM information_schema.tables ORDER BY total_mb DESC LIMIT 20;"

# Flush slow query log without restart
mysql -e "FLUSH SLOW LOGS;"

Debug clue: If Innodb_buffer_pool_reads is high relative to Innodb_buffer_pool_read_requests, InnoDB is reading from disk instead of cache. The hit ratio, (1 - reads / read_requests) * 100, should stay above 99%. If it drops below that, the buffer pool is too small: increase innodb_buffer_pool_size; the general rule is 70-80% of total RAM on a dedicated database server.
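The hit-ratio arithmetic can be scripted; a sketch (bp_hit_ratio is an illustrative name, and the usage lines assume credentials from ~/.my.cnf):

```shell
# Compute the buffer pool hit ratio from the two counters.
# stdin: Innodb_buffer_pool_reads on line 1, Innodb_buffer_pool_read_requests on line 2.
bp_hit_ratio() {
  awk 'NR == 1 { reads = $1 } NR == 2 { reqs = $1 } END {
    printf "%.2f%%\n", (1 - reads / reqs) * 100
  }'
}

# Usage:
# { mysql -BNe "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'"          | awk '{print $2}'
#   mysql -BNe "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'"  | awk '{print $2}'
# } | bp_hit_ratio
```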