Portal | Level: L1: Foundations | Topics: MySQL / MariaDB Operations | Domain: DevOps & Tooling
MySQL / MariaDB Operations — Primer¶
Why This Matters¶
MySQL and MariaDB run in nearly every production environment. As an operator, you are responsible for replication health, query performance, backups that actually restore, and keeping the server alive under load. The database is usually the hardest component to scale horizontally and the most painful to recover when something goes wrong. Understanding InnoDB internals, replication mechanics, and the right tools for schema changes and backups is the difference between a 5-minute fix and a 4-hour outage.
Core Concepts¶
1. InnoDB Internals¶
InnoDB is the default storage engine. Almost everything performance-related traces back to its internal structures.
Name origin: MySQL was created by Michael "Monty" Widenius and David Axmark in 1995. "My" is Monty's daughter's name. Sun Microsystems bought MySQL AB in 2008, and when Oracle moved to acquire Sun (completed in 2010), Monty forked the project as MariaDB, named after his younger daughter, Maria. InnoDB was originally developed by Innobase Oy, a Finnish company, which is where the "Inno" comes from.
Buffer Pool The buffer pool is InnoDB's in-memory cache for data pages and index pages. This is the most important tuning knob.
# my.cnf
[mysqld]
innodb_buffer_pool_size = 12G # 70-80% of RAM for dedicated DB servers
innodb_buffer_pool_instances = 8 # reduce contention; 1 per GiB up to 8-16
innodb_buffer_pool_chunk_size = 128M # pool size is rounded up to a multiple of chunk_size * instances
-- Check hit ratio (should be > 99%)
SELECT (1 - (
SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads'
) / (
SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests'
)) * 100 AS buffer_pool_hit_rate;
-- Current buffer pool usage
SHOW ENGINE INNODB STATUS\G -- look for BUFFER POOL AND MEMORY section
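The same hit-rate arithmetic is handy in a monitoring script. A minimal sketch (the counter values below are illustrative; in practice you would read Innodb_buffer_pool_reads and Innodb_buffer_pool_read_requests from the server):

```python
def buffer_pool_hit_rate(reads: int, read_requests: int) -> float:
    """Hit rate (%) from InnoDB status counters.

    reads         = Innodb_buffer_pool_reads (pages read from disk)
    read_requests = Innodb_buffer_pool_read_requests (logical reads)
    """
    if read_requests == 0:
        return 100.0  # no reads yet, so nothing has missed the cache
    return (1 - reads / read_requests) * 100

# Illustrative counters: 12,000 disk reads out of 4,000,000 logical reads
rate = buffer_pool_hit_rate(12_000, 4_000_000)
print(f"{rate:.2f}%")  # 99.70% -- above the 99% rule of thumb
```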
Redo Log (Write-Ahead Log) InnoDB writes changes to the redo log before applying them to data pages. On crash recovery, the redo log replays committed changes that had not yet reached the data files; uncommitted transactions are then rolled back using the undo log.
innodb_log_file_size = 1G # larger = less frequent checkpoints = better write throughput
innodb_log_files_in_group = 2 # total redo log = log_file_size * files_in_group
# MySQL 8.0.30+ replaces both settings with a single innodb_redo_log_capacity
innodb_flush_log_at_trx_commit = 1 # 1 = flush on every commit (ACID); 2 = write on commit, flush ~1/sec; 0 = write and flush ~1/sec
# Default trap: setting this to 0 or 2 for "performance" means you can lose
# up to 1 second of committed transactions on crash. Only acceptable for
# replicas or non-critical data where you can replay from the primary.
Undo Log and MVCC InnoDB uses MVCC (Multi-Version Concurrency Control) for non-locking reads. Each row stores a pointer to undo log entries for old versions. Long-running transactions cause undo log bloat.
-- Check undo log size
SELECT name, subsystem, status, comment
FROM information_schema.innodb_metrics
WHERE name LIKE 'trx_rseg%';
-- Find long-running transactions causing undo bloat
SELECT trx_id, trx_started, trx_state, trx_query,
TIME_TO_SEC(TIMEDIFF(NOW(), trx_started)) AS seconds_running
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 10;
2. EXPLAIN and Query Plan Reading¶
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending';
-- Extended: shows filtered rows estimate and extra info
EXPLAIN FORMAT=JSON SELECT ...;
-- Run the query and see actual row counts
EXPLAIN ANALYZE SELECT ...; -- MySQL 8.0.18+
Key fields in EXPLAIN output:
| Field | What to watch for |
|---|---|
| type | ALL = full table scan (bad). ref, eq_ref, const = good. |
| key | Which index was used. NULL = no index. |
| rows | Estimated rows examined. High = potential performance issue. |
| Extra | Using filesort and Using temporary are expensive. |
| filtered | % of rows remaining after the condition. A low % means many rows were scanned and discarded. |
Remember: Mnemonic for EXPLAIN danger signs: "ATF" -- ALL, Temporary, Filesort. If you see type=ALL (full table scan), Using temporary, or Using filesort in EXPLAIN output, the query needs optimization. These three are responsible for the vast majority of slow-query incidents.
-- Example: identify missing index
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- If type=ALL and rows=1000000, you need an index on customer_id
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Composite index: leftmost prefix rule
-- This index covers (a), (a,b), (a,b,c) but NOT (b) or (b,c)
CREATE INDEX idx_composite ON orders(customer_id, status, created_at);
-- Force a specific index (for testing)
SELECT * FROM orders FORCE INDEX (idx_orders_customer) WHERE customer_id = 123;
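The leftmost prefix rule can be sketched as a tiny predicate. This is an illustrative helper, not any MySQL API, and a simplified model: it considers equality predicates only and ignores range conditions and index skip scan:

```python
def usable_prefix(index_cols: list[str], predicate_cols: set[str]) -> list[str]:
    """Return the leading index columns the query's equality
    predicates can actually use (leftmost prefix rule)."""
    prefix = []
    for col in index_cols:
        if col not in predicate_cols:
            break  # gap in the prefix: remaining columns are unusable
        prefix.append(col)
    return prefix

idx = ["customer_id", "status", "created_at"]
print(usable_prefix(idx, {"customer_id", "status"}))    # ['customer_id', 'status']
print(usable_prefix(idx, {"status"}))                   # [] -- index not usable
print(usable_prefix(idx, {"customer_id", "created_at"}))  # ['customer_id'] only
```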
3. Slow Query Log¶
# my.cnf — enable slow query log
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1 # log queries taking more than 1 second
log_queries_not_using_indexes = 1
min_examined_row_limit = 1000 # don't log fast full-scans of tiny tables
# Enable at runtime without restart
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5; # lower threshold to find more
# Analyze with pt-query-digest (Percona Toolkit)
pt-query-digest /var/log/mysql/slow.log
pt-query-digest /var/log/mysql/slow.log --since 1h   # only events from the last hour
pt-query-digest /var/log/mysql/slow.log --filter '$event->{user} eq "appuser"'
4. Replication — GTID vs Binlog Position¶
Binlog Position Replication (traditional)
-- On primary: find current position
SHOW MASTER STATUS;
-- Returns: File=mysql-bin.000042, Position=1234567
-- On replica:
CHANGE MASTER TO
MASTER_HOST='primary.db.example.com',
MASTER_USER='replication',
MASTER_PASSWORD='secret',
MASTER_LOG_FILE='mysql-bin.000042',
MASTER_LOG_POS=1234567;
START SLAVE;        -- MySQL 8.0.22+: START REPLICA
SHOW SLAVE STATUS\G -- MySQL 8.0.22+: SHOW REPLICA STATUS\G
GTID Replication (preferred for MySQL 5.7+)
GTIDs (Global Transaction Identifiers) take the form server_uuid:transaction_id. Every transaction gets a globally unique ID, which makes failover and repointing replicas much easier.
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_slave_updates = ON # required for chained replication
binlog_format = ROW
-- On replica with GTIDs:
CHANGE MASTER TO
MASTER_HOST='primary.db.example.com',
MASTER_USER='replication',
MASTER_PASSWORD='secret',
MASTER_AUTO_POSITION=1; -- no file/position needed
START SLAVE;
-- Check GTID executed sets
SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
SHOW SLAVE STATUS\G
-- Key field: Seconds_Behind_Master
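The failover convenience comes from GTID set arithmetic: a candidate is safe to promote only if its gtid_executed covers the sets of the replicas it will serve. A simplified sketch of that containment check (real gtid_executed values can contain multiple UUIDs and comma-separated interval lists; this handles only the single-interval form):

```python
def parse_gtid(gtid_set: str) -> tuple[str, int, int]:
    """Parse 'server_uuid:first-last' (or 'server_uuid:n')."""
    uuid, _, rng = gtid_set.partition(":")
    first, _, last = rng.partition("-")
    return uuid, int(first), int(last or first)

def covers(a: str, b: str) -> bool:
    """True if executed set `a` contains executed set `b`."""
    ua, fa, la = parse_gtid(a)
    ub, fb, lb = parse_gtid(b)
    return ua == ub and fa <= fb and la >= lb

primary = "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-1000"
replica = "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-997"
print(covers(primary, replica))  # True -- replica is 3 transactions behind
```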
Replica Lag
-- Check replica lag
SHOW SLAVE STATUS\G
-- Seconds_Behind_Master: NULL = not replicating, 0 = up to date, N = N seconds behind
# Deeper lag analysis via heartbeat (Percona pt-heartbeat)
pt-heartbeat --database=percona --update   # run on primary
pt-heartbeat --database=percona --monitor  # run on replica
-- Check if replica is actually applying SQL or just receiving
-- Exec_Master_Log_Pos vs Read_Master_Log_Pos — if they diverge, SQL thread is behind
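pt-heartbeat's trick is just timestamp arithmetic: the --update daemon writes the current time into a row on the primary, and lag on the replica is the replica's clock minus the replicated timestamp. A minimal sketch of that computation (timestamps are illustrative):

```python
from datetime import datetime

def heartbeat_lag(replicated_ts: datetime, replica_now: datetime) -> float:
    """Replica lag in seconds, from the replicated heartbeat row."""
    return (replica_now - replicated_ts).total_seconds()

# Heartbeat written at 10:00:00 on the primary, observed at 10:00:07 on the replica
lag = heartbeat_lag(datetime(2024, 1, 15, 10, 0, 0),
                    datetime(2024, 1, 15, 10, 0, 7))
print(lag)  # 7.0 seconds behind
```

Unlike Seconds_Behind_Master, this measures end-to-end delay even when the I/O thread is stalled, which is why pt-heartbeat is the more trustworthy number.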
5. ProxySQL for Connection Pooling¶
ProxySQL sits between your application and MySQL, pooling connections and routing reads vs writes.
-- Connect to ProxySQL admin interface (port 6032)
mysql -u admin -padmin -h 127.0.0.1 -P 6032
-- Add backend servers
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (1,'primary',3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (2,'replica1',3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (2,'replica2',3306);
-- Routing rules: writes to HG1, reads to HG2
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES (1, 1, '^SELECT', 2); -- reads to HG2
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES (2, 1, '.*', 1); -- everything else to HG1 (primary)
-- Apply changes (required after every modification)
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
-- Monitor connections
SELECT * FROM stats_mysql_connection_pool;
SELECT * FROM stats_mysql_commands_counters WHERE total_cnt > 0;
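The routing table above can be modeled as an ordered list of regex rules evaluated in rule_id order. This is a simplified model of ProxySQL's matching (it ignores the apply flag and rule chaining), but it makes a classic trap visible: a bare ^SELECT pattern also routes SELECT ... FOR UPDATE, which needs the primary's locks, to the read replicas.

```python
import re

# (rule_id, pattern, destination_hostgroup) -- mirrors mysql_query_rules above
RULES = [
    (1, r"^SELECT", 2),  # reads -> replicas (HG2)
    (2, r".*",      1),  # everything else -> primary (HG1)
]

def route(query: str) -> int:
    """First matching rule wins (simplified model of ProxySQL matching)."""
    for _, pattern, hostgroup in sorted(RULES):
        if re.search(pattern, query):
            return hostgroup
    return 1  # fall back to the primary hostgroup

print(route("SELECT * FROM orders"))             # 2 (replica)
print(route("UPDATE orders SET status='paid'"))  # 1 (primary)
print(route("SELECT * FROM orders FOR UPDATE"))  # 2 -- trap! this needs the primary
```

In practice a higher-priority rule matching SELECT ... FOR UPDATE and routing it to the write hostgroup is added above the plain ^SELECT rule.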
6. Percona Toolkit¶
pt-query-digest: Analyze slow query log, general log, or tcpdump.
pt-query-digest /var/log/mysql/slow.log | head -200
pt-query-digest --output=json /var/log/mysql/slow.log > report.json
Gotcha: pt-online-schema-change works by creating a shadow copy of the table, applying the ALTER to the copy, setting up triggers to replay changes, then swapping the tables. On very large tables (100+ GB) this can take hours and temporarily doubles disk usage. Monitor disk space during the operation. Also, tables with foreign key constraints or triggers may not be compatible.
pt-online-schema-change: Non-blocking schema changes via shadow table + triggers.
# Add a column without locking the table
pt-online-schema-change \
--alter "ADD COLUMN last_seen DATETIME NULL" \
--host=primary.db.example.com \
--user=root --password=secret \
--execute \
D=mydb,t=users
# Dry run first
pt-online-schema-change --alter "..." --dry-run D=mydb,t=users
pt-table-checksum and pt-table-sync: Verify and repair replica drift.
# Check if replicas are in sync
pt-table-checksum --host=primary --user=root --password=secret \
--databases=mydb --tables=users
# Sync a drifted replica (run FROM primary, targeting replica)
pt-table-sync --execute h=primary,D=mydb,t=users h=replica1
7. Backups¶
mysqldump — logical backup, slow on large databases, safe for small ones.
# Full backup with consistent snapshot
# (MySQL 8.0.26+ renames --master-data to --source-data)
mysqldump --single-transaction --routines --triggers --events \
  --master-data=2 --all-databases \
  -u root -p > /backup/full_$(date +%Y%m%d_%H%M%S).sql
# Single database
mysqldump --single-transaction mydb | gzip > /backup/mydb_$(date +%F).sql.gz
# Restore
mysql -u root -p < /backup/full.sql
zcat /backup/mydb.sql.gz | mysql -u root -p mydb
Percona XtraBackup — physical hot backup, fast, works while MySQL is live.
# Full backup
xtrabackup --backup --user=backup --password=secret \
--target-dir=/backup/full
# Prepare (apply redo logs — must do before restore)
xtrabackup --prepare --target-dir=/backup/full
# Restore (server must be stopped)
systemctl stop mysql
rsync -avc --delete /backup/full/ /var/lib/mysql/
# alternative: xtrabackup --copy-back --target-dir=/backup/full (requires an empty datadir)
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql
# Incremental backup
xtrabackup --backup --incremental-basedir=/backup/full \
--target-dir=/backup/inc1
# Prepare incremental into full
xtrabackup --prepare --apply-log-only --target-dir=/backup/full
xtrabackup --prepare --apply-log-only --target-dir=/backup/full \
--incremental-dir=/backup/inc1
xtrabackup --prepare --target-dir=/backup/full # final prepare
8. Binary Log¶
-- Check binary log status
SHOW MASTER STATUS;
SHOW BINARY LOGS; -- lists all binlog files and sizes
-- Read binary log events
SHOW BINLOG EVENTS IN 'mysql-bin.000042' LIMIT 20;
-- Point-in-time recovery using mysqlbinlog
mysqlbinlog --start-datetime="2024-01-15 10:00:00" \
--stop-datetime="2024-01-15 10:30:00" \
/var/lib/mysql/mysql-bin.000042 | mysql -u root -p
# Purge old binlogs (replicas must have consumed them first)
PURGE BINARY LOGS TO 'mysql-bin.000040';
PURGE BINARY LOGS BEFORE '2024-01-01 00:00:00';
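The "replicas must have consumed them first" check can be automated: the safe purge target is the oldest binlog any replica still needs (Relay_Master_Log_File from each replica's status). A sketch, assuming the standard mysql-bin.NNNNNN naming:

```python
def safe_purge_target(replica_files: list[str]) -> str:
    """Oldest binlog still needed by any replica. It is safe to run
    PURGE BINARY LOGS TO this file (purge removes files *before* it)."""
    # mysql-bin.000042 -> 42: compare by the zero-padded numeric suffix
    return min(replica_files, key=lambda f: int(f.rsplit(".", 1)[1]))

# Relay_Master_Log_File reported by each replica
replicas = ["mysql-bin.000042", "mysql-bin.000040", "mysql-bin.000041"]
print(safe_purge_target(replicas))  # mysql-bin.000040
```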
9. Crash Recovery¶
# If MySQL won't start after crash, check error log first
# /var/log/mysql/error.log
# If InnoDB needs forced recovery:
[mysqld]
innodb_force_recovery = 1 # Start with 1, increment if it still fails
# Values 1-6:
# 1: ignore corrupt pages
# 2: prevent background threads from running
# 3: no transaction rollback
# 4: prevent INSERT buffer merge
# 5: no undo logs
# 6: no redo log (extreme — may produce corrupt data)
# With innodb_force_recovery set, dump your data immediately
mysqldump --all-databases > /backup/emergency_dump.sql
# Then rebuild from scratch and restore
10. performance_schema and Locking¶
-- Find currently running queries (MySQL 8.0.22+: prefer performance_schema.processlist)
SELECT * FROM information_schema.processlist
WHERE time > 10
ORDER BY time DESC;
-- Kill a long-running query
KILL QUERY 12345; -- kill just the query, keep connection
KILL 12345; -- kill the connection
-- Lock waits — who is blocking whom (MySQL 5.7 / MariaDB;
-- MySQL 8.0 removed information_schema.innodb_lock_waits: use sys.innodb_lock_waits there)
SELECT
r.trx_id waiting_trx_id,
r.trx_mysql_thread_id waiting_thread,
r.trx_query waiting_query,
b.trx_id blocking_trx_id,
b.trx_mysql_thread_id blocking_thread,
b.trx_query blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- Deadlock analysis
SHOW ENGINE INNODB STATUS\G
-- Look for the LATEST DETECTED DEADLOCK section
-- performance_schema: top queries by total time
SELECT digest_text, count_star, round(sum_timer_wait/1e12,3) AS total_sec,
round(avg_timer_wait/1e12,6) AS avg_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 20;
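performance_schema timers are in picoseconds, which is why the query above divides by 1e12. The same conversion and ranking in a monitoring script might look like this (the digest rows are illustrative):

```python
PICOS_PER_SEC = 1e12  # performance_schema timer unit

def top_digests(rows, n=3):
    """rows: (digest_text, count_star, sum_timer_wait in picoseconds).
    Returns (digest, calls, total_seconds), sorted by total time."""
    ranked = sorted(rows, key=lambda r: r[2], reverse=True)
    return [(d, c, w / PICOS_PER_SEC) for d, c, w in ranked[:n]]

rows = [
    ("SELECT * FROM orders WHERE ...", 50_000, 3.2e14),   # 320 s total
    ("INSERT INTO audit_log ...",     900_000, 1.1e14),   # 110 s total
]
for digest, calls, secs in top_digests(rows):
    print(f"{secs:8.1f}s  {calls:>8}x  {digest}")
```

Note that ranking by total time (sum_timer_wait) surfaces cheap-but-frequent queries that a per-query threshold like long_query_time would never catch.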
Quick Reference¶
# Connect
mysql -u root -p -h localhost
mysql -u appuser -p mydb -h db.example.com -P 3306
# Server status
mysqladmin -u root -p status
mysqladmin -u root -p processlist
mysqladmin -u root -p variables | grep buffer_pool
# Replication quick check (on replica)
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Running|Behind|Error"
# Backup
mysqldump --single-transaction mydb | gzip > mydb_$(date +%F).sql.gz
# Restore
zcat mydb.sql.gz | mysql mydb
# Error log location
SHOW VARIABLES LIKE 'log_error';
Wiki Navigation¶
Prerequisites¶
- Database Operations on Kubernetes (Topic Pack, L2)