MySQL / MariaDB Operations Footguns¶
1. Running ALTER TABLE on a large table in production without pt-osc¶
You run ALTER TABLE orders ADD COLUMN notes TEXT; on a 200 GB table during business hours. MySQL acquires a metadata lock. All reads and writes to orders queue behind it. Application times out after 30 seconds. Traffic backs up. You've just caused a production outage.
Fix: Always use pt-online-schema-change or gh-ost for schema changes on tables larger than a few GB. For MySQL 8.0+, check if the operation supports ALGORITHM=INSTANT first (many simple column additions do).
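Requesting the algorithm explicitly makes MySQL fail fast instead of silently falling back to a blocking table copy. A sketch for MySQL 8.0.12+ (which operations qualify as instant varies by minor version):

```sql
-- If this change cannot be done as a metadata-only operation,
-- the statement errors out instead of quietly rebuilding the table:
ALTER TABLE orders ADD COLUMN notes TEXT, ALGORITHM=INSTANT;
```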
War story: GitHub built gh-ost specifically because pt-online-schema-change uses triggers, which caused replication lag spikes on their large MySQL fleet. gh-ost reads the binary log instead of using triggers, giving it a lighter footprint on the primary. If you run replicas at high utilization, gh-ost is the safer choice for large tables.
2. Skipping --single-transaction in mysqldump¶
You run mysqldump mydb > backup.sql without --single-transaction. mysqldump acquires a global read lock (FLUSH TABLES WITH READ LOCK) at the start, blocking all writes for the duration of the dump. On a large database, this can be minutes to hours of total write downtime.
Fix: Always use --single-transaction for InnoDB tables. This dumps inside a consistent MVCC snapshot without locking.
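A minimal invocation might look like this (database name and output path are placeholders):

```shell
# Dump inside a consistent MVCC snapshot; writes continue during the dump
mysqldump --single-transaction mydb | gzip > /backups/mydb.sql.gz
```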
Note: Under the hood, --single-transaction works by issuing START TRANSACTION WITH CONSISTENT SNAPSHOT before reading any tables. InnoDB's MVCC gives the dump a point-in-time view without locks. However, DDL statements (ALTER TABLE, DROP TABLE) during the dump can still break the snapshot on MySQL 5.7. MySQL 8.0 added metadata lock protection to prevent this.
--single-transaction does not work for MyISAM tables (use XtraBackup for mixed engines).
3. Doing writes on a replica¶
Your application has a bug (or misconfigured connection string) that sends writes to the replica. The replica applies them locally. Now the replica has rows the primary doesn't. Replication breaks with a duplicate key error the next time the primary tries to insert the same row.
Fix: Set read_only = ON and super_read_only = ON on all replicas:
[mysqld]
read_only = ON
super_read_only = ON # prevents even SUPER users from writing (MySQL 5.7.8+)
Additionally, monitor the Com_insert / Com_update / Com_delete status counters on replicas via alerting to catch stray writes.
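On a healthy replica these counters should stay flat; a steadily rising value means something is writing to it. A quick check:

```sql
-- Run on the replica; compare values over time via your
-- monitoring system rather than a one-off read:
SHOW GLOBAL STATUS
 WHERE Variable_name IN ('Com_insert', 'Com_update', 'Com_delete');
```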
4. Ignoring Seconds_Behind_Master = NULL¶
SHOW SLAVE STATUS shows Seconds_Behind_Master: NULL and you assume it means "no lag", or at worst "lag can't be measured right now". Actually NULL means the SQL thread is not running — replication is completely stopped.
Fix: Distinguish NULL from 0. Write monitoring that treats NULL as a critical alert. The Slave_SQL_Running field will show No when the lag is NULL — check both fields.
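A stopped replica typically looks like this (output is illustrative, trimmed to the relevant fields):

```sql
SHOW SLAVE STATUS\G
-- ...
-- Slave_IO_Running: Yes
-- Slave_SQL_Running: No          <-- SQL thread stopped
-- Seconds_Behind_Master: NULL    <-- NULL because the thread is down
```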
Gotcha: Seconds_Behind_Master measures the difference between the replica's SQL thread timestamp and the relay log event timestamp — not real-time lag. If the replica's I/O thread is caught up but the SQL thread is paused or slow, the number can be misleadingly low. For accurate lag measurement, use pt-heartbeat, which inserts timestamps on the primary and measures actual delivery time on replicas.
5. Running KILL on the wrong thread and stopping replication¶
You have connection ID 42 in PROCESSLIST that you want to kill. You run KILL 42. But thread 42 is the slave SQL thread. Replication stops. Slave_SQL_Running goes to No.
Fix: Before killing, verify what a thread is doing:
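One way to inspect a connection before killing it (MySQL 5.7+; on older versions use SHOW FULL PROCESSLIST):

```sql
-- Confirm who owns connection 42 and what it is running
-- before issuing KILL 42:
SELECT id, user, command, state, info
  FROM information_schema.processlist
 WHERE id = 42;
```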
Never kill threads with user = 'system user' or command = 'Binlog Dump' — these are replication threads.
6. Using PURGE BINARY LOGS before replicas have consumed them¶
You're running low on disk and run PURGE BINARY LOGS BEFORE NOW();. A replica that was offline for maintenance (or just slow) needed those logs to catch up. Now it can't reconnect and needs a full resync.
Fix: Check all replica positions before purging:
SHOW SLAVE HOSTS; -- list connected replicas
-- For each replica, connect and run:
SHOW SLAVE STATUS\G -- note Master_Log_File and Read_Master_Log_Pos
-- Only purge logs older than the oldest replica position
Better: set expire_logs_days (or binlog_expire_logs_seconds on MySQL 8.0+) and let MySQL auto-purge after confirming replicas are caught up.
7. Setting innodb_flush_log_at_trx_commit = 0 and not documenting it¶
You set innodb_flush_log_at_trx_commit = 0 for write throughput during a bulk load and forget to revert it. The server runs for months with this setting. A crash loses up to 1 second of committed transactions with no warning.
Fix: After bulk loads, always revert to 1 (full ACID):
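The revert is a one-liner, but remember to update my.cnf as well so the durable setting survives a restart:

```sql
-- Full durability: flush and fsync the redo log at every commit
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```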
8. Not testing backup restores¶
You've been taking mysqldump backups daily for 6 months. A disaster strikes. You try to restore and discover the dump file was silently corrupted (truncated gzip, failed export mid-stream due to disk full, etc.). Your backup is useless.
Fix: Test restores regularly:
1. Schedule a monthly restore to a test instance
2. Verify checksums: md5sum backup.sql.gz before and after transfer
3. Test the restore process: zcat backup.sql.gz | mysql testdb
4. Run pt-table-checksum against the restored copy
5. Use XtraBackup which has built-in integrity verification
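Steps 2–3 above can be scripted as a simple drill (hostnames and paths are placeholders):

```shell
# Catches truncated or corrupted archives before you need them
gzip -t /backups/mydb.sql.gz

# Restore into a throwaway instance, never the production server
zcat /backups/mydb.sql.gz | mysql -h restore-test testdb
```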
9. Long-running transactions causing undo log bloat¶
A batch job starts a transaction, processes rows for 2 hours, then commits. Meanwhile, InnoDB keeps all row versions generated during those 2 hours in the undo log. ibdata1 (the system tablespace) grows to 50 GB. Unlike file-per-table data files, the system tablespace cannot be shrunk without rebuilding the instance.
Fix: Break large batch operations into smaller chunks, each with its own commit:
-- Instead of one giant transaction:
-- START TRANSACTION;
-- ... process millions of rows ...
-- COMMIT;

-- Do this. WHILE loops only run inside stored routines,
-- so wrap the chunked loop in a procedure:
DELIMITER $$
CREATE PROCEDURE process_orders_in_chunks(IN max_id INT)
BEGIN
  DECLARE start_id INT DEFAULT 1;
  WHILE start_id <= max_id DO
    START TRANSACTION;
    UPDATE orders SET status = 'processed'
     WHERE id BETWEEN start_id AND start_id + 999;
    COMMIT;  -- each 1000-row chunk commits separately
    SET start_id = start_id + 1000;
  END WHILE;
END$$
DELIMITER ;
10. Relying on SET GLOBAL for configuration without persisting to my.cnf¶
You increase max_connections at runtime: SET GLOBAL max_connections = 500. MySQL restarts 3 months later during a kernel update. It comes back with max_connections = 151 (the default). Application fails immediately.
Fix: Always update both the runtime value AND my.cnf. Better yet, use MySQL 8.0's SET PERSIST:
SET PERSIST max_connections = 500; -- writes to mysqld-auto.cnf, survives restart
SET PERSIST innodb_buffer_pool_size = 12884901888; -- 12 GB
11. Using SELECT * in application queries¶
Your application queries SELECT * FROM users WHERE id = 123. The DBA adds a profile_photo LONGBLOB column. Every user lookup now transfers megabytes over the wire and saturates buffer pool pages with binary data your application never uses.
Fix: Always select only needed columns in application queries. This also makes index-only scans possible. Monitor with pt-query-digest to catch regressions.
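An explicit column list keeps later-added BLOB columns out of hot paths and can enable covering-index reads (column names here are illustrative):

```sql
-- Fetch only what the application uses; a LONGBLOB column added
-- later will not ride along with every lookup:
SELECT id, email, display_name FROM users WHERE id = 123;
```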
12. Not setting max_allowed_packet high enough for large payloads¶
Application inserts a JSON payload larger than max_allowed_packet (default 16 MB in MySQL 5.7, 64 MB in 8.0). Gets ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes. Confusingly, this can also break replication if the primary accepted a large write but the replica's max_allowed_packet is smaller.
Fix: Set consistently on primary and all replicas:
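A my.cnf fragment for every server in the topology (128M is an example value; size it above your largest expected payload):

```ini
[mysqld]
# Must match on the primary and every replica
max_allowed_packet = 128M
```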
Set the same value everywhere — if the primary accepts a 100 MB row and the replica has a 16 MB limit, replication breaks.