MySQL replication slave hangs after encountering SET @@SESSION.GTID_NEXT= 'ANONYMOUS';
TL;DR: This is probably caused by poor table design combined with ROW-based replication.
I just ran into this problem. I was asked to move an old database to a new server and set up replication.
I found that it's not actually the statement in the subject that causes the slave to hang (SET @@SESSION.GTID_NEXT= 'ANONYMOUS'). This statement is issued at the beginning of a transaction.
mysql> SHOW BINLOG EVENTS IN 'mysql-bin.000196' FROM 96754384 LIMIT 5000;
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
| mysql-bin.000196 | 96754384 | Anonymous_Gtid | 1 | 96754449 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
| mysql-bin.000196 | 96754449 | Query | 1 | 96754537 | BEGIN |
| mysql-bin.000196 | 96754537 | Table_map | 1 | 96754608 | table_id: 241 (db.bad_table) |
| mysql-bin.000196 | 96754608 | Delete_rows | 1 | 96762805 | table_id: 241 |
| mysql-bin.000196 | 96762805 | Delete_rows | 1 | 96771002 | table_id: 241 |
...
| mysql-bin.000196 | 106681175 | Delete_rows | 1 | 106689372 | table_id: 241 |
| mysql-bin.000196 | 106689372 | Delete_rows | 1 | 106697569 | table_id: 241 |
| mysql-bin.000196 | 106697569 | Delete_rows | 1 | 106697626 | table_id: 241 flags: STMT_END_F |
| mysql-bin.000196 | 106697626 | Xid | 1 | 106697657 | COMMIT /* xid=28382600 */ |
| mysql-bin.000196 | 106697657 | Rotate | 1 | 106697704 | mysql-bin.000197;pos=4 |
+------------------+-----------+----------------+-----------+-------------+----------------------------------------------+
1219 rows in set (0.02 sec)
This table has 66 million rows. I found that it has no primary key or unique key. The query responsible for this uses an index scan on the master.
mysql> SHOW VARIABLES LIKE '%binlog_format%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW |
+---------------+-------+
1 row in set (0.00 sec)
For the slave to replicate this with ROW-based replication, it needs to perform approximately 1200 full table scans on the slave. 1200 is probably a fairly small number here. It could be in the hundreds of thousands. The replication does actually work, but with this design, 'seconds_behind_master' will grow indefinitely.
I will add a primary key and partitioning to this table. I will also ask my colleagues to rewrite their code so bulk deletes are no longer necessary. This probably requires adding an additional column.
EDIT: I don't have enough points to comment on other posts, so I will add my comments here for now. I believe that issuing 'SET GLOBAL sql_slave_skip_counter = 1', as mentioned by others, will skip the entire transaction and lead to data inconsistencies. Correct me if I'm wrong.
A quick fix would be to change the binlog format to QUERY or MIXED. These formats can also lead to data inconsistencies, so I would recommend finding and fixing the root cause instead of changing the binlog format.