MySQL中Binlog日志应用慢，该怎么办？学习心得分享

今天有一个业务需求，需要进行数据恢复操作，需要恢复到2021-07-11 15:21:00，大家应该都了解，这种基于时间点的恢复，首先通过物理备份将mysql全量恢复到异机中，然后再进行增量恢复binlog，从而实现基于时间点的恢复；

业务环境数据量大小150G左右，按照之前做的大量的随机恢复测试总时间分析看，150G的数据量恢复大概可以控制在30min内完成；

正常情况下，1G的binlog应用时间大概在1~3min左右，但是在本次恢复应用binlog的过程，花费了将近15min还没有结束，导致整个恢复时间40多分钟还没有结束；这种情况别说业务人员不能接受，作为DBA估计不能接受吧！

下面就来就针对Binlog回放慢的问题做一个简单的分析：

问题现象:

登录到恢复实例上看一下目前的应用状态，查看到线程目前一直处于Waiting for GTID to be committed的状态，从该状态看，是在等待GTID提交，GTID为548723ca-1f7f-11e9-b3ab-005056b748c5:1655852015

mysql>show processlist;
+----+-----------+-----------------+------+---------+--------+----------------------------------+----------------------------------------------------------------------------+-----------+---------------+
| Id | User      | Host            | db   | Command | Time   | State                            | Info                                                                       | Rows_sent | Rows_examined |
+----+-----------+-----------------+------+---------+--------+----------------------------------+----------------------------------------------------------------------------+-----------+---------------+
|  6 | dba_admin | 127.0.0.1:6492  | NULL | Sleep   | 218132 |                                  | NULL                                                                       |         0 |             0 |
| 16 | dba_admin | 127.0.0.1:11286 | NULL | Query   | 218078 | Waiting for GTID to be committed | SET @@SESSION.GTID_NEXT= '548723ca-1f7f-11e9-b3ab-005056b748c5:1655852015' |         0 |             0 |
| 17 | dba_admin | 127.0.0.1:11294 | NULL | Query   |      0 | starting                         | show processlist                                                           |         0 |             0 |
+----+-----------+-----------------+------+---------+--------+----------------------------------+----------------------------------------------------------------------------+-----------+---------------+
3 rows in set (0.00 sec)

问题分析：

1、首先检查恢复机资源使用情况，特别是IO，经过查看，发现恢复机的负载非常的低，资源非常的空闲，所以应该不是资源繁忙导致的

2、那有可能是大事务导致binlog应用的比较慢，接下来分析下binlog的中是否有大事务

$ mysqlbinlog mysqlbin.002032 | grep "GTID$(printf '\t')last_committed" -B 1 | grep -E '^# at' | awk '{print $3}'| awk 'NR==1 {tmp=$1} NR>1 {print ($1-tmp);tmp=$1}'| sort -n -r | head -n 10
mysqlbinlog: [Warning] unknown variable 'loose-default-character-set=utf8'
203995532
518143
518103
518055
518045
518043
518037
518035
518033
518023

从结果看，竟然有一个200M的大事务，我的天哪，估计就是这个大事务导致回放比较慢，那到底是不是这个大事务导致的呢？

3、接下来根据长时间运行线程状态提供的GTID的信息，解析下binlog文件查看下卡主的GTID事务具体在执行什么操作？

# at 156880949
#210711  2:00:05 server id 255  end_log_pos 156881014 CRC32 0x9c02b320  GTID    last_committed=127167   sequence_number=127168
SET @@SESSION.GTID_NEXT= '548723ca-1f7f-11e9-b3ab-005056b748c5:1655852015'/*!*/;
# at 156881014
#210711  2:00:05 server id 255  end_log_pos 156881077 CRC32 0xf27b8fdc  Query   thread_id=55553520      exec_time=54    error_code=0
SET TIMESTAMP=1625940005/*!*/;
BEGIN
/*!*/;
# at 156881077
#210711  2:00:05 server id 255  end_log_pos 156881203 CRC32 0xf297d5f2  Rows_query
# delete from tabname where create_time <= date_sub(CURRENT_TIMESTAMP, interval 24 hour)
# at 156881203
#210711  2:00:05 server id 255  end_log_pos 156881301 CRC32 0xa101984d  Table_map: `test_db`.`tabname` mapped to number 110
# at 156881301
#210711  2:00:05 server id 255  end_log_pos 156889457 CRC32 0xbe518068  Delete_rows: table id 110
# at 156889457
#210711  2:00:05 server id 255  end_log_pos 156897608 CRC32 0x54e957f0  Delete_rows: table id 110
# at 156897608
#210711  2:00:05 server id 255  end_log_pos 156905781 CRC32 0x8d2612ad  Delete_rows: table id 110
# at 156905781
#210711  2:00:05 server id 255  end_log_pos 156913928 CRC32 0xb15a94ea  Delete_rows: table id 110
# at 156913928
#210711  2:00:05 server id 255  end_log_pos 156922097 CRC32 0x6393fa7c  Delete_rows: table id 110
# at 156922097
#210711  2:00:05 server id 255  end_log_pos 156930266 CRC32 0x2b3d1fda  Delete_rows: table id 110
# at 156930266
#210711  2:00:05 server id 255  end_log_pos 156938414 CRC32 0x78874052  Delete_rows: table id 110
# at 156938414
#210711  2:00:05 server id 255  end_log_pos 156946567 CRC32 0xb67779fe  Delete_rows: table id 110
# at 156946567
#210711  2:00:05 server id 255  end_log_pos 156954729 CRC32 0x0c0f8899  Delete_rows: table id 110
# at 156954729
#210711  2:00:05 server id 255  end_log_pos 156962870 CRC32 0x5f79436d  Delete_rows: table id 110

从解析后的binlog可以发现，是一个delete操作， delete from tabname where create_time <= date_sub(CURRENT_TIMESTAMP, interval 24 hour)，该SQL删除24小时之前的所有数据，而该操作正是那个200M的事务，到这里就知道了具体的原因，那么该问题该如何解决呢？

解决方案：

1、那就接着等呗，总会应用结束，这你能接受，业务人员估计也是不能接受的；

2、临时调整sync_binlog和innodb_flush_log_at_trx_commit参数，从双1调整成0，之后回放速度就会快很多；

备注：反正是在异机恢复，安全不是重要的，线上环境谨慎考虑；

好了，就先说这么多吧，大家有更好的方法欢迎留言一起学习交流；