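You are welcome to follow my series "A Deep Dive into MySQL Master-Slave Replication: 32 Lectures".
This case is a production problem that my friend @peaceful ran into, and he tracked down the decisive clue himself. The symptoms: the slave kept producing small relay log files, several per minute, and the master's error log kept reporting that it was killing a zombie dump thread, as the two listings below show: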
-rw-r----- 1 mysql dba 12827 Oct 11 12:28 mysql-relay-bin.036615
-rw-r----- 1 mysql dba 4908 Oct 11 12:28 mysql-relay-bin.036616
-rw-r----- 1 mysql dba 1188 Oct 11 12:28 mysql-relay-bin.036617
-rw-r----- 1 mysql dba 5823 Oct 11 12:29 mysql-relay-bin.036618
-rw-r----- 1 mysql dba 507 Oct 11 12:29 mysql-relay-bin.036619
-rw-r----- 1 mysql dba 1188 Oct 11 12:29 mysql-relay-bin.036620
-rw-r----- 1 mysql dba 3203 Oct 11 12:29 mysql-relay-bin.036621
-rw-r----- 1 mysql dba 37916 Oct 11 12:30 mysql-relay-bin.036622
-rw-r----- 1 mysql dba 507 Oct 11 12:30 mysql-relay-bin.036623
-rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036624
-rw-r----- 1 mysql dba 4909 Oct 11 12:31 mysql-relay-bin.036625
-rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036626
-rw-r----- 1 mysql dba 507 Oct 11 12:31 mysql-relay-bin.036627
-rw-r----- 1 mysql dba 507 Oct 11 12:32 mysql-relay-bin.036628
-rw-r----- 1 mysql dba 1188 Oct 11 12:32 mysql-relay-bin.036629
-rw-r----- 1 mysql dba 454 Oct 11 12:32 mysql-relay-bin.036630
-rw-r----- 1 mysql dba 6223 Oct 11 12:32 mysql-relay-bin.index
2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID , found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421).
2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4)
2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID , found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425).
2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4)
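At first glance I also found this case strange, because very few people ever set slave_net_timeout. We have never set it ourselves, so I had paid it little attention. But @peaceful found the suspicious setting on his own: on this slave, slave_net_timeout was set to 10. I followed that clue, so let's first look at what slave_net_timeout does.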
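1. It is the number of seconds the slave's IO thread waits for data or a heartbeat from the master before it treats the connection as broken and reconnects.
2. If change master does not specify MASTER_HEARTBEAT_PERIOD, the heartbeat period is set to slave_net_timeout/2.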
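We usually do not specify the heartbeat period when setting up replication, so it ends up as slave_net_timeout/2. It controls how often the master sends a heartbeat Event to the slave's IO thread when no Event is being generated on the master, in order to keep the connection alive. However, once replication has been configured (change master), this value is fixed and does not follow later changes to slave_net_timeout. The stored setting can be found in the slave_master_info table: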
mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.01 sec)
To change this value we have to run change master again.
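To make the rule concrete, here is a minimal Python sketch of how the two values drift apart. It is a toy model, not MySQL code: the class `ReplicaConfig` and its method names are hypothetical, and it only models the behaviour described above (the heartbeat defaults to slave_net_timeout/2 at change master time and is not touched afterwards). The upper cap SLAVE_MAX_HEARTBEAT_PERIOD is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ReplicaConfig:
    """Toy model of the two settings (hypothetical names, not MySQL code)."""
    slave_net_timeout: float = 60.0  # seconds
    heartbeat_period: float = 0.0    # seconds, fixed when change master runs

    def change_master(self, heartbeat_period=None):
        # Without MASTER_HEARTBEAT_PERIOD the default is slave_net_timeout / 2.
        if heartbeat_period is None:
            heartbeat_period = self.slave_net_timeout / 2
        self.heartbeat_period = heartbeat_period

    def set_slave_net_timeout(self, seconds):
        # Changing the timeout does not update the stored heartbeat period.
        self.slave_net_timeout = seconds

    def heartbeat_arrives_in_time(self):
        # On an idle master the IO thread stays connected only if a heartbeat
        # arrives before slave_net_timeout expires (0 means heartbeats are off).
        return 0 < self.heartbeat_period < self.slave_net_timeout


cfg = ReplicaConfig(slave_net_timeout=60)
cfg.change_master()                      # heartbeat becomes 30
cfg.set_slave_net_timeout(10)            # heartbeat is still 30
print(cfg.heartbeat_period, cfg.heartbeat_arrives_in_time())
cfg.change_master()                      # only now is the heartbeat recomputed
print(cfg.heartbeat_period, cfg.heartbeat_arrives_in_time())
```

The second `change_master()` call is what brings the heartbeat back below the timeout in this model.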
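So when slave_net_timeout is lowered below the heartbeat period already stored by change master (for example a timeout of 10 against a heartbeat of 30) and the master is idle, the IO thread times out and disconnects before the master's heartbeat Event ever reaches it. After disconnecting, the IO thread reconnects, and every reconnect creates a new relay log. Because the slave is lagging, these relay logs cannot be purged, which produces exactly the situation seen in this case.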
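The timing argument can be checked with a small simulation. This is only a sketch under simplifying assumptions: the master generates no Events at all, a reconnect is instantaneous, and the new dump thread restarts its heartbeat timer at the moment of reconnection. The function name `simulate_idle_master` is made up for illustration.

```python
def simulate_idle_master(slave_net_timeout, heartbeat_period, duration):
    """Return the times (in seconds) at which the IO thread reconnects.

    Each reconnect corresponds to one new relay log file in this model.
    """
    if slave_net_timeout <= 0 or heartbeat_period < 0:
        raise ValueError("timeout must be > 0 and heartbeat must be >= 0")

    reconnects = []
    last_data = 0.0  # last moment the IO thread received anything
    # A heartbeat period of 0 disables heartbeats altogether.
    next_heartbeat = heartbeat_period if heartbeat_period > 0 else float("inf")

    while True:
        deadline = last_data + slave_net_timeout
        if next_heartbeat < deadline:
            # The heartbeat wins the race and resets the IO thread's timer.
            if next_heartbeat > duration:
                break
            last_data = next_heartbeat
            next_heartbeat += heartbeat_period
        else:
            # The timeout wins: the IO thread drops the connection and reconnects.
            if deadline > duration:
                break
            reconnects.append(deadline)
            last_data = deadline
            if heartbeat_period > 0:
                next_heartbeat = deadline + heartbeat_period
    return reconnects


print(simulate_idle_master(10, 30, 60))   # timeout below heartbeat
print(simulate_idle_master(60, 30, 600))  # default pairing
```

With a timeout of 10 and a heartbeat of 30 the model reconnects every 10 seconds, while the default pairing of 60 and 30 never reconnects.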
If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen
as the value of the Heartbeat column of the mysql.slave_master_info table.
Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the
value of slave_net_timeout divided by 2.
Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval
results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to
reset it to the default value.
mysql> show variables like '%slave_net_timeout%';
| Variable_name | Value |
| slave_net_timeout | 60 |
1 row in set (0.01 sec)
mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.00 sec)
stop slave sql_thread;
mysql> set global slave_net_timeout=10;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show warnings;
| Level | Code | Message |
| Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. |
1 row in set (0.00 sec)
mysql> stop slave ;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave io_thread;
Query OK, 0 rows affected (0.01 sec)
A new relay log file is now generated roughly every 10 seconds:
-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:32.655001361 +0800 relay.000142
-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:42.943001355 +0800 relay.000143
-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:53.293001363 +0800 relay.000144
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:03.502000598 +0800 relay.000145
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:13.799001357 +0800 relay.000146
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:24.055001354 +0800 relay.000147
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:34.280001827 +0800 relay.000148
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:44.496001365 +0800 relay.000149
-rw-r----- 1 mysql mysql 500 2019-09-27 23:49:54.789001353 +0800 relay.000150
-rw-r----- 1 mysql mysql 500 2019-09-27 23:50:05.485001371 +0800 relay.000151
-rw-r----- 1 mysql mysql 500 2019-09-27 23:50:15.910001430 +0800 relay.000152
2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216).
2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4)
2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217).
2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4)
2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218).
2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4)
2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219).
2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4)
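The 10 second rhythm can also be read directly from the master's error log. The sketch below parses the four "zombie dump thread" lines shown above and prints the gaps between them. The helper name `zombie_kills` and the regular expression are my own and only match this particular message format.

```python
import re
from datetime import datetime

# The four zombie-kill lines from the master error log above.
LOG = """\
2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216).
2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217).
2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218).
2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219).
"""

PATTERN = re.compile(
    r"^(\S+) \d+ \[Note\] .*killing the zombie dump thread\((\d+)\)"
)


def zombie_kills(log_text):
    """Return (timestamp, killed_thread_id) for every zombie-kill line."""
    kills = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match:
            kills.append((datetime.fromisoformat(match.group(1)),
                          int(match.group(2))))
    return kills


kills = zombie_kills(LOG)
gaps = [round((b[0] - a[0]).total_seconds(), 2)
        for a, b in zip(kills, kills[1:])]
print(gaps)  # -> [10.27, 10.32, 10.26]
```

The gaps sit just above 10 seconds, which is consistent with slave_net_timeout=10 plus a little time spent reconnecting.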
-> mysql_options
mysql->options.connect_timeout= *(uint*) arg;
-> get_vio_connect_timeout
timeout_sec= mysql->options.connect_timeout;
Every time change master is run on the slave, this value is set as follows, defaulting to slave_net_timeout/2:
mi->heartbeat_period= min(SLAVE_MAX_HEARTBEAT_PERIOD,
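So only change master resets this value. Restarting replication does not.
Each time the IO thread starts, it passes this value to the master's DUMP thread, presumably by building the statement 'SET @master_heartbeat_period', as follows: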
if (mi->heartbeat_period != 0.0)
char llbuf[22];
const char query_format[]= "SET @master_heartbeat_period= %s";
char query[sizeof(query_format) - 2 + sizeof(llbuf)];
user_var_entry *entry=
(user_var_entry*) my_hash_search(&m_thd->user_vars, (uchar*) name.str,
m_heartbeat_period= entry ? entry->val_int(&null_value) : 0;
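set_timespec_nsec(&ts, m_heartbeat_period); // heartbeat timeout
ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts); // wait
if (ret != ETIMEDOUT && ret != ETIME) // woken by a signal rather than a timeout: a new Event has arrived; on timeout a heartbeat Event is sent instead
break; // 0 is returned normally; on timeout ETIMEDOUT is returned and the loop continues
if (send_heartbeat_event(log_pos)) // send the heartbeat Event
return 1;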
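The wait loop above follows a common pattern: block on a timed wait, and treat a timeout as the cue to send a heartbeat. Below is a rough Python analogue of that pattern, not a port of the MySQL code. A `queue.Queue` stands in for the binary log, and `wait_for_event`, `HEARTBEAT` and `max_heartbeats` are hypothetical names. Unlike the original loop, this sketch returns the Event itself rather than just leaving the loop.

```python
import queue

HEARTBEAT = "HEARTBEAT"


def wait_for_event(binlog, heartbeat_period, send, max_heartbeats=None):
    """Wait for the next Event, sending a heartbeat on every timeout.

    binlog           -- queue.Queue standing in for the binary log
    heartbeat_period -- seconds to wait before sending a heartbeat
    send             -- callable that delivers something to the slave
    max_heartbeats   -- stop after this many heartbeats (sketch only, so the
                        example terminates); None means wait forever
    """
    heartbeats = 0
    while max_heartbeats is None or heartbeats < max_heartbeats:
        try:
            # A normal return means a new Event arrived before the timeout.
            return binlog.get(timeout=heartbeat_period)
        except queue.Empty:
            # Timed out with nothing to send: emit a heartbeat and keep waiting.
            send(HEARTBEAT)
            heartbeats += 1
    return None


sent = []
idle_binlog = queue.Queue()
print(wait_for_event(idle_binlog, 0.01, sent.append, max_heartbeats=3), sent)
```

If the slave gives up after 10 seconds while this loop only wakes every 30 seconds, the heartbeat is never sent in time, which is the mismatch described in this article.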
Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid);
THD *tmp= Global_THD_manager::get_instance()->
if (tmp)
Here we do not call kill_one_thread() as
it will be slow because it will iterate through the list
again. We just to do kill the thread ourselves.
if (log_warnings > 1)
if (slave_uuid.length())
sql_print_information("While initializing dump thread for slave with "
"UUID <%s>, found a zombie dump thread with the "
"same UUID. Master is killing the zombie dump "
"thread(%u).", slave_uuid.c_ptr(),
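The zombie check above boils down to "one dump thread per slave UUID". The following Python sketch illustrates that idea with a plain dictionary. It is an assumption-laden model rather than the server's logic: `ToyMaster`, `start_dump` and the starting thread id are invented for the example, and the real code searches the global thread list instead of a dictionary.

```python
class ToyMaster:
    """Toy model: at most one dump thread per slave UUID (hypothetical)."""

    def __init__(self, first_thread_id=216):
        self.dump_threads = {}          # slave UUID -> dump thread id
        self.next_thread_id = first_thread_id
        self.log = []

    def start_dump(self, slave_uuid):
        thread_id = self.next_thread_id
        self.next_thread_id += 1
        zombie = self.dump_threads.get(slave_uuid)
        if zombie is not None:
            # The slave reconnected while its old dump thread was still
            # registered on the master, so the old one is killed.
            self.log.append(
                "found a zombie dump thread with the same UUID. Master is "
                "killing the zombie dump thread(%d)." % zombie
            )
        self.dump_threads[slave_uuid] = thread_id
        return thread_id


master = ToyMaster()
uuid = "010fde77-2075-11e9-ba07-5254009862c0"
for _ in range(3):                      # three connections from the same slave
    master.start_dump(uuid)
print("\n".join(master.log))
```

In this model every reconnect from the same UUID after the first produces one log line, matching the one-message-per-reconnect pattern in the error log.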
Author's WeChat: gp_22389860