案例 - 一个IP切换引发的数据不一致 _切换

曾无好事来相访，赖尔高文一起予。这篇文章主要讲述案例 - 一个IP切换引发的数据不一致相关的知识，希望能为你提供帮助。
业务说，为什么10号机房缺少这条数据，其他机房却有？

mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10; +------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+ | sid | tm_timestamp | tm_lasttime | gid | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version | __deleted | +------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+ | xxxxxxxxxx | 1495773704 | 1495773704 | xxxxxxxxxxx | 处对象 | 0 | 5 | 3611732366 | vx:wtc2033 | 0 | 18 | 8 | 0 | 0 | 1 | 0 | 0 | | 0 | 6126694332813803019 | 0 | +------------+--------------+-------------+---------------------+-------

大概断定，10号机房的数据同步是有问题的，先看这条记录，是从哪个机房插入的，然后再看10号机房与该机房之间的同步是否有问题，使用8827登录，获取这条数据的版本号__version，由函数转换得到这条数据，来自14号机房插入的，日期:2017-05-26 05:03:03 机房号:14 端口号:11

这相当于MySQL里的binlog，会记录每条SQL，来自于哪个server-id,目的是为了防止循环复制，myshard不仅在binlog记录server-id，每条记录都带有版本号，包含了从哪个机房，哪个端口写入的，什么时候写入的

到这里，知道14号机房写入的数据，无法同步到10号机房，可以去14号看一下同步命令

[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset" shard_local Read_offset 48494420885 shard_local Read_speed 33373 shard_local Read_bytes_behind 0 sync_r12m0 Read_offset 48494420885 sync_r12m0 Read_speed 33373 sync_r12m0 Read_bytes_behind 0 sync_r13m0 Read_offset 48494420885 sync_r13m0 Read_speed 33373 sync_r13m0 Read_bytes_behind 0 sync_r1m0 Read_offset 48494420885 sync_r1m0 Read_speed 33373 sync_r1m0 Read_bytes_behind 0 sync_r3m0 Read_offset 48494420885 sync_r3m0 Read_speed 33373 sync_r3m0 Read_bytes_behind 0 shard_remote Read_offset 52080697507 shard_remote Read_speed 27290 shard_remote Read_bytes_behind 0

发现没有r10m0这个机房来拉取数据，那证明同步有问题了，去10号机房看同步的日志，看到不断去重连14号机房这个点

[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2 May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2 May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2 May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2 May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2 May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0 May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1 May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2

看到有很多日志，不断重试去连接14号机房，其中最早的重连发生在

db_sync_xxxxxxxx_r10m0_d.log.13.gz

这个文件，而这个文件在5月14日记录的

-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz -rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz -rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz -rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz -rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz -rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz -rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz -rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz -rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz -rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz -rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz -rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz -rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz -rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz

一般重连只有2种可能，一个是14号机房没有开放白名单，不允许10号机房访问，但之前搭建成功，肯定白名单是开放了，很可能防火墙出问题，于是在14号机房，进行

iptables -n -L|grep 10号机房的IP

发现电信IP是开放了规则，但是联通的IP是没有开放防火墙规则，这是双线机房，而我在5月12日部署的环境，说明部署环境2天后，因为网络质量，电信通道无法连接，改为了联通通道了，而联通IP没有授权，这就导致10号机房无法顺利连接14号机房了，但是当时业务没有使用这个数据库，昨天5月25日，业务开始部署进程在14号机房，发现数据没同步，才找DBA的。我于是马上加入防火墙规则，然后重启同步进程，重新拉取数据，但10号机房还是在报错不断重连

然而在14号机房可以看到另外一个错误

May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132 May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132

一直在报一个位置点2587613132不存在，无法拉取...那是因为太久没有连接上，而14号机房的binlog只保留7天导致的，14号+7天=21号出问题，于是解决方案是把在14号机房，寻找存在但还没有被删除的位置点，让10号机房去拉数据，然后询问业务在14号机房有写入操作的表有哪些，然后把14号机房的表数据导出来，然后倒入到10号机房

myshard的好处是可以通过导数来去修补缺失的数据，而mysql只能用percona的修复工具，这也是给自己一个教训，在机房网络条件差的情况下，开通ip必须全部ip都开了，另外业务需要补充数据，事后开会总结了几个规则