mysql主备同步问题总结

mysql主备同步问题总结
一、能用命令修复的几种报错:
基础知识:mysql主备同步模式有三种:半同步、异步和双通道. 其中,半同步和异步同步模式在备库上执行show slave status\G的结果只有row1, 若结果里除了row1还有row2,同步模式则为双通道。
1.1 last_errno:1305错误
last_error:error ‘savepoint trans5 does not exist’ on query. Default database: ‘shop’. Query: ‘release savepoint’ trans5
解决方法:
在备库上执行:set global rds_query_event_filter = ‘RELEASE SAVEPOINT’; stop slave; start slave;
1.2 Last_SQL_Errno: 145
Last_SQL_Error: Worker 0 failed executing transaction ‘’ at master log mysql-bin.000080, end_log_pos 283450; Error ‘Table ‘./mysql/proc’ is marked as crashed and should be repaired’ on query. Default database: ‘yj_xjp_0324_01’. Query: 'DROP PROCEDURE IF EXISTS update_txyz_xgtl
Replicate_Ignore_Server_Ids:
内部表损坏,对系统内部进行修复即可:
stop slave;
repair table mysql.proc;
start slave;
1.3 Got fatal error 1236原因和解决方法
logevent超过max_allowed_packet 大小
1.3.1 Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the start event position from ‘mysql-bin.006730’ at 290066246, the last event was read from '/u01/my3309/log/mysql-bin.006730
【原因】
此类报错和max_allowed_packet参数的设置值相关。首先max_allowed_packet控制着主从复制过程中一个语句产生的二进制binlog event大小,它的值必须是1024的倍数 。出现此类错误的常见原因是
1) 该参数在主备库的配置大小不一样,主库的配置值大于从库的配置值。 从主库传递到备库的binlog event大小超过了主库或者备库的max_allowed_packet值大小。
2) 主库有大量数据写入时,如在主库上执行 load data,insert into … select 语句,产生大事务。
当主库向从库传递一个比从库的max_allowed_packet 大的packet ,从库接收该packet失败,并报 “log event entry exceeded max_allowed_packet“。
【如何解决】
需要确保主备配置一样,然后尝试调大该参数的值。
另外,5.6 版本中的 slave_max_allowed_packet_size 参数控制slave 可以接收的最大的packet 大小,该值通常大于而且可以覆盖 max_allowed_packet 的配置, 进而减少由于上面的问题导致主从复制中断。
RDS通常将这两个参数设置成了最大值1G . 如果无法调整,只能重搭备库
1). set global max_allowed_packet =110241024*1024;
2). stop slave;
3). start slave
1.3.2. Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the start event position from ‘mysql-bin.006730’ at 290066434, the last event was read from '/u01/my3309/log/mysql-bin.006730
【原因】
该错误发生在从库的io进程从主库拉取日志时发现主库的binlog文件已不存在了。出现此类报错可能是slave 由于某种原因停止了很长时间,主备断开时间太长导致当重启slave复制时,主库上对应的binlog已清理会报此类错误,从库获取不到对应的binglog file。
【如何解决】
为了避免数据丢失,需要对主库做全量物理备份后重搭slave 。
如果报错是第二通道,可用如下命令修复:
stop slave;
change master to master_auto_position=1;
start slave;
1.4 Last_IO_Error:Fatal error: failed to call ‘slaveReply’
一般数据库突然关闭,机器宕机直接重启、系统夯、网络短时间异常等都可能会导致这个问题发生
此错误的详细分析可参考此链接
https://yq.aliyun.com/articles/72505
start slave;即可修复
1.5 Slave_SQL_Running: No
Error_code:1050
Query caused different errors on master and slave. Error on master: message (format)=‘Table ‘%-.192s’ already exists’ error code=1050 ; Error on slave: actual message=‘no error’, error code=0. Default database: ‘ma3c3762’. Query: 'CREATE ALGORITHM=UNDEFINED DEFINER=shebaoxt@% SQL SECURITY DEFINER VIEW account_decl AS select kc92.AAZ354 AS zhsbid,kc92.AAZ158 AS rydjid,kc92.AAC001 AS ryid,kc92.AAZ039 AS dwdjid,kc92.AAB001 AS dwid,kc92.AAE140 AS xzbz,kc92.AAE002 AS ny,kc92.BAA044 AS jfrylb,kc92.BAC027 AS jzjs,kc92.BAC028 AS dwjzjs,kc92.AAE081 AS dwjze,kc92.AAE083 AS grjze,kc92.AAE347 AS jsly,kc92.AAE072 AS zdlsh,kc92.AAE346 AS bfzdlsh,kc92.BAE038 AS kjdefs,kc92.BKC209 AS dekje,kc92.BAE239 AS ttywxkje,kc92.AAE013 AS bz,kc92.AAE011 AS jbr,kc92.AAE036 AS jbsj,kc92.BKC220 AS kjhlbxfs,kc92.BKC221 AS hlbxkje from kc92
这是mysql官方复制的一个bug(之前问研发貌似还没修复,待确认)
单通道的版本无法跳过,只能重搭备库
双通道则可用如下命令修复:
STOP SLAVE; RESET SLAVE;
change master to master_auto_position=1;
start slave
1.6 Last_Errno: 1594
Last_Error: Relay log read failure: Trx too big to be delivered in parallel mode. please set slave_parallel_workers= 0 and start slave.
使用如下命令修复:
stop slave;
set global slave_parallel_workers= 0;
start slave;
1.7 Last_SQL_Errno: 1071
Last_SQL_Error: Error ‘Specified key was too long; max key length is 767 bytes’ on query. Default database: ‘’. Query: ‘alter table ipadcms.passkey engine=innodb ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8’
索引太长解决办法:
先查看主备库的表信息
打开innodb_large_prefix
show global variables like ‘innodb_large_prefix’;
set global innodb_large_prefix=on;
start slave;
show slave status\G;
1.8 slave_pending_jobs_size_max参数
last_error:cannot schedule event write_row_v1,relay-logname/home/mysql/dataxx/mysql/slave-relay.xxx,postion 74163438 to worker thread because its size 243233017 exceeds 167772160 of slave_pending_jobs_size_max.
在报错的角色上执行:set global slave_pending_jobs_size_max=167772160*4
1.9、doublebinlog row2 sql线程比较经常发生的可用命令修复的bug问题

处理的客户案例:1FA67DN-护城河V2.3.3alimonitor上报两个RDS实例存在check_ins_slave_sql_running.428告警,查询结果同步中断 发行时间:mysql_20161027 版本:5.6.16;
DG9K2T9_ 澳門智慧城市v3.3.4发行时间:mysql_20170920 版本:5.6.38-log;
5P9E2AW _公安部一所v3.5.2_实例主备不一致;
问题描述:Query caused different errors on master and slave. Error on master: message (format)=‘Cannot delete or update a parent row: a foreign key constraint
fails’ error code=1217 ; Error on slave: actual message=‘no error’, error code=0. Default database: ‘wisp_oauth’. Query: ‘DROP TABLE IF EXISTS sys_department /* generated by server */’;

Could not execute Delete_rows_v1 event on table ckts_dllt.ycsb_ls_hd_hsh; Can’t find record in ‘ycsb_ls_hd_hsh’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log mysql-bin.071260, end_log_pos 337827479
类似于这类日志重放错误,都属于这类bug,因为row2通道sql线程本身是不做日志重放的
问题原因:比较老的版本是有个bug,快速start slave; stop slave可能把同步通道的SQL Thread激活
是否可修复:可修复 stop slave for channel ‘#alibaba_rds_sync_channel#’;
reset slave for channel ‘#alibaba_rds_sync_channel#’;
stop slave; start slave;

stop slave; start slave; (不一定能拉起,拉不起则使用前者)
bug修复情况:具体版本不不详
第一通道报1032错误(6SA57NE Y项目v3.5.1):
stop slave;
set global slave_exec_mode=idempotent;
start slave;
处理的客户案例:219X3DD-Y项目V3.5.1-minirds主备同步中断
问题描述:Last_IO_Error: Fatal error: Failed to call ‘slaveReply’
Last_IO_Error: Fatal error: Failed to call ‘slaveReadSyncHeader’
问题的原因:SemiSync必须在stop slave的情况下修改semisync变量,但是管控会在running的时候修改
针对这个问题有ISSUE:MySQL的Issue里有这个记录,很早就之前就发现了,解决方案是MySQL限制不允许在Slave Runing时修改;
另外此类问题在网络状态不好以及非常规操作的情况下也会发生(如实例所在主机直接关机);
aurora是有这个逻辑不对 主动下发参数修改也会导致
是否可修复:可修复 start slave; stop slave
bug修复情况:管控的代码v3.8修复
3.doublebinlog row2 io线程1236 目前原因不详
问题描述: Error reading packet from server for channel ‘#alibaba_rds_sync_channel#’: binlog truncated in the middle of event; consider out of disk space on master;
the first event ‘mysql-bin.067488’ at 154430, the last event read from ‘/home/mysql/data3001/mysql/mysql-bin.067488’ at 222179269, the last byte read from ‘/home/mysql/data3001/mysql/mysql-bin.067488’ at 222179328. (server_errno=1236)
处理原则:此问题可以先叫驻场同学先收集主备错误的,以及报错信息被读取主库binlog发生错误的binlog日志(上面报错示例主库的binlog为’/home/mysql/data3001/mysql/mysql-bin.067488’),如果报错时间过长,在本地已经被清理,可看看重oss上是否可以下载。
目前的处理情况:技术专家目前怀疑并行复制的线程bug影响了第二通道的日志读取,导致1236,目前已经在海口城市大脑几个比较经常出现此问题几个实例上做了参数修改,已验证是否是因为这个原因
stop slave;
reset slave for channel ‘#alibaba_rds_sync_channel#’;
start slave;
二、不能用命令修复的报错
第一通道报错:
1、1236
2、Slave_IO_Running: No ,
Last_IO_Error: Relay log write failure: could not queue event from master(日志不连续)
3、Slave_SQL_Running: No,
Last_Errno: 1032,
Last_Error: Could not execute Update_rows_v1 event on table prod_gdhb_dhb_gd_bpcb_0017.info_wchart_verify_code; Can’t find record in ‘info_wchart_verify_code’, Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event’s master log mysql-bin.007919, end_log_pos 83489431
【mysql主备同步问题总结】4、备库的第一通道:sql线程no,last_error:1864。
cannot schedule event update_rowa_vlrelay-log name /home/mysql/data3019/mysql/slave-relay.0003356.position 13897122 to worker thread bacaus its size 19460775 exceeds 16777216 of slave_pending_jobs_size_max.
处理步骤:
stop slave;
set global slave_pending_jobs_size_max=29360128; (具体数值和DBA商议)
start slave;

    推荐阅读