On Data Loss During Failover with PostgreSQL Synchronous Replication
Background
MySQL 5.7 enhanced semi-synchronous replication by adding the rpl_semi_sync_master_wait_point parameter, which controls when the master waits for the slave's acknowledgement.
The default is AFTER_SYNC, under which the master commits a transaction in the following order:
1. InnoDB prepare
2. Write the transaction to the binlog and send it to the slave
3. Flush the binlog to disk
4. Wait for the slave's acknowledgement
5. InnoDB commit
6. Return to the client
The other value of rpl_semi_sync_master_wait_point is AFTER_COMMIT, which is also the behaviour of MySQL 5.6 and earlier:
1. InnoDB prepare
2. Write the transaction to the binlog and send it to the slave
3. Flush the binlog to disk
4. InnoDB commit
5. Wait for the slave's acknowledgement
6. Return to the client
In short, AFTER_SYNC and AFTER_COMMIT differ only in that steps 4 and 5 are swapped.
The main point of AFTER_SYNC is to avoid a problem present in MySQL 5.6 and earlier: during the window in which a transaction has already been committed to the storage engine but the slave's acknowledgement has not yet arrived, other sessions can see the update. If the master crashes in that window and a slave is promoted to the new master, the transaction may be lost, and for any session that had already seen its changes this amounts to "data loss".
For details see: http://dev.mysql.com/doc/refman/5.7/en/replication-semisync.html
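As a side note, it is easy to confirm which wait point a MySQL 5.7 master is actually using. The following is only a minimal sketch, assuming the semi-synchronous master plugin is loaded and using the pymysql package (an assumption; any MySQL client library would do); host and credentials are placeholders.
#!/usr/bin/python
# Minimal sketch: check the semi-sync wait point on a MySQL 5.7 master.
# Assumptions: the semisync master plugin is loaded and pymysql is installed;
# the host and credentials below are placeholders.
import pymysql

conn = pymysql.connect(host="192.168.1.1", user="root", password="secret")
cur = conn.cursor()
cur.execute("SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_wait_point'")
print cur.fetchone()   # e.g. ('rpl_semi_sync_master_wait_point', 'AFTER_SYNC'), or None if the plugin is absent
cur.close()
conn.close()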
So, does PostgreSQL have the same problem?
PostgreSQL synchronous replication
PostgreSQL's synchronous replication roughly proceeds as follows:
1. Write the WAL records to the local WAL file and send them to the slave
2. Flush the WAL records to the local WAL file
3. Update the CLOG
4. Wait for the slave's acknowledgement
5. Return to the client
This resembles MySQL 5.6, or 5.7 with AFTER_COMMIT (see the function RecordTransactionCommit()), so the same data-loss problem should exist. Is that actually the case?
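Whether a commit really waits at step 4 depends on the synchronous_commit and synchronous_standby_names settings. A minimal psycopg2 sketch to confirm them on the master (the connection string is the same placeholder used by the test scripts below):
#!/usr/bin/python
# Sketch: confirm the master is configured for synchronous commit.
# The connection string is a placeholder; adjust it to your environment.
import psycopg2

conn = psycopg2.connect("host=192.168.1.246 dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("SHOW synchronous_commit")
print "synchronous_commit       :", cur.fetchone()[0]
cur.execute("SHOW synchronous_standby_names")
print "synchronous_standby_names:", cur.fetchone()[0]
conn.close()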
Experimental verification
Testing confirms that the problem described above does exist. The test procedure was as follows:
1. Set up a one-master, two-slave cluster based on Pacemaker + Corosync
Follow the steps at http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster (a quick sanity check of the replication state is sketched below).
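Before starting the scripts it helps to confirm that a standby is actually attached in sync state. A small sketch querying pg_stat_replication (again with a placeholder connection string; run it as a superuser so that all columns are visible):
#!/usr/bin/python
# Sketch: verify that at least one standby is streaming in 'sync' state.
# Placeholder connection string; pg_stat_replication hides some columns from
# non-superusers.
import psycopg2

conn = psycopg2.connect("host=192.168.1.246 dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
for name, state, sync_state in cur.fetchall():
    print name, state, sync_state   # expect at least one row with sync_state = 'sync'
conn.close()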
2. Run the following two test scripts at the same time and check whether data is lost during the failover
failover_test.py
Repeatedly updates a column, adding 1 each time. After the failure it reconnects, and once the connection succeeds it checks whether the value on the new master matches the last update. If the value equals either the last successful update or the last failed update, no data has been lost.
#!/usr/bin/python
import psycopg2
import time

url = "host=192.168.1.246 dbname=postgres user=postgres"
conn = psycopg2.connect(url)
conn.autocommit = True

cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS pgsql_ha_test(id serial PRIMARY KEY, num integer);")
cur.execute("truncate pgsql_ha_test;")
cur.execute("insert into pgsql_ha_test values(1,0);")

print "Update process had started,please kill the master..."
i = 0
num = -1
try:
    # Keep incrementing the counter until the connection to the master breaks.
    while True:
        i += 1
        cur.execute("UPDATE pgsql_ha_test set num = %s where id=1", (i,))
        num = i
        if i % 1000 == 0:
            print time.time(), " current num=", num

except psycopg2.Error as e:
    print time.time(), "The master has down, last num:", num
    print e.pgerror
    conn.close()

# Reconnect until the new master accepts connections.
time1 = time.time()
connect_success = False
while not connect_success:
    try:
        conn = psycopg2.connect(url)
        conn.autocommit = True
        connect_success = True
    except psycopg2.Error as e:
        pass

time2 = time.time()
print time.time(), " connect success after %f second" % (time2 - time1)

cur = conn.cursor()
cur.execute("select num from pgsql_ha_test where id=1;")
newnum = cur.fetchone()[0]
print "current num:", newnum

# No data loss if the value equals the last acknowledged update (num)
# or the in-flight update that never received its answer (num + 1).
if not (newnum == num or newnum == num + 1):
    print "NG"
    exit(1)

cur.close()
conn.close()
print "OK"
failover_test2.py
Repeatedly reads the value updated by failover_test.py. After the failure it reconnects, and once the connection succeeds it compares the value on the new master with the last value it successfully read. If the new value is greater than or equal to that last value, no data has been lost.
#!/usr/bin/python
import psycopg2
import time

url = "host=192.168.1.246 dbname=postgres user=postgres"
conn = psycopg2.connect(url)
conn.autocommit = True

cur = conn.cursor()

print "Read process had started..."
i = 0
num = -1
try:
    # Keep reading the counter maintained by failover_test.py until the
    # connection to the master breaks.
    while True:
        i += 1
        cur.execute("select num from pgsql_ha_test where id=1;")
        num = cur.fetchone()[0]
        if i % 10000 == 0:
            print time.time(), " current num=", num

except psycopg2.Error as e:
    print time.time(), " The master has down, last num=", num
    print e.pgerror
    conn.close()

# Reconnect until the new master accepts connections.
time1 = time.time()
connect_success = False
while not connect_success:
    try:
        conn = psycopg2.connect(url)
        conn.autocommit = True
        connect_success = True
    except psycopg2.Error as e:
        time.sleep(1)

time2 = time.time()
print time.time(), " connect success after %f second" % (time2 - time1)

cur = conn.cursor()
cur.execute("select num from pgsql_ha_test where id=1;")
newnum = cur.fetchone()[0]
print time.time(), " current num:", newnum

# The value on the new master must not be smaller than the last value this
# session has already seen; otherwise a previously visible update was lost.
if newnum < num:
    print "NG: Data Lost!"
    exit(1)

cur.close()
conn.close()
print "OK"
3. Kill the postgres processes on the master to simulate a failure and trigger the failover
[root@node73 ~]# ps -ef|grep postgres
root 1228 25775 0 14:09 pts/2 00:00:00 su - postgres
postgres 1229 1228 0 14:09 pts/2 00:00:00 -bash
postgres 3339 31673 17 14:10 ? 00:00:07 postgres: postgres postgres 192.168.1.245(41991) UPDATE waiting for 1/FE556260
postgres 4105 31673 41 14:10 ? 00:00:11 postgres: postgres postgres 192.168.1.245(41994) idle
root 5612 25754 0 14:11 pts/1 00:00:00 grep postgres
postgres 9168 31673 0 13:49 ? 00:00:05 postgres: wal sender process postgres 192.168.1.248(53883) streaming 1/FE556260
postgres 24589 31673 0 13:57 ? 00:00:03 postgres: wal sender process postgres 192.168.1.245(41904) streaming 1/FE5562E8
postgres 31673 1 0 13:45 ? 00:00:00 /opt/PostgreSQL/9.4.5/bin/postgres -D /data/pgsql -c config_file=/data/pgsql//postgresql.conf -p 5432
postgres 31694 31673 0 13:45 ? 00:00:00 postgres: logger process
postgres 31706 31673 0 13:45 ? 00:00:00 postgres: checkpointer process
postgres 31707 31673 0 13:45 ? 00:00:00 postgres: writer process
postgres 31708 31673 0 13:45 ? 00:00:00 postgres: wal writer process
postgres 31709 31673 0 13:45 ? 00:00:00 postgres: autovacuum launcher process
postgres 31710 31673 0 13:45 ? 00:00:00 postgres: stats collector process
[root@n73 ~]# kill -9 31673
[root@node73 ~]# ps -ef|grep postgres
root 1228 25775 0 14:09 pts/2 00:00:00 su - postgres
postgres 1229 1228 0 14:09 pts/2 00:00:00 -bash
postgres 4105 1 38 14:10 ? 00:00:18 postgres: postgres postgres 192.168.1.245(41994) idle
root 7017 25754 0 14:11 pts/1 00:00:00 grep postgres
postgres 31694 1 0 13:45 ? 00:00:00 postgres: logger process
kill -9 may leave stray postgres processes behind; this is a long-standing PostgreSQL quirk. A plain kill leaves none.
4. Check the output of the two test scripts
The result of failover_test.py: judging from the updating session, no data was lost after the failover.
[postgres@node69 pgsql_ha]$ ./failover_test.py
Update process had started,please kill the master...
1447049434.72 current num= 1000
1447049437.25 current num= 2000
1447049439.38 current num= 3000
1447049441.58 current num= 4000
1447049443.71 current num= 5000
1447049445.76 current num= 6000
1447049447.7 current num= 7000
1447049449.69 current num= 8000
1447049451.42 current num= 9000
1447049453.41 current num= 10000
1447049455.31 current num= 11000
1447049457.32 current num= 12000
1447049459.04 current num= 13000
1447049461.02 current num= 14000
1447049463.05 current num= 15000
1447049464.82 current num= 16000
1447049466.71 current num= 17000
1447049469.49 current num= 18000
1447049471.5 current num= 19000
1447049473.68 current num= 20000
1447049475.87 current num= 21000
1447049478.19 current num= 22000
1447049480.2 current num= 23000
1447049482.18 current num= 24000
1447049484.01 current num= 25000
1447049485.23 The master has down, last num: 25529
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
1447049504.15 connect success after 18.918158 second
current num: 25529
OK
The result of failover_test2.py is NG: from the point of view of the bystander session, an update it had once seen disappeared, i.e. data was lost.
[postgres@node69 pgsql_ha]$ ./failover_test2.py
Read process had started...
1447049449.34 current num= 7828
1447049453.46 current num= 10040
1447049457.62 current num= 12162
1447049461.85 current num= 14456
1447049465.87 current num= 16492
1447049469.8 current num= 18165
1447049473.84 current num= 20056
1447049477.83 current num= 21867
1447049481.9 current num= 23860
1447049485.89 current num= 25530
1447049501.8 The master has down, last num= 25530
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
1447049501.81 connect success after 0.006506 second
1447049501.81 current num: 25529
NG: Data Lost!
How harmful is this?
The severity depends on the situation. If the session that saw the doomed update merely looked at it, there is no harm at all. If it used what it saw to update other tables, that is also fine, because that update is bound to fail as well and its WAL records will never reach the slave. But if it cached the value or forwarded it to another data system, the result can be data inconsistency.
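For example, a pattern like the following contrived sketch is what turns the phenomenon into real inconsistency (the external cache is just a stand-in for memcached, Redis, or any downstream system; the table is the one used in the tests above): the cached value may no longer exist on the new master after the failover.
#!/usr/bin/python
# Contrived sketch of the risky pattern: the value read here may belong to a
# commit whose standby acknowledgement never arrived, so it can vanish from
# the database after a failover while the copy below lives on.
import psycopg2

external_cache = {}   # stand-in for memcached/Redis/another data system

conn = psycopg2.connect("host=192.168.1.246 dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("SELECT num FROM pgsql_ha_test WHERE id = 1")
num = cur.fetchone()[0]

# Propagating the value outside the database is the dangerous step.
external_cache["pgsql_ha_test:1"] = num
conn.close()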
So why does PostgreSQL implement it this way? Perhaps to release locks as early as possible, perhaps for some other reason. I plan to ask on the community mailing list.
Also, the probability of hitting this problem is fairly low. In the tests it showed up more often when the processes were killed with kill -9 (probably related to the leftover postgres processes); a real database crash does not trigger it nearly as often.