mysql海量数据表的归档 mysql海量数据表的归档

日常的业务系统，跟用户相关的操作数据，或者日志记录，往往会越来越多，而早期数据的查询需求没有那么强，就可以进行归档处理——作为冷数据保存，并从在线数据库中删除。
普通数据表归档简单的数据归档很好实现，比如有这样一张表：

CREATE TABLE `operation_log` ( `id` bigint(20) NOT NULL AUTO_INCREMENT, `create_date` date DEFAULT NULL, `user_id` int(11) DEFAULT NULL, `type` varchar(20) NOT NULL, `amount` decimal(20,8) NOT NULL, PRIMARY KEY (`id`), KEY `ix_date` (`create_date`), KEY `ix_type_date` (`type`,`create_date`) )

就可以按照时间查询，选出指定日期前的数据，保存后，再从数据库中删除。示例代码如下(python + sqlalchemy)：

# 已获取数据库连接connection，或者session# 查出所有旧数据 records = connection.execute(’select * from operation_log where create_date < "2022-06-01"‘)# 保存查出来的数据到文件，或者到冷备份数据库 # ......# 删除已导出的数据 connection.execute(’delete from operation_log where create_date < "2022-06-01"‘)

海量数据表归档但如果这个表中存放了海量数据，比如全表10亿条，结果集1千万条，显然就不能这样鲁莽的执行了：

1千万条结果集都加载到应用服务器是很危险的，通常会撑爆内存
'create_date'字段虽然有索引，查找速度显然也是不如主键的，

【mysql海量数据表的归档】针对问题1，解决方案就是分段查询，每次迭代查出一批，处理一批，直到处理完所有符合条件的数据；针对问题2，可以利用主键id自增的特性（按记录生成顺序增长），把大部分查询，转化为以主键为条件。代码示例如下：

# 已获取数据库连接connection，或者sessionBATCH_SIZE = 100000 # 设置批量数据为10万条 FINAL_DATE = datetime.date(2022,6,1)records = connection.execute('select id from operation_log order by id limit 1') batch_end_id = records[0]['id']# 先把迭代id初始化为 batch_end_date = datetime.date(1980,1,1) # 设置一个很早的起始日期while batch_end_date < FINAL_DATE: batch_end_id += BATCH_SIZE records = connection.execute('select * from operation_log where id<{}'.format(batch_end_id))for record in record: record_date = record['create_date'] if batch_end_date < record_date: # 获取本次迭代最晚一条记录的时间 batch_end_date = record_date if record_date < FINAL_DATE: # 保存查出来的数据到文件，或者到冷备份数据库 # ......# 删除已导出的数据 if batch_end_date >= FINAL_DATE: # 只有最后一次迭代，以date为条件，保证删除的数据不超过 connection.execute('delete from operation_log where id<{} and create_date < "{}"'.format(batch_end_id, FINAL_DATE.strftime('%Y-%m-%d'))) break else: # 正常迭代只有id作为查询条件，速度会快很多 connection.execute('delete from operation_log where id<{}'.format(batch_end_id))