[Python] Reading and Analyzing Large Files

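First of all, we can rule out read() (called without a size argument) and readlines(),
because both load the entire file into memory at once, which can exhaust the available memory.
Files whose data is split by lines
Taking line counting as an example, let's first compare several approaches: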
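1. Iterating with for. This looks clean, and nine times out of ten it is what a web search will tell you to do. It is already fast, but not the fastest.
import time

# dataPath: path to the data file, defined elsewhere
start = time.time()
with open(dataPath, 'r') as f:
    count = 0
    for line in f:
        count += 1
print(count)
print(time.time() - start)
Output: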
5000
0.09386205673217773
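2. Simulating the iteration with readline(). The result turns out to be about the same as the first approach.
import time

start = time.time()
with open(dataPath, 'r') as f:
    count = 0
    line = f.readline()
    # readline() returns an empty string at end of file
    while line:
        count += 1
        line = f.readline()
print(count)
print(time.time() - start)
Output: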
5000
0.09433221817016602
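3. For comparison, iterating over readlines() directly. This turns out to be even slower!
import time

start = time.time()
with open(dataPath, 'r') as f:
    count = 0
    # readlines() builds the full list of lines in memory first
    for line in f.readlines():
        count += 1
print(count)
print(time.time() - start)
Output: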
5000
0.12223696708679199
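4. Checking the file position continuously. Sometimes we need to stop once we reach a specific position in the file, and it turns out that tell() is very expensive!
import os
import time

datasize = os.path.getsize(dataPath)  # assumed definition: file size in bytes
start = time.time()
with open(dataPath, 'r') as f:
    count = 0
    while f.tell() < datasize:
        f.readline()
        count += 1
print(count)
print(time.time() - start)
Output: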
5000
0.29171299934387207
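5. Using mmap. mmap maps a file into virtual memory: a file or other object is mapped into the process's address space, establishing a one-to-one correspondence between the file's disk addresses and a range of virtual addresses in the process. With a memory map, the operating system's virtual memory is used to access data on the file system directly, instead of the regular I/O functions. Memory mapping can usually improve I/O performance, because it needs neither a separate system call for each access nor copying of data between buffers; both the kernel and the user application can access the memory directly. It is the fastest method I have measured so far.
import mmap
import time

start = time.time()
with open(dataPath, "r") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    count = 0
    # readline() returns b"" once the end of the map is reached
    while mm.readline():
        count += 1
    print(count)
    mm.close()
print(time.time() - start)
Output: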
5000
0.023865938186645508
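One caveat with the mmap approach: mmap.mmap raises ValueError when asked to map an empty file. The following is a minimal sketch of a guarded line counter; the helper name count_lines_mmap is hypothetical and not part of the original measurements.

```python
import mmap
import os


def count_lines_mmap(path):
    """Count lines by memory-mapping the file read-only (hypothetical helper)."""
    # Mapping a zero-length file raises ValueError, so handle it up front.
    if os.path.getsize(path) == 0:
        return 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b"" at the end of the map, which stops iter().
            return sum(1 for _ in iter(mm.readline, b""))
```

Calling count_lines_mmap(dataPath) should give the same count as the loop above, with the map closed automatically by the with statement.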
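6. Instead of reading line by line, you can read in blocks and count the \n characters in each block. This only works for counting lines, though, and what we really want is to read the data line by line, so only the implementation is given here, without a timing comparison.
with open(r"d:\lines_test.txt", 'rb') as f:
    count = 0
    while True:
        # Read 8 MiB at a time
        buffer = f.read(1024 * 8192)
        if not buffer:
            break
        # The file is opened in binary mode, so count the byte b'\n'
        count += buffer.count(b'\n')
print(count)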
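Counting \n bytes comes out one short when the last line has no trailing newline. Here is a minimal sketch that accounts for this, assuming a binary file object; the helper name count_lines_chunked is hypothetical.

```python
import io


def count_lines_chunked(f, chunk_size=8 * 1024 * 1024):
    """Count lines in a binary file object by counting newline bytes per chunk."""
    count = 0
    last_byte = b""
    while True:
        buf = f.read(chunk_size)
        if not buf:
            break
        count += buf.count(b"\n")
        last_byte = buf[-1:]
    # A non-empty file whose final line lacks a newline still has that line.
    if last_byte and last_byte != b"\n":
        count += 1
    return count


print(count_lines_chunked(io.BytesIO(b"a\nb\nc")))    # 3
print(count_lines_chunked(io.BytesIO(b"a\nb\nc\n")))  # 3
```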
The MPI case
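When the file is very large and the task needs to be parallelized, we can split the file into several segments. For example, on a 4-core machine, 4 processes can each handle a different part of the file, each reading a quarter of the data. The split point will not necessarily fall on a newline, however. So each process starts reading from the first newline after its split point, and whatever comes before that newline is left to the previous process, as shown in the figure: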

[Figure: article image]
The implementation looks something like this:
from mpi4py import MPI
import platform
import sys
import io
import os
import mmap

# Force UTF-8 output regardless of the console's default encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
comm = MPI.COMM_WORLD
comm_size = comm.size
comm_rank = comm.rank

# filePath, blockSize (bytes per process), dataSize (file size in bytes)
# and dosomething() are defined elsewhere
with open(filePath, 'r', encoding='utf-8') as f:
    # Use mmap to run faster
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    # Set the file pointer to the beginning of a line after blockSize * rank
    mm.seek(comm_rank * blockSize)
    if comm_rank != 0:
        # Skip the line that straddles the split point; the previous rank reads it
        mm.readline()
    # Each process handles about blockSize bytes
    blockEnd = (comm_rank + 1) * blockSize
    # Keep the position in index to avoid calling mm.tell() twice
    index = mm.tell()
    while index <= blockEnd and index < dataSize:
        # line = mm.readline().translate(None, b'\x00').decode()
        line = mm.readline().decode('utf-8')
        index = mm.tell()
        try:
            dosomething(line)
        except Exception as err:
            print(err)
            continue
    mm.close()
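The boundary rule can be checked without MPI. The sketch below simulates each rank in a plain loop over an in-memory io.BytesIO, whose seek/readline/tell behave like mmap's for this purpose. The function lines_for_rank and the sample data are hypothetical, and the block size is derived from the data size by ceiling division, which is an assumption since the original does not say how blockSize is chosen.

```python
import io


def lines_for_rank(buf, rank, block_size, data_size):
    """Return the lines a given rank would process under the split rule."""
    buf.seek(rank * block_size)
    if rank != 0:
        # Skip the (possibly partial) line at the split point; the previous
        # rank reads it.
        buf.readline()
    block_end = (rank + 1) * block_size
    out = []
    index = buf.tell()
    # Read every line that starts at or before the end of this block.
    while index <= block_end and index < data_size:
        out.append(buf.readline())
        index = buf.tell()
    return out


data = b"alpha\nbeta\ngamma\ndelta\nepsilon\n"
n_ranks = 4
block_size = -(-len(data) // n_ranks)  # ceiling division (assumed choice)
parts = [lines_for_rank(io.BytesIO(data), r, block_size, len(data))
         for r in range(n_ranks)]
# Every line is handled by exactly one simulated rank, in order.
assert b"".join(b"".join(p) for p in parts) == data
```

A line that starts exactly on a block boundary is read by the earlier rank (because of the <= comparison) and skipped by the later one, so no line is processed twice.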
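If you use f.tell() instead of the mmap object's tell(), performance is extremely poor. When I first ran into this, my idea was to track the file position myself by repeatedly adding len(line). That raised another problem: file.readline() normalizes some characters for you (for example, \r\n is kept only as \n), while mmap.readline() does not. I tried it and found it very hard to get right; the result never lined up with f.tell().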
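One way to sidestep the mismatch, offered as a sketch and not as the author's method: read in binary mode, where no newline translation takes place, so a running sum of len(line) equals the byte offset. The helper name iter_lines_with_offset is hypothetical.

```python
import io


def iter_lines_with_offset(f):
    """Yield (start_offset, line) pairs from a binary file object."""
    offset = 0
    for line in f:
        yield offset, line
        # Binary mode keeps b"\r\n" intact, so len(line) is the true byte count.
        offset += len(line)


sample = io.BytesIO(b"one\r\ntwo\r\nthree\n")
for start, line in iter_lines_with_offset(sample):
    print(start, line)
# 0 b'one\r\n'
# 5 b'two\r\n'
# 10 b'three\n'
```

Lines are decoded only after the offset has been updated, so the position stays in bytes even when the text contains multi-byte characters.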
Data split by a special delimiter
Since the row delimiter may not be \n, we can read the data like this:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.
    Usage:
    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            # Skip past the separator; a multi-character sep that is split
            # across two chunks is not handled
            chunk = chunk[i + len(sep):]
        # Keep the unfinished tail for the next chunk
        curr_row += chunk
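A quick check of the generator on an in-memory stream, with a deliberately small chunksize so that rows span several chunks. The function is repeated here so the block runs on its own; the sample string is made up for illustration.

```python
import io


def rows(f, chunksize=1024, sep='|'):
    """Lazily yield rows separated by sep from a text file object."""
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i + len(sep):]
        curr_row += chunk


print(list(rows(io.StringIO("ab|cde|f"), chunksize=3)))
# ['ab', 'cde', 'f']
```

Note that input ending with the separator produces an empty final row, because whatever follows the last separator is always yielded at end of file.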
Data with no particular delimiter
One approach is to use yield:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
Another approach is to use iter with a helper function:
f = open('really_big_file.dat')

def read1k():
    return f.read(1024)

# iter() keeps calling read1k() until it returns the sentinel ''
for piece in iter(read1k, ''):
    process_data(piece)
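The helper function can be replaced with functools.partial, a common variant of the same iter(callable, sentinel) idiom. Below is a small sketch on an in-memory stream; for a file opened in binary mode the sentinel would be b'' instead of ''.

```python
import io
from functools import partial

f = io.StringIO("x" * 2500)
# iter() calls f.read(1024) repeatedly until it returns the sentinel ''.
sizes = [len(piece) for piece in iter(partial(f.read, 1024), '')]
print(sizes)  # [1024, 1024, 452]
```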
