I have a CSV file (in French) whose lines look like this:
"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"
I read it in Python with the following code:
import csv

filepath = 'RDV.csv'
try:
    with open(filepath, 'rU') as file:
        try:
            reader = csv.reader(x.replace('\0', '') for x in file)
            for row in reader:
                try:
                    print(row)
                except Exception as ee:
                    print ee
        except Exception as eee:
            print eee
except Exception as e:
    print e
The output looks like this:
['Vend, 21 sept, 2018', '43326\x1d\x1d37089\x1d\x1d43327\x1d43328\x1d17783\x1d28884\x1d\x1d\x1d43325\x1d33381\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d5370', '\x1d\x1d\x1d\x1dNX', '651-2141\x1d\x1d652-1309\x1dNON\x1d666-3778\x1d692-2229\x1d581-300-6525\x1d622-9439\x1d\x1dNON\x1d581-998-8765\x1d827-3937\x1dSTOP\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dNON\x1d653-2541\x1d\x1d\x1dToronto', 'Roy\x1d\x1dRoy\x1d\x1dHoude\x1dOuellet\x1dFecteau\x1dRenaud\x1d\x1d\x1dBergeron\x1dLeclerc\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dBadeaux', 'Louise-Andr\x8ee\x1d\x1dAndr\x8e\x1d\x1dRichard\x1dAlexandra\x1dPauline\x1dEliane\x1d\x1d\x1dCharles-Eug\x8fne\x1dGuy\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dJacqueline', 'Vendredi, 21 septembre, 2018', '', '', '3', '37089', '', '100', '', '204-7584', 'MIller ', 'claudia', "8:30 pt ne s'est pas pr\x8esent\x8e (gastro) veut un autre rdv\x0b", '370892192018', '\x1d\x1d581-309-1309\x1d\x1d\x1d\x1d660-3064fille\x1d254-6560cel\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d650-4556']
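- How can I read it as plain text instead of those escaped characters?
- How can I find the characters that appear as "?" in a value? For example: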
Louise-Andr?eAndr?RichardAlexandraPaulineElianeCharles-Eug?neGuyJacqueline
I tried the code from snakecharmerb's answer, but got the following error:
Traceback (most recent call last):
  File "<input>", line 20, in <module>
  File "<input>", line 9, in unicode_csv_reader
  File "<input>", line 15, in utf_8_encoder
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 701, in next
    return self.reader.next()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 632, in next
    line = self.readline()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 547, in readline
    data = self.read(readsize, firstline=True)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 494, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/encodings/utf_16.py", line 112, in decode
    raise UnicodeError, "UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM
#1 The file is probably encoded as UTF-16.
>>> import csv, io
>>> s = '"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s\'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"'
>>> buf = io.BytesIO(s.decode('utf-8').encode('utf-16'))
>>> next(csv.reader(buf))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: line contains NULL byte
Python 2's csv module does not handle UTF-16, and neither does the unicodecsv package. However, we can adapt the unicode_csv_reader example from the documentation:
import codecs
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

with codecs.open('french2.csv', 'rU', encoding='utf-16') as f:
    for row in unicode_csv_reader(f):
        for cell in row:
            print cell
The code produces the following output (one cell is printed per line only to show the accented characters):
Vend, 21 sept, 2018
43326370894332743328177832888443325333815370
NX
651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto
RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux
Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline
Vendredi, 21 septembre, 2018


3
37089

100

204-7584
MIller
claudia
8:30 pt ne s'est pas présenté (gastro) veut un autre rdv
370892192018
581-309-1309660-3064fille254-6560cel650-4556
In Python 3, none of this is necessary; you can simply do:
with open(myfile, 'r', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    for row in reader:
        ...
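Commentary
Identifying the encoding
There is no general solution: guessing an unknown encoding is a hard problem. In this case, we know that the encoded text contains null bytes, and that removing the null bytes leaves hex escapes where we would expect accented European characters, while unaccented characters are unchanged. That is reasonable evidence that the file may be encoded as UTF-16. For characters in the ASCII range, UTF-16 effectively places a null byte either before or after the ASCII byte.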
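The null-byte observation above can be turned into a rough check. The following is a minimal Python 3 sketch, not a general encoding detector: the function name `guess_utf16_byte_order` and the 50% and 10% thresholds are assumptions chosen for illustration, and it only makes sense for text that is mostly ASCII.

```python
def guess_utf16_byte_order(raw):
    """Guess UTF-16 byte order from where the null bytes fall.

    Returns 'utf-16-le', 'utf-16-be', or None when the evidence is weak.
    Heuristic only: the thresholds below are illustrative assumptions.
    """
    pairs = len(raw) // 2
    if pairs == 0:
        return None
    even_nulls = raw[0::2].count(0)  # nulls at byte offsets 0, 2, 4, ...
    odd_nulls = raw[1::2].count(0)   # nulls at byte offsets 1, 3, 5, ...
    # ASCII text in little-endian UTF-16 has the null as the second byte.
    if odd_nulls > pairs * 0.5 and even_nulls < pairs * 0.1:
        return 'utf-16-le'
    # In big-endian UTF-16 the null comes first.
    if even_nulls > pairs * 0.5 and odd_nulls < pairs * 0.1:
        return 'utf-16-be'
    return None


sample = 'André,Roy'
print(guess_utf16_byte_order(sample.encode('utf-16-le')))
print(guess_utf16_byte_order(sample.encode('utf-16-be')))
print(guess_utf16_byte_order(sample.encode('utf-8')))
```

A result of None does not prove the data is not UTF-16; it only means the null bytes do not follow the pattern expected for mostly-ASCII text.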
>>> u = u'André'
>>> s = u.encode('utf-16-le')
>>> s
'A\x00n\x00d\x00r\x00\xe9\x00'
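UTF-16 can be big-endian or little-endian; the byte order determines whether the null byte comes before or after the ASCII byte. The data may begin with a byte order mark (BOM) that indicates the byte order; in that case the encoding can be given simply as UTF-16 and Python will pick the correct variant. If there is no BOM, you must specify utf-16-le or utf-16-be explicitly.
The character '\ufffd' (U+FFFD) is the Unicode replacement character, used to render characters that cannot be decoded or displayed in the chosen encoding (for example when the errors handler is 'replace', whether set explicitly or implicitly).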
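As a small illustration of the BOM rule, the Python 3 sketch below checks the first two bytes and chooses a codec name accordingly. The helper name `utf16_codec_for` and its little-endian default are assumptions made for this example; when no BOM is present the right byte order has to come from other knowledge about the file.

```python
import codecs


def utf16_codec_for(raw, default='utf-16-le'):
    """Pick a codec name for UTF-16 bytes based on the BOM, if any.

    `default` is an assumed fallback for BOM-less data (hypothetical choice).
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic 'utf-16' codec reads the BOM and strips it.
        return 'utf-16'
    return default


with_bom = 'André'.encode('utf-16')         # the codec prepends a BOM
without_bom = 'André'.encode('utf-16-le')   # no BOM is written

print(with_bom.decode(utf16_codec_for(with_bom)))
print(without_bom.decode(utf16_codec_for(without_bom)))
```

The returned name can be passed as the `encoding` argument of `open` after peeking at the first two bytes of the file in binary mode.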
>>> print s
Andr?
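Reading the csv
Python 2's csv module does not handle non-ASCII encodings well. To work around its limitations, the code above:
- decodes the file contents from utf-16 to unicode
- re-encodes them as utf-8 (to avoid null bytes)
- decodes the contents of each cell from utf-8 back to unicode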
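The reason for the UTF-8 detour can be seen by comparing the bytes at each step. This is a Python 3 sketch on a made-up one-line sample (the variable names are illustrative); it shows the byte-level effect of the steps rather than the Python 2 reader itself.

```python
# A hypothetical single CSV line containing accented characters.
line = '"André","Charles-Eugène"\r\n'
utf16_bytes = line.encode('utf-16')

# Step 1: decode the UTF-16 bytes to text.
text = utf16_bytes.decode('utf-16')

# Step 2: re-encode as UTF-8; ASCII characters become single bytes,
# so the null bytes that upset the Python 2 csv module disappear.
utf8_bytes = text.encode('utf-8')
print(b'\x00' in utf16_bytes)
print(b'\x00' in utf8_bytes)

# Step 3: decoding the UTF-8 bytes recovers the original text unchanged.
print(utf8_bytes.decode('utf-8') == line)
```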
In Python 3, handling non-ASCII text is much simpler. This code does all the work:
import csv

with open('french.csv', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    # Iterate over the reader, not the file, to get parsed rows.
    for row in reader:
        print(row)
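To try the Python 3 approach without the original file, a round trip through a temporary file is one option. The sample rows below are invented for illustration and the temporary path is created by `tempfile`; nothing here depends on the asker's data.

```python
import csv
import os
import tempfile

# Hypothetical sample rows with accented characters and an empty cell.
rows = [['Louise-Andrée', 'André', 'Charles-Eugène'], ['Roy', '', 'Houde']]

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
try:
    # Writing with encoding='utf-16' puts a BOM at the start of the file.
    with open(path, 'w', newline='', encoding='utf-16') as f:
        csv.writer(f).writerows(rows)

    # Reading with the same encoding yields ordinary str cells.
    with open(path, newline='', encoding='utf-16') as f:
        read_back = list(csv.reader(f))
finally:
    os.remove(path)

print(read_back == rows)
```

Passing `newline=''` in both calls leaves line-ending handling to the csv module, as its documentation recommends.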