Garbled text when reading a CSV (how to read a CSV that contains ? characters)

I have a CSV file (in French) whose rows look like this:

"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"

I read it in Python with the following code:

    import csv

    filepath = 'RDV.csv'
    try:
        with open(filepath, 'rU') as file:
            try:
                # strip null bytes so the Python 2 csv module does not choke on them
                reader = csv.reader(x.replace('\0', '') for x in file)
                for row in reader:
                    try:
                        print(row)
                    except Exception as ee:
                        print ee
            except Exception as eee:
                print eee
    except Exception as e:
        print e

The output is as follows:
['Vend, 21 sept, 2018', '43326\x1d\x1d37089\x1d\x1d43327\x1d43328\x1d17783\x1d28884\x1d\x1d\x1d43325\x1d33381\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d5370', '\x1d\x1d\x1d\x1dNX', '651-2141\x1d\x1d652-1309\x1dNON\x1d666-3778\x1d692-2229\x1d581-300-6525\x1d622-9439\x1d\x1dNON\x1d581-998-8765\x1d827-3937\x1dSTOP\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dNON\x1d653-2541\x1d\x1d\x1dToronto', 'Roy\x1d\x1dRoy\x1d\x1dHoude\x1dOuellet\x1dFecteau\x1dRenaud\x1d\x1d\x1dBergeron\x1dLeclerc\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dBadeaux', 'Louise-Andr\x8ee\x1d\x1dAndr\x8e\x1d\x1dRichard\x1dAlexandra\x1dPauline\x1dEliane\x1d\x1d\x1dCharles-Eug\x8fne\x1dGuy\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dJacqueline', 'Vendredi, 21 septembre, 2018', '', '', '3', '37089', '', '100', '', '204-7584', 'MIller ', 'claudia', "8:30 pt ne s'est pas pr\x8esent\x8e (gastro) veut un autre rdv\x0b", '370892192018', '\x1d\x1d581-309-1309\x1d\x1d\x1d\x1d660-3064fille\x1d254-6560cel\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d650-4556']

  1. How can I read it as plain text rather than as those escaped characters?
  2. How do I find the ? characters in the values? For example:
    Louise-Andr?eAndr?RichardAlexandraPaulineElianeCharles-Eug?neGuyJacqueline
Edit:
I tried the code from snakecharmerb's answer, but got the following error:

    Traceback (most recent call last):
      File "<input>", line 20, in <module>
      File "<input>", line 9, in unicode_csv_reader
      File "<input>", line 15, in utf_8_encoder
      File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 701, in next
        return self.reader.next()
      File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 632, in next
        line = self.readline()
      File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 547, in readline
        data = self.read(readsize, firstline=True)
      File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 494, in read
        newchars, decodedbytes = self.decode(data, self.errors)
      File "/Users/simran/Documents/abc/venv/lib/python2.7/encodings/utf_16.py", line 112, in decode
        raise UnicodeError, "UTF-16 stream does not start with BOM"
    UnicodeError: UTF-16 stream does not start with BOM

Answer #1: The file may be encoded as UTF-16.

    >>> import csv, io
    >>> s = '"Vend, 21 sept, 2018", "43326370894332743328177832888443325333815370", "NX", "651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto", "RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux", "Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline", "Vendredi, 21 septembre, 2018", "", "", "3", "37089", "", "100", "", "204-7584", "MIller ", "claudia", "8:30 pt ne s\'est pas présenté (gastro) veut un autre rdv", "370892192018", "581-309-1309660-3064fille254-6560cel650-4556"'
    >>> buf = io.BytesIO(s.decode('utf-8').encode('utf-16'))
    >>> next(csv.reader(buf))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    _csv.Error: line contains NULL byte

Python 2's csv module does not handle UTF-16, and neither does the unicodecsv package. However, we can adapt the unicode_csv_reader example from the documentation:

    import codecs
    import csv

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    with codecs.open('french2.csv', 'rU', encoding='utf-16') as f:
        for row in unicode_csv_reader(f):
            for cell in row:
                print cell

The code produces the following output (each cell is printed on its own line just to show the accented characters; empty cells print as blank lines):

    Vend, 21 sept, 2018
    43326370894332743328177832888443325333815370
    NX
    651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto
    RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux
    Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline
    Vendredi, 21 septembre, 2018


    3
    37089

    100

    204-7584
    MIller 
    claudia
    8:30 pt ne s'est pas présenté (gastro) veut un autre rdv
    370892192018
    581-309-1309660-3064fille254-6560cel650-4556

In Python 3 none of this is necessary; you can simply do:

    import csv

    with open(myfile, 'r', newline='', encoding='utf-16') as f:
        reader = csv.reader(f)
        for row in reader:
            ...

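Commentary
Identifying the encoding
There is no general solution: guessing an unknown encoding is a hard problem. In this case we know that the encoded text contains null bytes, and that removing the null bytes leaves hex escapes where we would expect accented European characters, while unaccented characters are unchanged. That is reasonable evidence that the file may be encoded as UTF-16. For characters in the ASCII range, UTF-16 effectively adds a null byte before or after the ASCII byte.

    >>> u = u'André'
    >>> s = u.encode('utf-16-le')
    >>> s
    'A\x00n\x00d\x00r\x00\xe9\x00'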

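UTF-16 can be big-endian or little-endian; the byte order determines whether the null byte comes before or after the ASCII byte. The encoded bytes may begin with a byte order mark (BOM) that indicates the byte order; in that case the encoding can be specified as utf-16 and Python will pick the right byte order. Without a BOM, utf-16-le or utf-16-be must be specified explicitly.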
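The BOM and null-byte observations above can be turned into a rough check. The following is a minimal Python 3 sketch, not a robust detector: `sniff_utf16` is a hypothetical helper name, the one-quarter threshold is an arbitrary assumption, and the null-byte heuristic only makes sense for text that is mostly ASCII.

```python
import codecs

def sniff_utf16(raw):
    """Guess a UTF-16 codec name for `raw` (bytes), or return None.

    Hypothetical helper: a heuristic for mostly-ASCII text only.
    """
    # A BOM settles the question; the generic 'utf-16' codec consumes it.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # No BOM: for ASCII-range characters the null byte is the second byte
    # of each pair in little-endian data and the first in big-endian data.
    even_nulls = raw[0::2].count(b'\x00')
    odd_nulls = raw[1::2].count(b'\x00')
    threshold = len(raw) // 4  # arbitrary cut-off, an assumption
    if odd_nulls > even_nulls and odd_nulls > threshold:
        return 'utf-16-le'
    if even_nulls > odd_nulls and even_nulls > threshold:
        return 'utf-16-be'
    return None
```

In practice one might read the first few kilobytes of the file in binary mode, pass them to this function, and fall back to trying other encodings when it returns None.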
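The character u'\ufffd' is the Unicode replacement character, used to stand in for data that cannot be represented in the chosen encoding (this assumes the errors parameter is set to 'replace', explicitly or implicitly; with 'replace', decoding substitutes u'\ufffd', while str.encode substitutes '?').

    >>> print s
    Andr?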
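To make the effect of the 'replace' error handler concrete, here is a small Python 3 sketch. The sample strings are made up for illustration; they only show how an accented character can turn into ? or into the replacement character when the wrong codec is used.

```python
text = 'André'

# Encoding to a codec that lacks 'é': 'replace' substitutes an ASCII '?'.
encoded = text.encode('ascii', errors='replace')

# Decoding bytes that are not valid in the chosen codec: 'replace'
# substitutes U+FFFD, the Unicode replacement character.
decoded = b'Andr\xe9'.decode('utf-8', errors='replace')

print(encoded)        # b'Andr?'
print(repr(decoded))  # 'Andr\ufffd'
```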

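Reading the csv
Python 2's csv module does not handle non-ASCII encodings well. To work around its limitations, the code above
  • decodes the file content from utf-16 to unicode
  • re-encodes it as utf-8 (to avoid null bytes)
  • decodes the content of each cell from utf-8 back to unicode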
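Once the content has been returned to the program as unicode, it can be processed without problems until it has to be encoded again to be written to a file or printed.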
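The reason the UTF-8 detour helps can be shown with a few lines of Python 3. This is only an illustrative sketch with a made-up sample row: in Python 3 the csv module takes text rather than bytes, so the final decode happens before parsing instead of cell by cell as in the Python 2 code.

```python
import csv
import io

line = '"Louise-Andrée","Charles-Eugène"\r\n'  # made-up sample row

# What would be on disk: UTF-16 is full of null bytes for ASCII characters.
utf16_bytes = line.encode('utf-16-le')
has_nulls_utf16 = b'\x00' in utf16_bytes

# Step 1: decode UTF-16 to text. Step 2: re-encode as UTF-8.
utf8_bytes = utf16_bytes.decode('utf-16-le').encode('utf-8')
has_nulls_utf8 = b'\x00' in utf8_bytes

# Step 3 (Python 3 variant): decode the UTF-8 bytes, then parse.
cells = next(csv.reader(io.StringIO(utf8_bytes.decode('utf-8'))))

print(has_nulls_utf16, has_nulls_utf8)  # True False
print(cells)
```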
In Python 3, handling non-ASCII text is much simpler. This code does all of the work:

    import csv

    with open('french.csv', newline='', encoding='utf-16') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)
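A quick way to check the Python 3 approach without the original file is to write a small UTF-16 CSV and read it back. The sketch below uses a temporary file and made-up rows; neither is taken from the question's data.

```python
import csv
import os
import tempfile

rows = [['Louise-Andrée', 'Charles-Eugène'], ['8:30', 'présenté']]  # made-up data

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
try:
    # Writing with 'utf-16' emits a BOM, so reading with 'utf-16' works.
    with open(path, 'w', newline='', encoding='utf-16') as f:
        csv.writer(f).writerows(rows)
    with open(path, newline='', encoding='utf-16') as f:
        read_back = list(csv.reader(f))
finally:
    os.remove(path)

print(read_back == rows)  # True
```

If a real file has no BOM, the same pattern applies with encoding='utf-16-le' or encoding='utf-16-be' instead.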
