What to Do When Text Analysis Runs into Mojibake (à¸‡'âŒ£')à¸‡
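Text analysis often runs into garbled data. When we hit an encoding problem there is usually little we can do, so we just skip the text that cannot be decoded: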
text = open(file, errors='ignore').read()
But this throws information away. So how do we tame the gremlins that keep wreaking havoc in our text analysis?
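To make the loss concrete, here is a small sketch that uses only the standard library. It assumes the bytes were written as Latin-1 but are decoded as UTF-8, which is just one way this situation can arise; the sample string is made up for illustration.

```python
# Bytes written as Latin-1, but read back as if they were UTF-8.
raw = "café único".encode("latin-1")

# errors="ignore" silently drops every byte that is not valid UTF-8.
lossy = raw.decode("utf-8", errors="ignore")

# errors="replace" at least leaves a visible marker (U+FFFD) behind.
marked = raw.decode("utf-8", errors="replace")

print(lossy)   # caf nico
print(marked)  # caf� �nico
```

The accented letters vanish without a trace in the first result, which is the kind of silent loss that motivates looking for a repair tool.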
Just remind yourself that Python is great: ftfy ("fixes text for you") can clean up garbled data for us.
Installation
!pip3 install ftfy==5.6
Mojibake (à¸‡'âŒ£')à¸‡ examples
I found these odd-looking strings in the official documentation. Some of you have probably run into data like this before.
(à¸‡'âŒ£')à¸‡
uÌˆnicode
Broken text&hellip; it&#x2019;s ﬂubberiﬁc!
HTML entities &lt;3
&macr;\\_(ã\x83\x84)_/&macr;
\ufeffParty like\nit&rsquo;s 1999!
ＬＯＵＤ　ＮＯＩＳＥＳ
This â€” should be an em dash
This text was never UTF-8 at all\x85
\033[36;44mI'm blue, da ba dee da ba doo...\033[0m
\u201chere\u2019s a test\u201d
This string is made of two things:\u2029 1. Unicode\u2028 2. Spite
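Several of the strings above are classic mojibake: UTF-8 bytes that were decoded with a single-byte encoding. The sketch below reproduces two of them with the standard library, assuming Windows-1252 (cp1252) was the wrong decoder. The helper names garble and ungarble are hypothetical, and this round trip covers only that one known case, which is far less than what ftfy detects on its own.

```python
def garble(text: str, wrong: str = "cp1252") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong)


def ungarble(text: str, wrong: str = "cp1252") -> str:
    """Reverse the mistake, assuming we already know the wrong codec."""
    return text.encode(wrong).decode("utf-8")


print(garble("único"))      # Ãºnico
print(garble("(ง'⌣')ง"))    # (à¸‡'âŒ£')à¸‡
print(ungarble("Ãºnico"))   # único
```

In real data the wrong codec is usually unknown and may have been applied more than once, which is why a heuristic library is more practical than a hand-written round trip.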
ftfy.fix_text: a cure for all kinds of broken text
The fix_text function in ftfy can subdue the vast majority of mojibake like (à¸‡'âŒ£')à¸‡.
from ftfy import fix_text
fix_text("(à¸‡'âŒ£')à¸‡")
"(ง'⌣')ง"
fix_text('uÌˆnicode')
'ünicode'
fix_text('Broken text&hellip; it&#x2019;s ﬂubberiﬁc!')
"Broken text… it's flubberific!"
fix_text('HTML entities &lt;3')
'HTML entities <3'
fix_text("&macr;\\_(ã\x83\x84)_/&macr;")
'¯\\_(ツ)_/¯'
fix_text('\ufeffParty like\nit&rsquo;s 1999!')
"Party like\nit's 1999!"
fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ')
'LOUD NOISES'
fix_text('Ãºnico')
'único'
fix_text('This â€” should be an em dash')
'This — should be an em dash'
fix_text('This text is sad .â\x81”.')
'This text is sad .⁔.'
fix_text('The more you know ðŸŒ\xa0')
'The more you know 🌠'
fix_text('This text was never UTF-8 at all\x85')
'This text was never UTF-8 at all…'
fix_text("\033[36;44mI'm blue, da ba dee da ba doo...\033[0m")
"I'm blue, da ba dee da ba doo..."
fix_text('\u201chere\u2019s a test\u201d')
'"here\'s a test"'
text = "This string is made of two things:\u2029 1. Unicode\u2028 2. Spite"
fix_text(text)
'This string is made of two things:\n 1. Unicode\n 2. Spite'
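A few of the fixes shown above have rough standard-library counterparts, which can help build intuition for what is going on. The sketch below is not how ftfy is implemented. The helper rough_clean is a hypothetical name, it does not attempt any mojibake repair, and it swaps in NFKC normalization, which is more aggressive than what the examples above show (for instance, it would also rewrite "…" as "...").

```python
import html
import re
import unicodedata

# Matches ANSI color/style sequences such as "\033[36;44m" and "\033[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def rough_clean(text: str) -> str:
    """Apply a few simple cleanups; an illustration, not ftfy's algorithm."""
    text = html.unescape(text)                   # "&lt;" -> "<"
    text = ANSI_SGR.sub("", text)                # drop terminal color codes
    text = text.replace("\ufeff", "")            # drop byte order marks
    return unicodedata.normalize("NFKC", text)   # fullwidth -> ASCII, etc.


print(rough_clean("HTML entities &lt;3"))
print(rough_clean("ＬＯＵＤ　ＮＯＩＳＥＳ"))
print(rough_clean("\033[36;44mI'm blue, da ba dee da ba doo...\033[0m"))
```

Each step handles one narrow kind of damage. The appeal of fix_text is that it decides which repairs apply, including the encoding mix-ups that this sketch leaves untouched.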
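ftfy.fix_file: a cure for all kinds of broken files
The examples above all fix strings, but ftfy can also process garbled files directly. I won't demonstrate that here. The next time you run into mojibake, you'll know that ftfy, the "fixes text for you" library, offers fix_text and fix_file to help.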
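For readers who want a starting point for files, here is a minimal sketch of how fix_file might be used. It assumes that ftfy.fix_file accepts an open file object and yields repaired lines, as the ftfy documentation describes. The function repair_file, its arguments, and the line_fixer parameter are hypothetical names, added so the plumbing can be tried even without ftfy installed.

```python
from typing import Callable, IO, Iterable, Optional


def repair_file(
    src: str,
    dst: str,
    line_fixer: Optional[Callable[[IO[str]], Iterable[str]]] = None,
) -> int:
    """Copy src to dst, passing the text through a fixer; return lines written."""
    if line_fixer is None:
        # Imported lazily so the function can be exercised without ftfy.
        from ftfy import fix_file as line_fixer

    written = 0
    # Assumption: the file is nominally UTF-8; undecodable bytes are kept
    # visible as U+FFFD instead of being dropped.
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in line_fixer(fin):
            fout.write(line)
            written += 1
    return written
```

With ftfy installed, a call such as repair_file("raw.txt", "clean.txt") (both file names are placeholders) would write a repaired copy and leave the original file untouched.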