文章图片
【列表|整理了25个Python文本处理案例,收藏!】作者 | 周萝卜
来源 | 萝卜大杂烩
Python 处理文本是一项非常常见的功能,本文整理了多种文本提取及NLP相关的案例,还是非常用心的
- 提取 PDF 内容
- 提取 Word 内容
- 提取 Web 网页内容
- 读取 Json 数据
- 读取 CSV 数据
- 删除字符串中的标点符号
- 使用 NLTK 删除停用词
- 使用 TextBlob 更正拼写
- 使用 NLTK 和 TextBlob 的词标记化
- 使用 NLTK 提取句子单词或短语的词干列表
- 使用 NLTK 进行句子或短语词形还原
- 使用 NLTK 从文本文件中查找每个单词的频率
- 从语料库中创建词云
- NLTK 词法散布图
- 使用 countvectorizer 将文本转换为数字
- 使用 TF-IDF 创建文档术语矩阵
- 为给定句子生成 N-gram
- 使用带有二元组的 sklearn CountVectorize 词汇规范
- 使用 TextBlob 提取名词短语
- 如何计算词-词共现矩阵
- 使用 TextBlob 进行情感分析
- 使用 Goslate 进行语言翻译
- 使用 TextBlob 进行语言检测和翻译
- 使用 TextBlob 获取定义和同义词
- 使用 TextBlob 获取反义词列表
# pip install PyPDF2安装 PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader
# Creating a pdf file object.
pdf = open("test.pdf", "rb")
# Creating pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)
# Checking total number of pages in a pdf file.
print("Total number of Pages:", pdf_reader.numPages)
# Creating a page object.
page = pdf_reader.getPage(200)
# Extract data from a specific page number.
print(page.extractText())
# Closing the object.
pdf.close()
2提取 Word 内容
# pip install python-docx安装 python-docximport docx
def main():
try:
doc = docx.Document('test.docx')# Creating word reader object.
datahttps://www.it610.com/article/= ""
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
data = 'https://www.it610.com/n'.join(fullText)
print(data)
except IOError:
print('There was an error opening the file!')
return
if __name__ == '__main__':
main()
3提取 Web 网页内容
# pip install bs4安装 bs4from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
# Parsing
soup = BeautifulSoup(webpage, 'html.parser')
# Formating the parsed html file
strhtm = soup.prettify()
# Print first 500 lines
print(strhtm[:500])
# Extract meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property':'og:description'}))
# Extract anchor tag value
for x in soup.find_all('a'):
print(x.string)
# Extract Paragraph tag value
for x in soup.find_all('p'):
print(x.text)
4读取 Json 数据
import requests
import jsonr = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()# Extract specific node content.
print(res['quiz']['sport'])# Dump data as string
data = https://www.it610.com/article/json.dumps(res)
print(data)
5读取 CSV 数据
import csvwith open('test.csv','r') as csv_file:
reader =csv.reader(csv_file)
next(reader) # Skip first row
for row in reader:
print(row)
6删除字符串中的标点符号
import re
import string
datahttps://www.it610.com/article/= "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"
# Methood 1 : Regex
# Remove the special charaters from the read string.
no_specials_string = re.sub('[!#?,.:";
]', '', data)
print(no_specials_string)
# Methood 2 : translate()
# Rake translator object
translator = str.maketrans('', '', string.punctuation)
data = https://www.it610.com/article/data.translate(translator)
print(data)
7使用 NLTK 删除停用词
from nltk.corpus import stopwords
data = https://www.it610.com/article/['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']
# Remove stop words
stopwords = set(stopwords.words('english'))
output = []
for sentence in data:
temp_list = []
for word in sentence.split():
if word.lower() not in stopwords:
temp_list.append(word)
output.append(' '.join(temp_list))
print(output)
8使用 TextBlob 更正拼写
from textblob import TextBlobdatahttps://www.it610.com/article/= "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."output = TextBlob(data).correct()
print(output)
9使用 NLTK 和 TextBlob 的词标记化
import nltk
from textblob import TextBlobdatahttps://www.it610.com/article/= "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).wordsprint(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
10使用 NLTK 提取句子单词或短语的词干列表
from nltk.stem import PorterStemmer
st = PorterStemmer()
text = ['Where did he learn to dance like that?',
'His eyes were dancing with humor.',
'She shook her head and danced away',
'Alex was an excellent dancer.']
output = []
for sentence in text:
output.append(" ".join([st.stem(i) for i in sentence.split()]))
for item in output:
print(item)
print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
11使用 NLTK 进行句子或短语词形还原
from nltk.stem import WordNetLemmatizerwnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
'Her car was in full view.',
'A number of cars carried out of state license plates.']output = []
for sentence in text:
output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))for item in output:
print(item)print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
12使用 NLTK 从文本文件中查找每个单词的频率
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
print("%s: %s" % (key, filter_words[key]))
data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
Output:
[nltk_data] Downloading package webtext to
[nltk_data]C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data]Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
13从语料库中创建词云
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt
nltk.download('webtext')
wt_words = webtext.words('testing.txt')# Sample data
data_analysis = nltk.FreqDist(wt_words)
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
wcloud = WordCloud().generate_from_frequencies(filter_words)
# Plotting the wordcloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
(-0.5, 399.5, 199.5, -0.5)
plt.show()
14NLTK 词法散布图
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt
words = ['data', 'science', 'dataset']
nltk.download('webtext')
wt_words = webtext.words('testing.txt')# Sample data
points = [(x, y) for x in range(len(wt_words))
for y in range(len(words)) if wt_words[x] == words[y]]
if points:
x, y = zip(*points)
else:
x = y = ()
plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
15使用 countvectorizer 将文本转换为数字
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})
# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])
# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
index=vectorizer.get_feature_names())
# Change column headers
df2.columns = df1.columns
print(df2)
Output:
GoJavaPython
and222
application010
are101
bytecode010
can010
code010
comes101
compiled010
derived010
develops010
for020
from010
functional101
imperative101
...
16使用 TF-IDF 创建文档术语矩阵
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
index=vectorizer.get_feature_names())# Change column headers
df2.columns = df1.columns
print(df2)
Output:
GoJavaPython
and0.3237510.1375530.323751
application0.0000000.1164490.000000
are0.2084440.0000000.208444
bytecode0.0000000.1164490.000000
can0.0000000.1164490.000000
code0.0000000.1164490.000000
comes0.2084440.0000000.208444
compiled0.0000000.1164490.000000
derived0.0000000.1164490.000000
develops0.0000000.1164490.000000
for0.0000000.2328980.000000
...
17为给定句子生成 N-gram NLTK
import nltk
from nltk.util import ngrams# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
n_grams = ngrams(nltk.word_tokenize(data), num)
return [ ' '.join(grams) for grams in n_grams]data = 'https://www.it610.com/article/A class is a blueprint for the object.'print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
TextBlob
from textblob import TextBlob
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
n_grams = TextBlob(data).ngrams(num)
return [ ' '.join(grams) for grams in n_grams]
data = 'https://www.it610.com/article/A class is a blueprint for the object.'
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Output:
1-gram:['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
18使用带有二元组的 sklearn CountVectorize 词汇规范
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"
df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})
# Initialize
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])
# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
index=vectorizer.get_feature_names())
# Change column headers
df2.columns = df1.columns
print(df2)
Output:
AssemblyMachine
also either01
and or01
are also01
are readable10
are still10
assembly language50
because each10
but difficult01
by computers01
by people01
can execute01
...
19使用 TextBlob 提取名词短语
from textblob import TextBlob#Extract noun
blob = TextBlob("Canada is a country in the northern part of North America.")for nouns in blob.noun_phrases:
print(nouns)
Output:
canada
northern part
america
20如何计算词-词共现矩阵
import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd
def generate_co_occurrence_matrix(corpus):
vocab = set(corpus)
vocab = list(vocab)
vocab_index = {word: i for i, word in enumerate(vocab)}
# Create bigrams from all words in corpus
bi_grams = list(bigrams(corpus))
# Frequency distribution of bigrams ((word1, word2), num_occurrences)
bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
# Initialise co-occurrence matrix
# co_occurrence_matrix[current][previous]
co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
# Loop through the bigrams taking the current and previous word,
# and the number of occurrences of the bigram.
for bigram in bigram_freq:
current = bigram[0][1]
previous = bigram[0][0]
count = bigram[1]
pos_current = vocab_index[current]
pos_previous = vocab_index[previous]
co_occurrence_matrix[pos_current][pos_previous] = count
co_occurrence_matrix = np.matrix(co_occurrence_matrix)
# return the matrix and the index
return co_occurrence_matrix, vocab_index
text_data = https://www.it610.com/article/[['Where', 'Python', 'is', 'used'],
['What', 'is', 'Python' 'used', 'in'],
['Why', 'Python', 'is', 'best'],
['What', 'companies', 'use', 'Python']]
# Create one list using many lists
data = https://www.it610.com/article/list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)
data_matrix = pd.DataFrame(matrix, index=vocab_index,
columns=vocab_index)
print(data_matrix)
Output:
bestuseWhatWhere...inisPythonused
best0.00.00.00.0...0.00.00.01.0
use0.00.00.00.0...0.01.00.00.0
What1.00.00.00.0...0.00.00.00.0
Where0.00.00.00.0...0.00.00.00.0
Pythonused0.00.00.00.0...0.00.00.01.0
Why0.00.00.00.0...0.00.00.01.0
companies0.01.00.01.0...1.00.00.00.0
in0.00.00.00.0...0.00.01.00.0
is0.00.01.00.0...0.00.00.00.0
Python0.00.00.00.0...0.00.00.00.0
used0.00.01.00.0...0.00.00.00.0
[11 rows x 11 columns]
21使用 TextBlob 进行情感分析
from textblob import TextBlobdef sentiment(polarity):
if blob.sentiment.polarity < 0:
print("Negative")
elif blob.sentiment.polarity > 0:
print("Positive")
else:
print("Neutral")blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
22使用 Goslate 进行语言翻译
import goslatetext = "Comment vas-tu?"gs = goslate.Goslate()translatedText = gs.translate(text, 'en')
print(translatedText)translatedText = gs.translate(text, 'zh')
print(translatedText)translatedText = gs.translate(text, 'de')
print(translatedText)
23使用 TextBlob 进行语言检测和翻译
from textblob import TextBlob
blob = TextBlob("Comment vas-tu?")
print(blob.detect_language())
print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
Output:
fr
?Como estas tu?
How are you?
你好吗?
24使用 TextBlob 获取定义和同义词
from textblob import TextBlob
from textblob import Word
text_word = Word('safe')
print(text_word.definitions)
synonyms = set()
for synset in text_word.synsets:
for lemma in synset.lemmas():
synonyms.add(lemma.name())print(synonyms)
Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
25使用 TextBlob 获取反义词列表
from textblob import TextBlob
from textblob import Wordtext_word = Word('safe')antonyms = set()
for synset in text_word.synsets:
for lemma in synset.lemmas():
if lemma.antonyms():
antonyms.add(lemma.antonyms()[0].name())print(antonyms)
Output:
{'dangerous', 'out'}

文章图片
往
期
回
顾
资讯
谷歌使出禁用2G大招
技术
干货满满的python实战项目!
技术
Python?写了一个网页版的P图软件
技术
11款可替代top命令的工具!

文章图片
分享

文章图片
点收藏

文章图片
点点赞

文章图片
点在看
推荐阅读
- 大数据|python自然语言处理库_8个出色的Python库用于自然语言处理
- Python库之自然语言处理和文本挖掘
- 报错|安装PyTorch后jupyter notebook中仍出现“No module named torch“
- Python数据可视化|python 数据可视化01
- python|OpenCV-Python实战(21)——OpenCV人脸检测项目在Web端的部署
- vue|vue基础语法
- python|实现协程的方式及协程的意义 【笔记】
- python|游标的简单使用及案例【笔记二】
- 关于Python中格式化符号的基本使用方法说明