Natural Language Processing with Python is an introductory book that combines natural language processing with Python. A second edition, targeting Python 3 and revised in many places, is in the works and expected to be finished in 2016. Link to the original book: Natural Language Processing with Python
Installation and corpus download will not be covered here; let's get straight to the hands-on part:
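(For reference, a minimal setup using the standard NLTK downloader looks roughly like this:)
# Install the package first, e.g. with: pip install nltk
import nltk

# Fetch the corpora and data used by the book; 'book' is the
# collection id NLTK defines for exactly this purpose.
nltk.download('book')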
1. Searching Text
1.1 Use a text's concordance method to find every occurrence of a word. First run from nltk.book import *. Example:
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>
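concordance also accepts optional arguments to control the output; for instance, to limit the window width and the number of matches shown:
# Show at most 5 matches, each in an 80-character window.
text1.concordance("monstrous", width=80, lines=5)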
1.2 Words that occur in similar contexts tend to have similar meanings; use a text's similar method to find the words that are distributionally most like a given word. Example:
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>
1.3 To examine the contexts that two or more words share within the same text, use the common_contexts method. Example:
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>
1.4 The dispersion_plot method draws a lexical dispersion plot showing where each word occurs across the text. Example:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>
[Figure: lexical dispersion plot of the five words across text4]
1.5 To generate random text, use the generate method. It does not currently work under Python 3; an alternative may exist or appear later.
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>
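Until generate() works again, one rough substitute (not the book's method, just a sketch; the random_text helper below is made up for illustration) is to sample words from a bigram table built over the text:
import random
from collections import defaultdict
from nltk import bigrams
from nltk.book import text3

# For every word, collect the words observed directly after it.
successors = defaultdict(list)
for w1, w2 in bigrams(text3):
    successors[w1].append(w2)

def random_text(start, length=20):
    # Walk the bigram table, picking a random successor at each step.
    word, output = start, [start]
    for _ in range(length - 1):
        if not successors[word]:   # no known successor: stop early
            break
        word = random.choice(successors[word])
        output.append(word)
    return ' '.join(output)

print(random_text('In'))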
2. Counting Vocabulary
2.1 Use len to compute the length of a text. This is not a word count in the everyday sense, since punctuation marks are counted along with words; the more accurate unit is the "token" (词例 in the Chinese translation).
>>> len(text3)
44764
>>>
2.2 Use set to obtain the vocabulary of a text and sorted to sort it. Wrapping the whole expression in len counts how many distinct words the text uses; NLP calls these word "types" (词型), as opposed to tokens.
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>
2.3 Dividing len(set()) by len() measures lexical richness. The example below shows that the distinct words of text3 amount to only about 6% of all its words; put the other way around, each word is used about 16 times on average.
>>> len(set(text3)) / len(text3)
0.06230453042623537
>>>
2.4 Use the count method to find how many times a given word occurs.
>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
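Later in the chapter the book wraps these two computations into small reusable functions, along these lines:
def lexical_diversity(text):
    # Ratio of distinct word types to total tokens.
    return len(set(text)) / len(text)

def percentage(count, total):
    # Express count as a percentage of total.
    return 100 * count / total

lexical_diversity(text3)                   # ~0.0623, as computed above
percentage(text4.count('a'), len(text4))   # ~1.46, as computed above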
3. Frequency Distributions
3.1 Use FreqDist to get the frequency distribution of all tokens in a text, then most_common(50) for the 50 most frequent ones. You can also index the distribution directly with a word to get its count, which seems to overlap with the count method. Note: to use FreqDist you first need
from nltk.probability import *
The book has not been updated on this point; following the book as written, there was no way to call this function.
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>
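FreqDist offers more than raw counting; for example, hapaxes() returns the words that occur only once (the hapax legomena):
# The first few words that appear exactly once in text1.
fdist1.hapaxes()[:10]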
3.2 Use the plot method to draw the frequency distribution, or a cumulative version of it. For example:
fdist1.plot(50, cumulative=True)
[Figure: cumulative frequency plot of the 50 most common tokens in text1]
3.3 Getting the frequency distribution of word lengths
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
8: 9966, 9: 6428, 10: 3528, ...})
>>>
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>
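freq(3) above is simply the count of the sample 3 divided by the total number of samples, which FreqDist exposes as N(); a quick sanity check:
fdist.N()              # total number of samples, equal to len(text1)
fdist[3] / fdist.N()   # same value that fdist.freq(3) returned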
4. Fine-Grained Selection of Words
4.1 Getting the words longer than 15 characters
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>
4.2 Getting the words longer than 7 characters that occur more than 7 times
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>
5. Collocations and Bigrams
5.1 The bigrams function, wrapped in list, yields the sequence of bigrams of a word list. Without list, the result is a generator rather than a list (see the note after the example).
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
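Calling bigrams() without list() just returns a generator object (the address will differ on your machine):
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
<generator object bigrams at 0x...>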
5.2 Getting the collocations of a text
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>
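In newer NLTK releases (roughly 3.4.5 onward, an assumption about your version) there is also a collocation_list() method that returns the collocations as a list rather than printing them:
# Returns a list instead of printing; exact format varies by version.
text4.collocation_list()[:5]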
The chapter ends with some exercises you can use to try out what you have learned!