Natural Language Processing with Python is an introductory book that combines natural language processing with Python. A second edition, targeting Python 3 and revised in many places, is in the works and expected to be finished in 2016. Link to the original book: Natural Language Processing with Python
Installation and corpus download will not be covered here; let's get straight to the hands-on part:
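(For reference, a minimal setup using the standard NLTK downloader looks roughly like this:)
# Install the package first, e.g. with: pip install nltk
import nltk

# Fetch the corpora and data used by the book; 'book' is the
# collection id NLTK defines for exactly this purpose.
nltk.download('book')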
1. Searching Text
1.1 Use a text's concordance method to find every occurrence of a word. First run from nltk.book import *. Example:
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>
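concordance also accepts optional arguments to control the output; for instance, to limit the window width and the number of matches shown:
# Show at most 5 matches, each in an 80-character window.
text1.concordance("monstrous", width=80, lines=5)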
1.2 Words that occur in similar contexts tend to have similar meanings; use a text's similar method to find the words that are distributionally most like a given word. Example:
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>
1.3 To examine the contexts that two or more words share within the same text, use the common_contexts method. Example:
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>
1.4 The dispersion_plot method draws a lexical dispersion plot showing where each word occurs across the text. Example:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>
[Figure: lexical dispersion plot of the five words across text4]
1.5 To generate random text, use the generate method. It does not currently work under Python 3; an alternative may exist or appear later.
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>
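Until generate() works again, one rough substitute (not the book's method, just a sketch; the random_text helper below is made up for illustration) is to sample words from a bigram table built over the text:
import random
from collections import defaultdict
from nltk import bigrams
from nltk.book import text3

# For every word, collect the words observed directly after it.
successors = defaultdict(list)
for w1, w2 in bigrams(text3):
    successors[w1].append(w2)

def random_text(start, length=20):
    # Walk the bigram table, picking a random successor at each step.
    word, output = start, [start]
    for _ in range(length - 1):
        if not successors[word]:   # no known successor: stop early
            break
        word = random.choice(successors[word])
        output.append(word)
    return ' '.join(output)

print(random_text('In'))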
2. Counting Vocabulary
2.1 Use len to compute the length of a text. This is not a word count in the everyday sense, since punctuation marks are counted along with words; the more accurate unit is the "token" (词例 in the Chinese translation).
>>> len(text3)
44764
>>>
2.2 Use set to obtain the vocabulary of a text and sorted to sort it. Wrapping the whole expression in len counts how many distinct words the text uses; NLP calls these word "types" (词型), as opposed to tokens.
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>
2.3 Dividing len(set()) by len() measures lexical richness. The example below shows that the distinct words of text3 amount to only about 6% of all its words; put the other way around, each word is used about 16 times on average.
>>> len(set(text3)) / len(text3)
0.06230453042623537
>>>
2.4 Use the count method to find how many times a given word occurs.
>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
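Later in the chapter the book wraps these two computations into small reusable functions, along these lines:
def lexical_diversity(text):
    # Ratio of distinct word types to total tokens.
    return len(set(text)) / len(text)

def percentage(count, total):
    # Express count as a percentage of total.
    return 100 * count / total

lexical_diversity(text3)                   # ~0.0623, as computed above
percentage(text4.count('a'), len(text4))   # ~1.46, as computed above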
3. Frequency Distributions
3.1 Use FreqDist to get the frequency distribution of all tokens in a text, then most_common(50) for the 50 most frequent ones. You can also index the distribution directly with a word to get its count, which seems to overlap with the count method. Note: to use FreqDist you first need
from nltk.probability import *
The book has not been updated on this point; following the book as written, there was no way to call this function.
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>
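FreqDist offers more than raw counting; for example, hapaxes() returns the words that occur only once (the hapax legomena):
# The first few words that appear exactly once in text1.
fdist1.hapaxes()[:10]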
3.2 Use the plot method to draw the frequency distribution, or a cumulative version of it. For example:
fdist1.plot(50, cumulative=True)
[Figure: cumulative frequency plot of the 50 most common tokens in text1]
3.3 Getting the frequency distribution of word lengths
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
8: 9966, 9: 6428, 10: 3528, ...})
>>>
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>
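freq(3) above is simply the count of the sample 3 divided by the total number of samples, which FreqDist exposes as N(); a quick sanity check:
fdist.N()              # total number of samples, equal to len(text1)
fdist[3] / fdist.N()   # same value that fdist.freq(3) returned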
4. Fine-Grained Selection of Words
4.1 Getting the words longer than 15 characters
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>
4.2 Getting the words longer than 7 characters that occur more than 7 times
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>
5. Collocations and Bigrams
5.1 The bigrams function, wrapped in list, yields the sequence of bigrams of a word list. Without list, the result is a generator rather than a list (see the note after the example).
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
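Calling bigrams() without list() just returns a generator object (the address will differ on your machine):
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
<generator object bigrams at 0x...>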
5.2 Getting the collocations of a text
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>
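In newer NLTK releases (roughly 3.4.5 onward, an assumption about your version) there is also a collocation_list() method that returns the collocations as a list rather than printing them:
# Returns a list instead of printing; exact format varies by version.
text4.collocation_list()[:5]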
The chapter ends with some exercises you can use to try out what you have learned!