Stemming and Lemmatization in Python

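Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing that are used to prepare text, words, and documents for further processing. Both have been studied, and algorithms for them developed, in computer science since the 1960s. In this tutorial you will learn about stemming and lemmatization in a practical way: the background, some well-known algorithms, applications of stemming and lemmatization, and how to stem and lemmatize words, sentences, and documents using the Python nltk package, the Natural Language Toolkit for natural language processing tasks in Python.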

Background
Stemming with the Python nltk package
- What is the Python nltk package?
Lemmatization with the Python nltk package
Applications of stemming and lemmatization

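Background
The languages we speak and write are made up of words that are often derived from one another. A language in which words take different derived forms as their use in speech changes is called an inflected language.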

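"In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change" [Wikipedia]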

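The degree of inflection may be higher or lower from one language to another. From the grammatical definition of inflection above, you can see that the inflected forms of a word share a common root form. Let's look at a few examples: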

[Figure: examples of inflected words and their common root forms]
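The examples above should help you understand the concept of text normalization, although text normalization is not restricted to written documents; it applies to speech as well. Stemming and lemmatization help us obtain the root forms (sometimes called synonyms in a search context) of inflected (derived) words. Stemming differs from lemmatization both in the approach it uses to produce root forms and in the word it produces.
Stemming and lemmatization are widely used in tagging systems, indexing, SEO, web search results, and information retrieval. For example, searching for fish on Google will also return fishes and fishing, because fish is the stem of both words. Later in this tutorial you will go through some of the significant uses of stemming and lemmatization in applications.
Stemming with the Python nltk package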

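"Stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language."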

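The stem (root) is the part of a word to which inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis- are added. Stemming a word or sentence may therefore produce words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.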

Info: removing suffixes from a word is called suffix stripping.
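To make suffix stripping concrete, here is a minimal sketch in plain Python. It is not the Porter algorithm or any NLTK stemmer; the suffix list, the minimum stem length and the name `strip_suffix` are assumptions chosen purely for illustration.

```python
# Toy suffix stripper: only the bare idea, not a real stemming algorithm.
# SUFFIXES and MIN_STEM_LENGTH are illustrative assumptions.
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order
MIN_STEM_LENGTH = 3


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for w in ["cats", "troubling", "troubled", "is"]:
    print(w, "->", strip_suffix(w))
```

Even this crude version produces stems such as "troubl" that are not real words, which is the behaviour described above.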

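What is the Python nltk package?
The Natural Language Toolkit (NLTK) is a Python library for building programs that work with natural language. It provides a user-friendly interface to over 50 corpora and lexical resources, such as the WordNet word repository. The library can perform operations such as tokenizing, stemming, classification, parsing, tagging, and semantic reasoning.
At the time of writing, the latest version was NLTK 3.3. It can be used by students, researchers, and industry practitioners. It is a free, open-source library, available for Windows, Mac OS, and Linux.
Installing Python nltk
That release of NLTK requires Python 2.7, 3.4, 3.5, or 3.6. If nltk is not already part of your Python installation, you can install it with the pip installer. To test the installation: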

Open your Python IDE or CLI (whichever you normally use).
Type import nltk and press Enter; if no message about a missing nltk appears, nltk is installed on your computer.

Mac OS / Linux: run sudo pip install -U nltk to install nltk on Mac or Linux.
Windows: run pip install nltk in cmd.exe to install nltk on Windows.
Once it is installed, you can use the nltk library for stemming and lemmatization in Python.
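The installation check can also be scripted. The sketch below uses only the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption for illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    print("nltk is available")
else:
    print("nltk is missing; install it with pip first")
```

Unlike a bare import, this does not raise an exception when the package is absent, so it can sit at the top of a setup script.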
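After installation, nltk also offers test datasets to work with in natural language processing. You can download them with the following commands in Python: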

# import the nltk package
import nltk
# call the nltk downloader
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True

nltk.download() opens a graphical window in which you can select the corpora and datasets to download.

[Figure: the NLTK downloader window]
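Click the Models tab, select punkt, and click Download. You will need this model later in the tutorial.
Stemming algorithms and code
The nltk package provides both English and non-English stemmers.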

A computer program or subroutine that stems words may be called a stemming program, a stemming algorithm, or a stemmer.

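This tutorial covers the stemmers that Python nltk offers for different languages. For English you can choose between PorterStemmer and LancasterStemmer. PorterStemmer is the older of the two, originally developed in 1979. LancasterStemmer was developed in 1990 and uses a more aggressive approach than the Porter stemming algorithm. Let's try PorterStemmer on a few words to see how it stems them. This tutorial will not go into the details of the Porter and Lancaster (also known as Paice-Husk) algorithms, but you will see their strengths and weaknesses.
Stemming words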

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

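nltk.stem is a package that performs stemming through different classes. PorterStemmer and LancasterStemmer are two of those classes, so we import them with the lines above.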

# create an object of each stemmer class
porter = PorterStemmer()
lancaster = LancasterStemmer()

# provide words to be stemmed
print("Porter Stemmer")
print(porter.stem("cats"))
print(porter.stem("trouble"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))
print("Lancaster Stemmer")
print(lancaster.stem("cats"))
print(lancaster.stem("trouble"))
print(lancaster.stem("troubling"))
print(lancaster.stem("troubled"))

Porter Stemmer
cat
troubl
troubl
troubl
Lancaster Stemmer
cat
troubl
troubl
troubl

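PorterStemmer uses suffix stripping to produce stems. Notice how PorterStemmer finds the root (stem) of the word "cats" by simply removing the "s" after cat, the suffix that was added to make it plural. But "trouble", "troubling" and "troubled" are all stemmed to "troubl", because **the PorterStemmer algorithm does not follow linguistics; instead it applies a set of rules for different cases in five phases (step by step) to generate stems**. This is why PorterStemmer often generates stems that are not actual English words. It does not keep a lookup table of real word stems; it applies algorithmic rules to generate them, and it uses those rules to decide whether stripping a suffix is sensible. A set of rules like this can be written for any language, which is why Python nltk includes SnowballStemmers for creating non-English stemmers.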
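How can a rule "decide" whether stripping a suffix is sensible? In Porter's published algorithm, many rules are guarded by the measure of the remaining stem, roughly the number of vowel-consonant sequences it contains. The sketch below implements only that measure and a generic guarded rule; the function names are assumptions, the input is assumed to be lowercase, and this is not NLTK's implementation.

```python
import re

VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the VC sequences in the stem's collapsed consonant/vowel pattern."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    collapsed = re.sub(r"v+", "V", re.sub(r"c+", "C", pattern))
    return collapsed.count("VC")


def apply_rule(word: str, suffix: str, replacement: str, min_measure: int) -> str:
    """Replace suffix only if the remaining stem has measure > min_measure."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > min_measure:
            return stem + replacement
    return word


print(apply_rule("agreed", "eed", "ee", 0))       # stem "agr" is long enough
print(apply_rule("feed", "eed", "ee", 0))         # stem "f" is too short, unchanged
print(apply_rule("replacement", "ement", "", 1))  # stem "replac" is long enough
```

The guard is what stops a short word like "feed" from being reduced to something unrecognisable while longer words still lose their suffix.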
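So why use it? PorterStemmer is known for its simplicity and speed. It is commonly used in information retrieval (IR) environments for fast recall and fetching of search queries. In a typical IR environment, documents are represented as vectors of words or terms, and words with the same stem are assumed to have a similar meaning. For example:
connect
connections --> connect
connected --> connect
connecting --> connect
connection --> connect
Try this in your Python environment: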

# A list of words to be stemmed
word_list = ["friend", "friendship", "friends", "friendships", "stabil", "destabilize", "misunderstanding", "railroad", "moonlight", "football"]
print("{0:20}{1:20}{2:20}".format("Word", "Porter Stemmer", "Lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word, porter.stem(word), lancaster.stem(word)))

Word                Porter Stemmer      Lancaster Stemmer
friend              friend              friend
friendship          friendship          friend
friends             friend              friend
friendships         friendship          friend
stabil              stabil              stabl
destabilize         destabil            dest
misunderstanding    misunderstand       misunderstand
railroad            railroad            railroad
moonlight           moonlight           moonlight
football            footbal             footbal

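LancasterStemmer (the Paice-Husk stemmer) is an iterative algorithm whose rules are stored externally. A single table holds about 120 rules, indexed by the last letter of a suffix. On each iteration it tries to find an applicable rule using the last character of the word. Each rule specifies either the deletion or the replacement of an ending. If there is no such rule, the algorithm terminates. It also terminates if the word starts with a vowel and only two letters are left, or if the word starts with a consonant and only three characters are left. Otherwise the rule is applied and the process repeats.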
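The control flow just described can be sketched in a few lines. Everything in the rule table below is hypothetical and far smaller than the real Paice-Husk table, and the stopping test is one reading of the length condition above, so treat this as an illustration of the loop rather than a reimplementation of LancasterStemmer.

```python
VOWELS = set("aeiou")

# Hypothetical rule table, indexed by the LAST letter of the word.
# Each rule is (ending_to_match, replacement).
RULES = {
    "s": [("s", "")],
    "g": [("ing", "")],
    "d": [("ed", "")],
    "e": [("ize", "")],
    "p": [("ship", "")],
    "l": [("abil", "")],
}


def is_acceptable(stem: str) -> bool:
    """Keep at least 2 letters if the stem starts with a vowel, else at least 3."""
    if not stem:
        return False
    minimum = 2 if stem[0] in VOWELS else 3
    return len(stem) >= minimum


def iterative_stem(word: str) -> str:
    word = word.lower()
    while word:
        for ending, replacement in RULES.get(word[-1], []):
            candidate = word[: -len(ending)] + replacement
            if word.endswith(ending) and is_acceptable(candidate):
                word = candidate
                break        # a rule was applied: look up rules again
        else:
            return word      # no applicable rule: terminate
    return word


for w in ["cats", "friendships", "destabilize", "is"]:
    print(w, "->", iterative_stem(w))
```

Because rules keep firing until none applies, a word can lose several endings in a row, which is where the aggressive behaviour comes from.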

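LancasterStemmer is simple, but because it iterates it stems heavily, and over-stemming may occur. Over-stemming produces stems that are not linguistically valid or that carry no meaning.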

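For example, in the code above, destabilize is stemmed to dest by LancasterStemmer, whereas PorterStemmer produces destabil. LancasterStemmer yields an even shorter stem than Porter because of its iterations, and over-stemming has occurred.
Stemming sentences
You can use the nltk stemmers to stem sentences and documents. You might try to stem a sentence as follows: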

sentence = "Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."
porter.stem(sentence)

'pythoners are very intelligent and work very pythonly and now they are pythoning their way to success.'

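As you can see, the stemmer treats the entire sentence as a single word, so it returns it essentially unchanged (only lowercased). We need to stem each word in the sentence and return the combined result. To split a sentence into words you can use a tokenizer; the nltk tokenizer does this as shown below. You can wrap the logic in a function, pass it a sentence, and get the stemmed sentence back.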

from nltk.tokenize import word_tokenize

def stemSentence(sentence):
    # split the sentence into word tokens, then stem each token
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

x = stemSentence(sentence)
print(x)

python are veri intellig and work veri pythonli and now they are python their way to success .

Stemming a document
You can write your own function to stem documents. Here is one way to stem a document with Python:

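1. Take a document as the input.
2. Read the document line by line.
3. Tokenize the line.
4. Stem the words.
5. Output the stemmed words (print them on screen or write them to a file).
6. Repeat steps 2 to 5 until the end of the document.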
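The steps above can be sketched as one function that works on any pair of file-like objects. The function name `stem_document`, the whitespace tokenizer, and the use of `str.lower` as a stand-in stemmer in the demo are all assumptions for illustration.

```python
import io
from typing import Callable, TextIO


def stem_document(source: TextIO, target: TextIO, stem: Callable[[str], str]) -> None:
    """Read source line by line, stem every token, and write the result to target."""
    for line in source:                               # read the document line by line
        tokens = line.split()                         # naive whitespace tokenizer
        stemmed = [stem(token) for token in tokens]   # stem the words
        target.write(" ".join(stemmed) + "\n")        # output the stemmed line


# Demo with in-memory "files" and a stand-in stemmer (plain lowercasing).
source = io.StringIO("Data Science is FUN\nStemming documents\n")
target = io.StringIO()
stem_document(source, target, stem=str.lower)
print(target.getvalue())
```

In real use you would pass file objects opened in a `with open(...)` block and a real stemmer's method, for example the `stem` method of an nltk `PorterStemmer` instance; iterating over the file keeps only one line in memory at a time.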

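Let's do some coding! Open a file, any text file. I have a text file named "data-science-wiki.txt" in a folder named "Stemming and Lemmatization" inside the working directory of my Python notebook. If your file is stored in a different directory, you must give the full file path to Python's open() function.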

# a forward slash avoids the invalid "\d" escape in the original path
file = open("Stemming and Lemmatization/data-science-wiki.txt")
file.read()

'Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, [1][2] similar to data mining. \nData science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. \nTuring award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.\nIn 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century", [6] the term "data science" became a buzzword. It is now often used interchangeably with earlier concepts like business analytics, [7] business intelligence, predictive modeling, and statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive, which can cause the term to become "dilute[d] beyond usefulness."While many university programs now offer a data science degree, there exists no consensus on a definition or suitable curriculum contents.To its discredit, however, many data-science and big-data projects fail to deliver useful results, often as a result of poor management and utilization of resources. '

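The .read() method shows you the entire content of the file. The .readlines() method keeps the lines of the file in a Python list. You can then use that list to access each line, and tokenize and stem the line you choose.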

file = open("Stemming and Lemmatization/data-science-wiki.txt")
my_lines_list = file.readlines()
my_lines_list

['Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, [1][2] similar to data mining. \n', 'Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. \n', 'Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.\n', 'Data Science is now often used interchangeably with earlier concepts like business analytics, [7] business intelligence, predictive modeling, and statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive, which can cause the term to become "dilute[d] beyond usefulness."While many university programs now offer a data science degree, there exists no consensus on a definition or suitable curriculum contents.To its discredit, however, many data-science and big-data projects fail to deliver useful results, often as a result of poor management and utilization of resources. ']

Now you can access each line and use the stemSentence() function you created earlier to tokenize and stem it.

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stemSentence(sentence):
    # split the sentence into word tokens, then stem each token
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

print(my_lines_list[0])
print("Stemmed sentence")
x = stemSentence(my_lines_list[0])
print(x)

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, [1][2] similar to data mining.

Stemmed sentence
data scienc is an interdisciplinari field that use scientif method , process , algorithm and system to extract knowledg and insight from data in variou form , both structur and unstructur , [ 1 ] [ 2 ] similar to data mine .

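You can save the stemmed sentences to a text file. One option is to collect all the stemmed sentences in a list first and write the list in one go with Python's writelines() function. The snippet below takes the simpler route of calling write() once for each stemmed line.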

# mode "a+" appends to the file if it already exists
stem_file = open("Stemming and Lemmatization/stem-data-science-wiki.txt", mode="a+", encoding="utf-8")
for line in my_lines_list:
    stem_sentence = stemSentence(line)
    stem_file.write(stem_sentence)
stem_file.close()

The resulting text file looks like this:

[Figure: screenshot of the stemmed text file]
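NLTK corpora and lexical resources
In this part of the tutorial you will learn about the NLTK corpora and how to use them.
The NLTK text corpora are a vast repository of large bodies of text, each called a corpus, that you can use when doing natural language processing (NLP) with Python. Many different types of corpora are available for different kinds of projects, for example a selection of free electronic books, web and chat text, and news documents in various genres. This section shows how to use the corpora available in the Python NLTK package, with code you can reuse in your own projects.
If you have not used NLP with Python before, you may not have any corpora installed on your computer. You can run nltk.download() and use the downloader's Corpora tab to install the corpora you need.
Note: in the NLTK downloader, change the server address by clicking File -> Change Server Address and pasting http://nltk.org/nltk_data/ into the server address text box; otherwise you may not be able to download the corpora.
Here is an example of how to use a corpus and stem that document: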

import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
showing info http://nltk.org/nltk_data/
True

nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

You can use any of the text files above for stemming. Try it out as follows:

text_file = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
my_lines_list = []
for line in text_file:
    my_lines_list.append(line)
my_lines_list

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ', ', 'heart', ', ', 'body', ', ', 'and', 'brain', '; ', 'I', 'see', 'him', 'now', '.', 'He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ', ', 'with', 'a', 'queer', 'handkerchief', ', ', 'mockingly', 'embellished', 'with', 'all', 'the', 'gay', 'flags', 'of', 'all', 'the', 'known', 'nations', 'of', 'the', 'world', '.', 'He', 'loved', 'to', 'dust', 'his', 'old', 'grammars', '; ', 'it', 'somehow', 'mildly', 'reminded', 'him', 'of', 'his', 'mortality', '.', '"', 'While', 'you', 'take', 'in', 'hand', 'to', 'school', 'others', ', ', 'and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-', 'fish', 'is', 'to', 'be', 'called', 'in', 'our', 'tongue', 'leaving', 'out', ', ', 'through', 'ignorance', ', ', 'the', 'letter', 'H', ', ', 'which', 'almost', 'alone', 'maketh', 'the', 'signification', 'of', 'the', 'word', ', ', 'you', 'deliver', 'that', 'which', 'is', 'not', 'true', '."', '--', 'HACKLUYT', '"', 'WHALE', '.', '...', 'Sw', '.', 'and', 'Dan', '.', 'HVAL', '.', 'This', 'animal', 'is', 'named', 'from', 'roundness', 'or', 'rolling', '; ', 'for', 'in', 'Dan', '.', 'HVALT', 'is', 'arched', 'or', 'vaulted', '."', '--', 'WEBSTER', "'", 'S', 'DICTIONARY', '"', 'WHALE', '.', '...', 'It', 'is', 'more', 'immediately', 'from', 'the', 'Dut', '.', 'and', 'Ger', '.', 'WALLEN', '; ', 'A', '.', 'S', '.', 'WALW', '-', 'IAN', ', ', 'to', 'roll', ', ', 'to', 'wallow', '."', '--', 'RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ', ', 'GREEK', '.', 'CETUS', ', ', 'LATIN', '.', 'WHOEL', ', ', 'ANGLO', '-', 'SAXON', '.', 'HVALT', ', ', 'DANISH', '.', 'WAL', ', ', 'DUTCH', '.', 'HWAL', ', ', 'SWEDISH', '.', 'WHALE', ', ', 'ICELANDIC', '.', 'WHALE', ', ', 'ENGLISH', '.', 'BALEINE', ', ', 'FRENCH', '.', 'BALLENA', ', ', 
'SPANISH', '.', 'PEKEE', '-', 'NUEE', '-', 'NUEE', ', ', 'FEGEE', '.', 'PEKEE', '-', 'NUEE', '-', 'NUEE', ', ', 'ERROMANGOAN', '.', 'EXTRACTS', '(', 'Supplied', 'by', 'a', 'Sub', '-', 'Sub', '-', 'Librarian', ').', 'It', 'will', 'be', 'seen', 'that', 'this', 'mere', 'painstaking', 'burrower', 'and', 'grub', '-', 'worm', 'of', 'a', 'poor', 'devil', 'of', 'a', 'Sub', '-', 'Sub', 'appears', 'to', 'have', 'gone', 'through', 'the', 'long', 'Vaticans', 'and', 'street', '-', 'stalls', 'of', 'the', 'earth', ', ', 'picking', 'up', 'whatever', 'random', 'allusions', 'to', 'whales', 'he', 'could', 'anyways', 'find', 'in', 'any', 'book', 'whatsoever', ', ', 'sacred', 'or', 'profane', '.', 'Therefore', 'you', 'must', 'not', ', ', 'in', 'every', 'case', 'at', 'least', ', ', 'take', 'the', 'higgledy', '-', 'piggledy', 'whale', 'statements', ', ', 'however', 'authentic', ', ', 'in', 'these', 'extracts', ', ', 'for', 'veritable', 'gospel', 'cetology', '.', 'Far', 'from', 'it', '.', 'As', 'touching', 'the', 'ancient', 'authors', 'generally', ', ', 'as', 'well', 'as', 'the', 'poets', 'here', 'appearing', ', ', 'these', 'extracts', 'are', 'solely', 'valuable', 'or', 'entertaining', ', ', 'as', 'affording', 'a', 'glancing', 'bird', "'", 's', 'eye', 'view', 'of', 'what', 'has', 'been', 'promiscuously', 'said', ', ', 'thought', ', ', 'fancied', ', ', 'and', 'sung', 'of', 'Leviathan', ', ', 'by', 'many', 'nations', 'and', 'generations', ', ', 'including', 'our', 'own', '.', 'So', 'fare', 'thee', 'well', ', ', 'poor', 'devil', 'of', 'a', 'Sub', '-', 'Sub', ', ', 'whose', 'commentator', 'I', 'am', '.', 'Thou', 'belongest', 'to', 'that', 'hopeless', ', ', 'sallow', 'tribe', 'which', 'no', 'wine', 'of', 'this', 'world', 'will', 'ever', 'warm', '; ', 'and', 'for', 'whom', 'even', 'Pale', 'Sherry', 'would', 'be', 'too', 'rosy', '-', 'strong', '; ', 'but', 'with', 'whom', 'one', 'sometimes', 'loves', 'to', 'sit', ', ', 'and', 'feel', 'poor', '-', 'devilish', ', ', 'too', '; ', 'and', 'grow', 
'convivial', 'upon', 'tears', '; ', 'and', 'say', 'to', 'them', 'bluntly', ', ', 'with', 'full', 'eyes', 'and', 'empty', 'glasses', ', ', 'and', 'in', 'not', 'altogether', 'unpleasant', 'sadness', '--', 'Give', 'it', 'up', ', ', 'Sub', '-', 'Subs', '!', 'For', 'by', 'how', 'much', 'the', 'more', 'pains', 'ye', 'take', 'to', 'please', 'the', 'world', ', ', 'by', 'so', 'much', 'the', 'more', 'shall', 'ye', 'for', 'ever', 'go', 'thankless', '!', 'Would', 'that', 'I', 'could', 'clear', 'out', 'Hampton', 'Court', 'and', 'the', 'Tuileries', 'for', 'ye', '!', 'But', 'gulp', 'down', 'your', 'tears', 'and', 'hie', 'aloft', 'to', 'the', 'royal', '-', 'mast', 'with', 'your', 'hearts', '; ', 'for', 'your', 'friends', 'who', 'have', 'gone', 'before', 'are', 'clearing', 'out', 'the', 'seven', '-', 'storied', 'heavens', ', ', 'and', 'making', 'refugees', 'of', 'long', '-', 'pampered', 'Gabriel', ', ', 'Michael', ', ', 'and', 'Raphael', ', ', 'against', 'your', 'coming', '.', 'Here', 'ye', 'strike', 'but', 'splintered', 'hearts', 'together', '--', 'there', ', ', 'ye', 'shall', 'strike', 'unsplinterable', 'glasses', '!', 'EXTRACTS', '.', '"', 'And', 'God', 'created', 'great', 'whales', '."', '--', 'GENESIS', '.', '"', 'Leviathan', 'maketh', 'a', 'path', 'to', 'shine', 'after', 'him', '; ', 'One', 'would', 'think', 'the', 'deep', 'to', 'be', 'hoary', '."', '--', 'JOB', '.', '"', 'Now', 'the', 'Lord', 'had', 'prepared', 'a', 'great', 'fish', 'to', 'swallow', 'up', 'Jonah', '."', '--', 'JONAH', '.', '"', 'There', 'go', 'the', 'ships', '; ', 'there', 'is', 'that', 'Leviathan', 'whom', 'thou', 'hast', 'made', 'to', 'play', 'therein', '."', '--', 'PSALMS', '.', '"', 'In', 'that', 'day', ', ', 'the', 'Lord', 'with', 'his', 'sore', ', ', 'and', 'great', ', ', 'and', 'strong', 'sword', ', ', 'shall', 'punish', 'Leviathan', 'the', 'piercing', 'serpent', ', ', 'even', 'Leviathan', 'that', 'crooked', 'serpent', '; ', 'and', 'he', 'shall', 'slay', 'the', 'dragon', 'that', 'is', 'in', 'the', 
'sea', '."', '--', 'ISAIAH', '"', 'And', 'what', 'thing', 'soever', 'besides', 'cometh', 'within', 'the', 'chaos', 'of', 'this', 'monster', "'", 's', 'mouth', ', ', 'be', 'it', 'beast', ', ', 'boat', ', ', 'or', 'stone', ', ', 'down', 'it', 'goes', 'all', 'incontinently', 'that', 'foul', 'great', 'swallow', 'of', 'his', ', ', 'and', 'perisheth', 'in', 'the', 'bottomless', 'gulf', 'of', 'his', 'paunch', '."', '--', 'HOLLAND', "'", 'S', 'PLUTARCH', "'", 'S', 'MORALS', '.', '"', 'The', 'Indian', 'Sea', 'breedeth', 'the', 'most', 'and', 'the', 'biggest', 'fishes', 'that', 'are', ':', 'among', 'which', 'the', 'Whales', 'and', 'Whirlpooles', 'called', 'Balaene', ', ', 'take', 'up', 'as', 'much', 'in', 'length', 'as', 'four', 'acres', 'or', 'arpens', 'of', 'land', '."', '--', 'HOLLAND', "'", 'S', 'PLINY', '.', '"', 'Scarcely', 'had', 'we', 'proceeded', 'two', 'days', 'on', 'the', 'sea', ', ', 'when', 'about', 'sunrise', 'a', 'great', 'many', 'Whales', 'and', 'other', 'monsters', 'of', 'the', 'sea', ', ', 'appeared', '.', 'Among', 'the', 'former', ', ', 'one', 'was', 'of', 'a', 'most', 'monstrous', 'size', '.', '...', 'This', 'came', 'towards', 'us', ', ', 'open', '-', 'mouthed', ', ', 'raising', 'the', 'waves', 'on', 'all', 'sides', ', ', 'and', 'beating', 'the', 'sea', 'before', 'him', 'into', 'a', 'foam', '."', '--', 'TOOKE', "'", 'S', 'LUCIAN', '.', '"', 'THE', 'TRUE', 'HISTORY', '."', '"', 'He', 'visited', 'this', 'country', 'also', 'with', 'a', 'view', 'of', 'catching', 'horse', '-', 'whales', ', ', 'which', 'had', 'bones', 'of', 'very', 'great', 'value', 'for', 'their', 'teeth', ', ', 'of', 'which', 'he', 'brought', 'some', 'to', 'the', 'king', '.', '...', 'The', 'best', 'whales', 'were', 'catched', 'in', 'his', 'own', 'country', ', ', 'of', 'which', 'some', 'were', 'forty', '-', 'eight', ', ', 'some', 'fifty', 'yards', 'long', '.', 'He', ...]

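You can read the text and store it in a Python list as above, then use the list for stemming as demonstrated in the previous sections.
Non-English stemmers
Python nltk provides not only two English stemmers, PorterStemmer and LancasterStemmer, but also many non-English stemmers through SnowballStemmers, ISRIStemmer and RSLPStemmer. The SnowballStemmers are based on Snowball, a language for creating stemmers, and you can program a stemmer for your own language with it. Currently the following languages are supported:
Snowball stemmers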

Danish
Dutch
English
French
German
Hungarian
Italian
Norwegian
Porter
Portuguese
Romanian
Russian
Spanish
Swedish
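Rather than relying on a hard-coded list, you can ask your own NLTK installation which Snowball languages it supports. The sketch below assumes nltk is installed; it needs no extra corpus downloads because `ignore_stopwords` is left at its default. The sample words are arbitrary.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages for which this NLTK installation can build a Snowball stemmer.
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
for word in ["having", "running", "generously"]:
    print(word, "->", english.stem(word))
```

The tuple may be longer than the list above, since newer NLTK releases have added languages.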

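ISRIStemmer is an Arabic stemmer, and RSLPStemmer is a stemmer for Portuguese.
In this section you will learn how to use SnowballStemmer; you can then combine it with what you learned in the previous sections to write more complete code.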

from nltk.stem.snowball import SnowballStemmer

englishStemmer = SnowballStemmer("english")
englishStemmer.stem("having")

'have'

You can also tell the stemmer to ignore stop words.

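Stop words: stop words are words that carry too little meaning to be useful in search queries. They are usually filtered out of search queries because they return a vast amount of unnecessary information. Each NLP toolkit ships its own list of stop words. They are mostly words that are very common in the language, such as "as, the, be, are" in English.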
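Filtering stop words is a simple membership test. The sketch below uses a deliberately tiny, hypothetical stop-word set; NLTK's stopwords corpus provides much longer lists for several languages.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "as", "be", "is", "are", "of", "to", "in", "and"}


def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


query = "the best way to learn stemming in Python".split()
print(remove_stop_words(query))
```

Using a set rather than a list keeps each lookup constant-time, which matters once the stop-word list and the documents grow.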

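You also need to download the stopwords corpus with nltk.download(), just as you downloaded the gutenberg corpus above. Otherwise you will get a resource-not-found error.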

englishStemmer2 = SnowballStemmer("english", ignore_stopwords=True)
englishStemmer2.stem("having")

showing info http://nltk.org/nltk_data/
'having'

As you can see, before ignore_stopwords=True was set, "having" was stemmed to "have"; with it set, the stemmer leaves the word alone because it is a stop word.

spanishStemmer = SnowballStemmer("spanish", ignore_stopwords=True)
spanishStemmer.stem("Corriendo")

'corr'

That covers stemming in Python with the NLTK package. In the next section you will learn about lemmatization.
Lemmatization with the Python nltk package

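Unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.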

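For example, runs, running and ran are all forms of the word run, so run is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where valid words are required.
Python NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.
Note: download the WordNet corpus from the NLTK downloader before using the WordNet Lemmatizer.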

import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
sentence_words = nltk.word_tokenize(sentence)
# build a new list instead of removing items from the list while iterating over it
sentence_words = [word for word in sentence_words if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

Word                Lemma
He                  He
was                 wa
running             running
and                 and
eating              eating
at                  at
same                same
time                time
He                  He
has                 ha
bad                 bad
habit               habit
of                  of
swimming            swimming
after               after
playing             playing
long                long
hours               hour
in                  in
the                 the
Sun                 Sun

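Looking at the output above, you may wonder why almost no word was reduced to its actual root form. The reason is that the words were given without context: by default, lemmatize() treats every word as a noun. You need to supply the context in which you want to lemmatize, namely the part of speech (POS). You do this by passing a value for the pos parameter of wordnet_lemmatizer.lemmatize.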

for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos="v")))

He                  He
was                 be
running             run
and                 and
eating              eat
at                  at
same                same
time                time
He                  He
has                 have
bad                 bad
habit               habit
of                  of
swimming            swim
after               after
playing             play
long                long
hours               hours
in                  in
the                 the
Sun                 Sun
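Treating every word as a verb fixes "was" and "running" but leaves "hours" unreduced, so in practice each token usually gets its own POS. A common approach is to run a tagger that emits Penn Treebank tags (nltk.pos_tag does, after a tagger model has been downloaded) and map each tag to one of WordNet's single-letter codes. The mapping helper below is a sketch; its name and the hand-written tags in the demo are assumptions.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged = [("He", "PRP"), ("was", "VBD"), ("running", "VBG"), ("long", "JJ"), ("hours", "NNS")]
for token, tag in tagged:
    print(token, tag, penn_to_wordnet(tag))
```

The returned letter can then be passed as the pos argument, as in wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)).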

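Applications of stemming and lemmatization
Stemming and lemmatization are themselves forms of NLP and are widely used in text mining. Text mining is the process of analyzing text written in natural language and extracting high-quality information from it. It involves looking for interesting patterns in the text or extracting data from the text to insert into a database. Text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (that is, learning relations between named entities). Developers have to prepare text using lexical analysis, POS (part-of-speech) tagging, stemming and other natural language processing techniques in order to extract useful information from it.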
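Information retrieval (IR) environments: when the number of documents grows to mind-boggling sizes, stemming and lemmatization are useful for mapping documents to common topics and for serving search results from an index. Query expansion is a term used in search environments: when a user enters a query, the query is expanded or enhanced so that it matches additional documents.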
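The following sketch shows how stemming lets a query match documents that use a different inflection. It builds a tiny inverted index keyed by stem. The stand-in stemmer `toy_stem`, the helper names and the sample documents are all assumptions for illustration; a real system would plug in a proper stemmer and tokenizer.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Stand-in stemmer: lowercase the word and strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents, stem=toy_stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=toy_stem):
    """Return the ids of documents that match every stemmed query term."""
    results = [index.get(stem(term), set()) for term in query.split()]
    return set.intersection(*results) if results else set()


documents = {
    1: "fishing boats at dawn",
    2: "a fish market",
    3: "mountain hiking trails",
}
index = build_index(documents)
print(search(index, "fishes"))
```

The query "fishes" appears in no document, yet it retrieves documents 1 and 2 because "fishes", "fishing" and "fish" all share the stem "fish".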
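Stemming has been used in query systems such as web search engines, but because of under-stemming and over-stemming its effectiveness in returning correct results has been found to be limited. For example, a person searching for "marketing" may not be happy with results that show "markets" rather than marketing. Stemming may still prove useful in other languages, and different stemming algorithms may yield better output. Google Search adopted word stemming in 2003.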
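Sentiment analysis: sentiment analysis is the analysis of people's reviews and comments about something. It is widely used to analyze products in online retail shops. Stemming and lemmatization are part of the text preparation process before the text is analyzed.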
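Document clustering: document clustering (or text clustering) is the application of cluster analysis to text documents. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. Examples include web document clustering for search engines. Before a clustering method is applied, the documents are prepared through tokenization, stop-word removal, and then stemming and lemmatization, which reduces the number of tokens that carry the same information and so speeds up the whole process. After this preprocessing, features are computed by counting the frequency of every token, and the clustering method is then applied.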
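The preprocessing chain described here (tokenize, drop stop words, stem, count) can be sketched with `collections.Counter`. The stop-word set and the stand-in stemmer `simple_stem` are illustrative assumptions, not what a production clustering pipeline would use.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "are"}  # illustrative list


def simple_stem(word: str) -> str:
    """Stand-in stemmer: strip a plural 's' from words longer than three letters."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def term_frequencies(text: str, stem=simple_stem) -> Counter:
    tokens = text.lower().split()                              # tokenize
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]  # remove stop words
    return Counter(stem(tok) for tok in tokens)                # stem, then count


docs = ["The cats and the dogs", "Dogs chase cats in the park"]
for doc in docs:
    print(term_frequencies(doc))
```

Because "cats" and "cat" collapse to one feature, the resulting frequency vectors are shorter and two documents about the same topic look more alike to the clustering step.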
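Stemming or lemmatization?
After going through this tutorial, you may be asking yourself when to use stemming and when to use lemmatization. The answer lies in what you have learned here. You have seen the following points: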

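Stemming and lemmatization both produce the root form of inflected words. The difference is that a stem may not be an actual word, whereas a lemma is an actual word of the language.
Stemming follows an algorithm with a fixed series of steps performed on the word, which makes it fast. Lemmatization, by contrast, relied on the WordNet corpus to produce the lemma, which makes it slower than stemming. You also have to specify the part of speech to obtain the correct lemma.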

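So when should you use which? The points above show that if speed is the priority, you should use stemming, because lemmatizers consult a corpus, which costs time and processing. Whether to use a stemmer or a lemmatizer depends on the application you are working on. If you are building a language application in which linguistic correctness matters, you should use lemmatization, because it uses a corpus to match root forms.
That brings the tutorial to an end! In this tutorial you learned about NLP, the Python NLTK package, how to use the package, and how to use lexical resources in Python. You learned about stemming and lemmatization, their applications, and how to use them in Python NLP applications. Happy learning!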