This article covers natural language processing (NLP): tokenization and word-frequency statistics with itertools.chain and the NLTK toolkit.
Table of Contents
- 1. itertools.chain(*[ ])
- 2. NLTK tools: conditional frequency distributions, regular expressions, stemmers, and lemmatizers
- 2.1 NLTK sentence segmentation and tokenization
- Sentence Segmentation
- Tokenization
- 2.2 Two common NLTK interfaces: FreqDist and ConditionalFreqDist
- Using FreqDist
- Using ConditionalFreqDist
- 2.3 Regular expressions and their applications
- 2.4 Stemmers and lemmatizers
- Building a concordance index with a stemmer
1. itertools.chain(*[ ])

import itertools

a = itertools.chain(['a', 'aa', 'aaa'])    # one iterable argument: yields its elements unchanged
b = itertools.chain(*['a', 'aa', 'aaa'])   # unpacked: each string is itself iterated character by character
print(list(a))
print(list(b))
Output:
['a', 'aa', 'aaa']
['a', 'a', 'a', 'a', 'a', 'a']
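In NLP pipelines this unpacking form is handy for flattening per-sentence token lists into a single token stream before counting frequencies. A minimal sketch (the sentence data here is made up for illustration):

# flatten a list of token lists into one stream of tokens
sents = [['I', 'love', 'it'], ['What', 'a', 'song']]
tokens = list(itertools.chain(*sents))   # equivalent: itertools.chain.from_iterable(sents)
print(tokens)                            # ['I', 'love', 'it', 'What', 'a', 'song']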
2. NLTK tools: conditional frequency distributions, regular expressions, stemmers, and lemmatizers
2.1 NLTK sentence segmentation and tokenization
nltk.sent_tokenize(text)    # split text into sentences
nltk.word_tokenize(sent)    # split a sentence into word tokens
nltk.pos_tag(tokens)        # POS-tag a sentence; tokens is the output of word_tokenize
nltk.ne_chunk(tags)         # named-entity recognition (NER); tags is the output of pos_tag
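A quick end-to-end sketch chaining those four calls (the sentence is made up, and the exact tags and entity labels in the comments are illustrative; they can vary by NLTK version and model):

import nltk

# may require nltk.download() of 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', and 'words' on first use
sent = "Michael Jackson attended the Grammys in Hawaii."
sents = nltk.sent_tokenize(sent)        # ['Michael Jackson attended the Grammys in Hawaii.']
tokens = nltk.word_tokenize(sents[0])   # ['Michael', 'Jackson', 'attended', ...]
tags = nltk.pos_tag(tokens)             # [('Michael', 'NNP'), ('Jackson', 'NNP'), ...]
tree = nltk.ne_chunk(tags)              # e.g. (PERSON Michael/NNP Jackson/NNP) ... (GPE Hawaii/NNP)
print(tree)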
Sentence Segmentation
import nltk

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
print(sent_tokenizer.tokenize(paragraph))
Output:
['The first time I heard that song was in Hawaii on radio.',
 'I was just a kid, and loved it very much!',
 'What a fantastic song!']
Tokenization
from nltk.tokenize import WordPunctTokenizer

sentence = ("Are you old enough to remember Michael Jackson attending "
            "the Grammys with Brooke Shields and Webster sat on his lap during the show?")
print(WordPunctTokenizer().tokenize(sentence))
Output:
['Are', 'you', 'old', 'enough', 'to', 'remember', 'Michael', 'Jackson', 'attending',
 'the', 'Grammys', 'with', 'Brooke', 'Shields', 'and', 'Webster', 'sat', 'on', 'his',
 'lap', 'during', 'the', 'show', '?']
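WordPunctTokenizer splits strictly at every punctuation boundary, which differs from nltk.word_tokenize on contractions. A small comparison sketch (exact behavior may differ slightly across NLTK versions):

import nltk
from nltk.tokenize import WordPunctTokenizer

s = "Don't you love it?"
print(WordPunctTokenizer().tokenize(s))   # ['Don', "'", 't', 'you', 'love', 'it', '?']
print(nltk.word_tokenize(s))              # ['Do', "n't", 'you', 'love', 'it', '?']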
----------------------------------------------------
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r"""(?x)             # set flag to allow verbose regexps
      (?:[A-Z]\.)+             # abbreviations, e.g. U.S.A.
    | \d+(?:\.\d+)?%?          # numbers, incl. decimals and percentages
    | \w+(?:-\w+)*             # words with optional internal hyphens
    | \.\.\.                   # ellipsis
    | [][.,;"'?():-_`]         # these are separate tokens; includes ], [
"""
nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '...']
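Note that the '$' is silently dropped: regexp_tokenize only returns text that some branch of the pattern matches, and no branch above matches '$'. The fuller pattern in the NLTK book prefixes the number branch with \$? so the currency symbol stays attached to the amount:

pattern2 = r"""(?x)
      (?:[A-Z]\.)+
    | \$?\d+(?:\.\d+)?%?       # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*
    | \.\.\.
    | [][.,;"'?():-_`]
"""
nltk.regexp_tokenize(text, pattern2)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']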
2.2 Two common NLTK interfaces: FreqDist and ConditionalFreqDist

Using FreqDist

from nltk import *
import matplotlib.pyplot as plt

tem = ['hello', 'world', 'hello', 'dear']
print(FreqDist(tem))
Output:
FreqDist({'dear': 1, 'hello': 2, 'world': 1})
The distribution can be visualized with plot(TopK, cumulative=True) and printed as a table with tabulate(), as sketched below.
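A minimal sketch of both calls on a larger distribution (the Brown 'news' category is just an illustrative corpus choice):

import nltk

fd = nltk.FreqDist(w.lower() for w in nltk.corpus.brown.words(categories='news'))
fd.tabulate(10)                 # table of the 10 most frequent words
fd.plot(20, cumulative=True)    # cumulative plot of the top 20 (requires matplotlib)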
Using ConditionalFreqDist

ConditionalFreqDist takes a list of pairs as input: each event must be associated with a condition, so each input element is a (condition, event) tuple.
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist((genre, word)
                               for genre in brown.categories()
                               for word in brown.words(categories=genre))
print("conditions are:", cfd.conditions())   # list the conditions
print(cfd['news'])
print(cfd['news']['could'])                  # dict-like lookup
Output:
conditions are: ['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery',
'news', 'religion', 'reviews', 'romance', 'science_fiction']
<FreqDist with 14394 samples and 100554 outcomes>
86
"""
尤其对于plot() 和 tabulate() 有了更多参数选择:
conditions:指定条件
samples:迭代器类型,指定取值范围
cumulative:设置为True可以查看累积值
"""
cfd.tabulate(conditions=['news', 'romance'], samples=['could', 'can'])
cfd.tabulate(conditions=['news', 'romance'], samples=['could', 'can'], cumulative=True)
Output:
          could   can
   news      86    93
romance     193    74

          could   can
   news      86   179
romance     193   267
2.3 Regular expressions and their applications

Predictive text (9-key T9 input): find the words a given key sequence could produce.
import re
from nltk.corpus import words

# find words matching the T9 key sequence 4653, such as 'hole' and 'golf'
wordlist = [w for w in words.words('en-basic') if w.islower()]
same = [w for w in wordlist if re.search(r'^[ghi][mno][jlk][def]$', w)]
print(same)
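With the full English word list ('en'), the NLTK book reports ['gold', 'golf', 'hold', 'hole'] for this search; the much smaller en-basic list used above may return only a subset of those words.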
Finding character blocks: find all sequences of two or more vowels and determine their relative frequency.
import re
import nltk

wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()
Finding word stems: comparing 'apples' with 'apple', 'apple' is the stem. A simple script to strip common suffixes:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return None
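A few illustrative calls (outputs shown as comments; note that 'es' is tested before 's', and a word with no listed suffix returns None):

print(stem('processing'))   # 'process'
print(stem('processes'))    # 'process'
print(stem('language'))     # None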
Or, using a regular expression, a single line suffices:

re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word)
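The non-greedy (.*?) keeps the stem as short as possible while still letting the suffix group reach the end of the word. For example:

import re

print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
# [('process', 'ing')]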
2.4 Stemmers and lemmatizers

NLTK provides two stemmers, PorterStemmer and LancasterStemmer. Porter is generally the better choice; for example, it correctly handles words like 'lying' (mapping it to 'lie').
porter = nltk.PorterStemmer()
print(porter.stem('lying'))
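For comparison, a LancasterStemmer sketch on the same word (Lancaster leaves 'lying' unchanged, which is why Porter is recommended here):

lancaster = nltk.LancasterStemmer()
print(porter.stem('lying'))      # lie
print(lancaster.stem('lying'))   # lying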
---------------------------------------
Lemmatizer: WordNetLemmatizer

wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('women'))
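lemmatize() treats its input as a noun by default; pass a pos argument for other parts of speech. A short sketch:

print(wnl.lemmatize('women'))            # woman
print(wnl.lemmatize('lying', pos='v'))   # lie  (verb lemma, cf. the stemmer above)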
Building a concordance index with a stemmer

Indexing a text by stem (for a concordance view) uses the nltk.Index function:

nltk.Index((word, i) for (i, word) in enumerate(['a', 'b', 'a']))
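nltk.Index behaves like a defaultdict(list) keyed by the first element of each pair, so the snippet above maps each word to the positions where it occurs:

import nltk

idx = nltk.Index((word, i) for (i, word) in enumerate(['a', 'b', 'a']))
print(idx['a'])   # [0, 2]
print(idx['b'])   # [1]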
class IndexText:
    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # map each stem to the positions where it occurs in the text
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4                     # words of context on each side
        for i in self._index[key]:
            lcontext = ' '.join(self._text[max(0, i - wc):i])
            rcontext = ' '.join(self._text[i:i + wc])
            ldisplay = '%*s' % (width, lcontext[-width:])    # right-aligned left context
            rdisplay = '%-*s' % (width, rcontext[:width])    # left-aligned right context
            print(ldisplay, rdisplay)
porter = nltk.PorterStemmer()   # stem each word before indexing
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexText(porter, grail)
text.concordance('lie')