
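English Text Processing with spaCy

spaCy is an advanced natural language processing library written in Python and Cython. It builds on recent research and was designed from the start for use in real products. spaCy ships with pretrained statistical models and word vectors, and currently supports tokenization for more than 20 languages. According to the project, it offers the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing, and named entity recognition, and straightforward integration with deep learning.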


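    import spacy

    # 'en' is the shortcut name used by spaCy 2.x. spaCy 3.x removed shortcut
    # names, so load the full package name there (for example 'en_core_web_sm').
    nlp = spacy.load('en')
    doc = nlp('Hello World! My name is HanXiaoyang')
    for token in doc:
        print('"' + token.text + '"')

Output:

    "Hello"
    "World"
    "!"
    "My"
    "name"
    "is"
    "HanXiaoyang"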

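    # The extra spaces after "I'll" are deliberate: they produce the whitespace
    # token (SPACE / _SP) at offset 15 in the output.
    doc = nlp("Next week I'll   be in Shanghai.")
    for token in doc:
        print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
            token.text,
            token.idx,
            token.lemma_,
            token.is_punct,
            token.is_space,
            token.shape_,
            token.pos_,
            token.tag_
        ))

Output (columns: text, character offset, lemma, is_punct, is_space, shape, coarse POS, fine-grained tag; the fifth row is the whitespace token, whose text, lemma, and shape are blank):

    Next      0   next      False  False  Xxxx   ADJ    JJ
    week      5   week      False  False  xxxx   NOUN   NN
    I         10  -PRON-    False  False  X      PRON   PRP
    'll       11  will      False  False  'xx    VERB   MD
              15            False  True          SPACE  _SP
    be        17  be        False  False  xx     VERB   VB
    in        20  in        False  False  xx     ADP    IN
    Shanghai  23  shanghai  False  False  Xxxxx  PROPN  NNP
    .         31  .         True   False  .      PUNCT  .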
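
The shape column in the output above abstracts each token's spelling: uppercase letters become `X`, lowercase letters `x`, digits `d`, and a run of the same class is cut off after four characters (which is why "Shanghai" shows up as `Xxxxx`). The sketch below is a plain-Python approximation of that rule, written only for illustration. `word_shape` is a hypothetical helper, not a call into spaCy.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (illustrative re-implementation)."""
    shape = []
    last = ""   # character class seen on the previous character
    run = 0     # how many times in a row that class has repeated
    for ch in text:
        if ch.isalpha():
            cls = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch  # punctuation and other symbols are kept as-is
        if cls == last:
            run += 1
        else:
            run = 0
            last = cls
        # Keep at most four consecutive characters of the same class.
        if run < 4:
            shape.append(cls)
    return "".join(shape)


for word in ["Next", "week", "'ll", "Shanghai", "."]:
    print(word, word_shape(word))
```

Collapsing long runs means words of very different lengths can share one shape, which is useful as a compact feature for statistical models.
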

    # Sentence segmentation
    doc = nlp("Hello World! My name is HanXiaoyang")
    for sent in doc.sents:
        print(sent)

Output:

    Hello World!
    My name is HanXiaoyang

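Part-of-speech tagging (POS tagging), also called word-class tagging or simply tagging, is the procedure of assigning the correct part of speech to each word in the tokenized text, that is, deciding whether each word is a noun, a verb, an adjective, or some other category. Common approaches include:
  • POS tagging based on maximum entropy models
  • Outputting the statistically most probable tag for each word
  • POS tagging based on hidden Markov models (HMMs)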
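
As a rough illustration of the second approach in the list above, the sketch below tags each word with the tag it received most often in a training corpus. The tiny corpus, the function names, and the default tag for unknown words are all assumptions made up for this example. It is a baseline, not how spaCy tags text.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus of (word, tag) pairs; purely illustrative.
TRAIN = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("the", "DT"), ("bark", "NN"), ("is", "VBZ"), ("loud", "JJ"),
    ("dogs", "NNS"), ("bark", "VBP"), ("bark", "NN"),
]


def train_unigram_tagger(pairs):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word; unseen words fall back to the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_unigram_tagger(TRAIN)
print(tag(["The", "dog", "is", "loud"], model))
```

Because this baseline ignores context, "bark" always gets `NN` here even when it is used as a verb. Context-sensitive models such as HMMs or maximum entropy taggers are meant to address exactly that limitation.
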

    # Part-of-speech tagging
    doc = nlp("Next week I'll be in Shanghai.")
    print([(token.text, token.tag_) for token in doc])

Output:

    [('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Shanghai', 'NNP'), ('.', '.')]

| POS Tag | Description | Example |
|---|---|---|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, three |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'oeuvre |
| IN | preposition or subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun, plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PRP | personal pronoun | I, he, it |
| PRP$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund or present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, non-3rd person singular present | take |
| VBZ | verb, 3rd person singular present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |

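Named entity recognition (NER), also called "proper name recognition", identifies entities in text that carry a specific meaning, mainly person names, place names, organization names, and other proper nouns. It usually consists of two parts: (1) detecting entity boundaries; (2) determining the entity type (person, place, organization, or other).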
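
The two parts described above can be sketched on top of token-level tags. The example assumes the common BIO scheme, where `B-X` opens an entity of type X, `I-X` continues it, and `O` marks tokens outside any entity. Boundaries come from the B/I/O prefix and the type from the suffix. `bio_to_spans` is a hypothetical helper written for this illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def close_span():
        # Emit the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            close_span()
            current_tokens, current_label = [], None
    close_span()
    return spans


tokens = ["Next", "week", "I", "'ll", "be", "in", "Shanghai", "."]
tags = ["B-DATE", "I-DATE", "O", "O", "O", "O", "B-GPE", "O"]
print(bio_to_spans(tokens, tags))
```
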

    doc = nlp("Next week I'll be in Shanghai.")
    for ent in doc.ents:
        print(ent.text, ent.label_)

Output:

    Next week DATE
    Shanghai GPE

    from nltk.chunk import conlltags2tree

    doc = nlp("Next week I'll be in Shanghai.")
    # Build (text, POS tag, IOB entity tag) triples, e.g. ('Next', 'JJ', 'B-DATE')
    iob_tagged = [
        (
            token.text,
            token.tag_,
            "{0}-{1}".format(token.ent_iob_, token.ent_type_)
            if token.ent_iob_ != 'O' else token.ent_iob_
        )
        for token in doc
    ]
    print(iob_tagged)
    # Display in nltk.Tree format
    print(conlltags2tree(iob_tagged))

Output:

    [('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'), ('I', 'PRP', 'O'), ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'), ('Shanghai', 'NNP', 'B-GPE'), ('.', '.', 'O')]
    (S
      (DATE Next/JJ week/NN)
      I/PRP
      'll/MD
      be/VB
      in/IN
      (GPE Shanghai/NNP)
      ./.)

    doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
    for ent in doc.ents:
        print(ent.text, ent.label_)

Output:

    2 CARDINAL
    9 a.m. TIME
    30% PERCENT
    just 2 days DATE
    WSJ ORG

    from spacy import displacy

    doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
    # Highlight the recognized entities inline in a Jupyter notebook
    displacy.render(doc, style='ent', jupyter=True)


    doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.label_, chunk.root.text)

Output:

    Wall Street Journal NP Journal
    an interesting piece NP piece
    crypto currencies NP currencies

    doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
    for token in doc:
        print("{0}/{1} <--{2}-- {3}/{4}".format(
            token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Output:

    Wall/NNP <--compound-- Street/NNP
    Street/NNP <--compound-- Journal/NNP
    Journal/NNP <--nsubj-- published/VBD
    just/RB <--advmod-- published/VBD
    published/VBD <--ROOT-- published/VBD
    an/DT <--det-- piece/NN
    interesting/JJ <--amod-- piece/NN
    piece/NN <--dobj-- published/VBD
    on/IN <--prep-- piece/NN
    crypto/JJ <--compound-- currencies/NNS
    currencies/NNS <--pobj-- on/IN
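
Each line of the output above is one arc from a token to its head. As a rough illustration, the arcs can be collected into a head-to-children map to locate the root and, for example, the subject and object of the main verb. The triples below are copied from the printed output. `build_tree` is a hypothetical helper, and keying the map by token text only works here because no word repeats in the sentence.

```python
from collections import defaultdict

# (token, dependency label, head) triples taken from the output above.
ARCS = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]


def build_tree(arcs):
    """Return the root token and a map from each head to its (label, child) list."""
    children = defaultdict(list)
    root = None
    for token, dep, head in arcs:
        if dep == "ROOT":
            root = token  # the root is its own head in this output format
        else:
            children[head].append((dep, token))
    return root, dict(children)


root, children = build_tree(ARCS)
print("root:", root)
print("children of root:", children[root])
```
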

    from spacy import displacy

    doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

[Figure: displaCy dependency diagram of the sentence. Tokens are labeled with their coarse POS tags (Wall PROPN, Street PROPN, Journal PROPN, just ADV, published VERB, an DET, interesting ADJ, piece NOUN, on ADP, crypto ADJ, currencies NOUN) and the arcs are labeled compound, compound, nsubj, advmod, det, amod, dobj, prep, compound, pobj.]

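One very powerful representation-learning method in NLP is word2vec, which learns dense vector representations of words from the contexts they appear in. In this representation, semantically related words lie close together in the vector space, and relationships such as v(grandfather) - v(grandmother) ≈ v(man) - v(woman) hold approximately.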
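
The relationship above can be made concrete with a toy example. The two-dimensional vectors below are invented by hand (one axis loosely standing for age, the other for gender) so that the analogy holds exactly. Real word2vec vectors are learned from text, have hundreds of dimensions, and satisfy such relations only approximately.

```python
import math

# Hand-made toy vectors: [age, gender]. These numbers are assumptions
# chosen for illustration, not output of any trained model.
VECTORS = {
    "grandfather": [0.9, 0.8],
    "grandmother": [0.9, -0.8],
    "man": [0.3, 0.8],
    "woman": [0.3, -0.8],
    "banana": [-0.5, 0.1],
}


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def analogy(a, b, c):
    """Return the word whose vector is closest to v(a) - v(b) + v(c)."""
    target = [va - vb + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    return max(VECTORS, key=lambda w: cosine_similarity(target, VECTORS[w]))


# v(man) - v(woman) + v(grandmother) should land near v(grandfather)
nearest = analogy("man", "woman", "grandmother")
print(nearest)
```
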

Download the large English model, which includes word vectors, with the command:

    python3 -m spacy download en_core_web_lg

    nlp = spacy.load('en_core_web_lg')
    print(nlp.vocab['banana'].vector[0:10])

Output:

    [ 0.20228  -0.076618  0.37032   0.032845 -0.41957   0.072069 -0.37476
      0.05746  -0.012401  0.52949 ]

    from scipy import spatial

    # Cosine similarity
    def cosine_similarity(x, y):
        return 1 - spatial.distance.cosine(x, y)

    # Word vectors for man, woman, queen, and king
    man = nlp.vocab['man'].vector
    woman = nlp.vocab['woman'].vector
    queen = nlp.vocab['queen'].vector
    king = nlp.vocab['king'].vector

    # A simple vector calculation: "man" - "woman" + "queen"
    maybe_king = man - woman + queen
    computed_similarities = []

    # Scan the vectors of the whole vocabulary and collect the closest ones
    for word in nlp.vocab:
        if not word.has_vector:
            continue
        similarity = cosine_similarity(maybe_king, word.vector)
        computed_similarities.append((word, similarity))

    # Sort and show the closest results
    computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
    print([w[0].text for w in computed_similarities[:10]])

Output:

    ['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']

    # Semantic similarity (relatedness) between words
    banana = nlp.vocab['banana']
    dog = nlp.vocab['dog']
    fruit = nlp.vocab['fruit']
    animal = nlp.vocab['animal']
    print(dog.similarity(animal), dog.similarity(fruit))        # 0.6618534 0.23552845
    print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

    # Semantic similarity (relatedness) between texts
    target = nlp("Cats are beautiful animals.")
    doc1 = nlp("Dogs are awesome.")
    doc2 = nlp("Some gorgeous creatures are felines.")
    doc3 = nlp("Dolphins are swimming mammals.")
    print(target.similarity(doc1))  # 0.8901765218466683
    print(target.similarity(doc2))  # 0.9115828449161616
    print(target.similarity(doc3))  # 0.7822956752876101
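
spaCy's documentation describes the default document vector as the average of the token vectors, with `similarity` computed as a cosine similarity over those vectors. The sketch below mimics that idea in plain Python. The vocabulary, the two-dimensional vectors, and the helper names are all made up for illustration, so the numbers bear no relation to the scores printed above.

```python
import math

# Hand-made toy word vectors (assumptions, not taken from any model).
WORD_VECTORS = {
    "cats": [0.9, 0.1],
    "dogs": [0.8, 0.2],
    "animals": [0.7, 0.3],
    "are": [0.5, 0.5],
    "stocks": [0.1, 0.9],
    "volatile": [0.2, 0.8],
}


def doc_vector(words):
    """Average the vectors of all words in the text."""
    vectors = [WORD_VECTORS[w] for w in words]
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def text_similarity(words_a, words_b):
    """Cosine similarity of the two averaged text vectors."""
    return cosine_similarity(doc_vector(words_a), doc_vector(words_b))


sim_related = text_similarity(["cats", "are", "animals"], ["dogs", "are", "animals"])
sim_unrelated = text_similarity(["cats", "are", "animals"], ["stocks", "are", "volatile"])
print(sim_related, sim_unrelated)
```

Averaging means that words shared by both texts, such as "are" here, pull the two vectors toward each other. That may be part of the reason short sentences with similar structure tend to receive fairly high scores even when their topics differ.
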
