python sklearn TfidfVectorizer

Reference: http://python.jobbole.com/81311/

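    # -*- coding:utf-8 -*-
    from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
    import math
    import numpy as np

    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]

    vectorizer = TfidfVectorizer(min_df=1)
    vectorizer.fit_transform(corpus)

    # Vocabulary terms in column order.
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # on newer versions call get_feature_names_out() instead.
    print(vectorizer.get_feature_names())

    # Mapping from term to column index
    print(TfidfVectorizer().fit(corpus).vocabulary_)
    # Learned idf weight of each term
    print(TfidfVectorizer().fit(corpus).idf_)
    # Whether idf smoothing is enabled (default: True)
    print(TfidfVectorizer().fit(corpus).smooth_idf)

    x = TfidfVectorizer().fit(corpus)
    # print(x.transform(corpus).toarray())
    # print(vectorizer.fit_transform(corpus))

    # Dense tf-idf matrix: one row per document, one column per term
    print(vectorizer.fit_transform(corpus).toarray())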

Output:

    [u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
    {u'and': 0, u'third': 7, u'this': 8, u'is': 3, u'one': 4, u'second': 5, u'the': 6, u'document': 1, u'first': 2}
    [1.91629073 1.22314355 1.51082562 1.22314355 1.91629073 1.91629073
     1.         1.91629073 1.22314355]
    True
    [[0.         0.43877674 0.54197657 0.43877674 0.         0.
      0.35872874 0.         0.43877674]
     [0.         0.27230147 0.         0.27230147 0.         0.85322574
      0.22262429 0.         0.27230147]
     [0.55280532 0.         0.         0.         0.55280532 0.
      0.28847675 0.55280532 0.        ]
     [0.         0.43877674 0.54197657 0.43877674 0.         0.
      0.35872874 0.         0.43877674]]

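The final values differ from what you get by working the textbook tf-idf formula by hand, because by default TfidfVectorizer smooths the idf (smooth_idf is True, as printed above) and normalizes every row.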
In the idf output, "the" has the value 1 because it appears in every document; the values of the other words are measured against this baseline.
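The idf values can be reproduced without scikit-learn. The sketch below assumes the default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses a simplified tokenizer (lowercasing plus a regex for tokens of two or more word characters) that is only meant to mimic the vectorizer's default behavior on this small corpus. The names `tokenize`, `doc_freq` and `idf` are illustrative and not part of any library.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Simplified stand-in for the vectorizer's default analyzer:
    # lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

for term in vocab:
    print(term, doc_freq[term], round(idf[term], 8))
```

Because "the" occurs in all four documents, the logarithm becomes ln(5/5) = 0 and its idf is exactly 1; rarer terms get larger weights.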
You can also see that in the result the squares of the values in each row sum to 1, that is, every row is L2-normalized.
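Moreover, 0.43877674 / 0.35872874 = 1.22314355 and 0.85322574 / 0.22262429 = 1.91629073 × 2. In other words, within a row the ratio of a word's entry to the entry for "the" equals that word's count multiplied by its idf ("second" occurs twice in the second document).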
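The two observations above (unit row norm and the ratios against "the") can be checked by rebuilding the first two rows by hand. This is a minimal sketch that assumes the weight of a term is count × smoothed idf followed by L2 normalization of the row; the document frequencies are read off the corpus, and `tfidf_row` is a hypothetical helper, not a scikit-learn function.

```python
import math

n_docs = 4
# Document frequencies read off the four-document corpus
doc_freq = {'and': 1, 'document': 3, 'first': 2, 'is': 3, 'one': 1,
            'second': 1, 'the': 4, 'third': 1, 'this': 3}
# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_row(counts):
    # Raw weight = term count * idf, then divide by the row's L2 norm
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

# 'This is the first document.'
row1 = tfidf_row({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1})
# 'This is the second second document.'
row2 = tfidf_row({'this': 1, 'is': 1, 'the': 1, 'second': 2, 'document': 1})

print(sum(v * v for v in row1.values()))   # sum of squares of the row
print(row1['document'] / row1['the'])      # equals idf['document']
print(row2['second'] / row2['the'])        # equals 2 * idf['second']
```

Normalization rescales every entry in a row by the same factor, so the ratios between entries are unchanged; that is why dividing by the entry for "the" (count 1, idf 1) recovers count × idf.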
