20210611 Code Implementation of word2vec

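This post uses a third-party package (gensim) to build word vectors in practice. Word2Vec is a word embedding method: for every word it computes a distributed representation (usually just called a word vector) from the contexts the word appears in within the given corpus. These vectors capture the semantics of each word to a certain extent.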

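1 Basic usage
1-1 Reading the corpus
There are 3 ways:
1 Keep the corpus in memory, in the form [[word1,word2,word3...],[word1,word2,word3...],...], where each sub-list is one document that has already been segmented into words.
2 Use LineSentence
class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None); source is the path of a readable file in which each line is one document, already segmented, with words separated by spaces. max_sentence_length is the maximum document length (longer lines are split into chunks of that many words), and limit is the number of documents (that is, lines) to read from the start of the file.
3 Use PathLineSentences
class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)
This approach is used less often. It works like LineSentence, except that source is a root directory containing several readable files; each file must follow the same format that LineSentence expects, and all files under that directory are processed.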
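To make the expected file format concrete, here is a minimal pure-Python sketch of the same idea. The class name SimpleLineSentence, the demo file, and its contents are hypothetical and not part of gensim; the class only mimics the behaviour described above (one space-separated document per line, chunking at max_sentence_length, stopping after limit lines).

```python
import os
import tempfile


class SimpleLineSentence:
    """Hypothetical stand-in for a LineSentence-style corpus reader."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.source, "r", encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                if self.limit is not None and line_no >= self.limit:
                    break
                tokens = line.split()
                # Split over-long documents into chunks of max_sentence_length tokens.
                for start in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[start:start + self.max_sentence_length]


# Build a tiny demo corpus: one pre-segmented document per line.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_corpus.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("小米 手机 性价比 高\n苹果 手机 系统 流畅\n")

demo_sentences = list(SimpleLineSentence(demo_path))
print(demo_sentences)
```

Because the file is reopened in __iter__, the object can be looped over several times, which matters for anything that needs more than one pass over the corpus.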

import jieba
from gensim.models import word2vec

1-1-2 In-memory approach

# Load the custom dictionary
jieba.load_userdict("MobilePhone_Userdict.txt")
# Read the stop words into the stopwords list
filepath = r'stopwords.txt'
with open(filepath, 'r', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]

# Read the file and segment each line into words
def readfile2wordlist(file_path):
    cut_word_list = []
    with open(file_path, 'r', encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            seg_list = jieba.cut(line)
            # Drop stop words and single-space tokens
            seg_list = [i for i in seg_list if i not in stopwords and i != ' ']
            cut_word_list.append(seg_list)
    return cut_word_list

# Unsegmented corpus
file_path = 'mb.txt'
sentences = readfile2wordlist(file_path)
print(sentences[:10])
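If the segmented result should later be read with the file-based approach, one option is to write it out as one document per line with words separated by spaces. A minimal sketch; the function name write_corpus_lines, the sample documents, and the output file name are made up for illustration.

```python
import os
import tempfile


def write_corpus_lines(documents, out_path):
    """Write each tokenized document as one line of space-separated words."""
    with open(out_path, "w", encoding="utf-8") as f:
        for tokens in documents:
            f.write(" ".join(tokens) + "\n")


# Hypothetical sample with the same shape as the in-memory corpus: a list of token lists.
sample_docs = [["小米", "手机", "拍照", "清晰"], ["苹果", "系统", "流畅"]]
out_path = os.path.join(tempfile.mkdtemp(), "sample_train.txt")
write_corpus_lines(sample_docs, out_path)

# Read the file back to check the layout.
with open(out_path, "r", encoding="utf-8") as f:
    written_lines = f.read().splitlines()
print(written_lines)
```

Splitting each written line on whitespace gives back the original token lists, which is the format the file-based readers expect.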

1-1-3 File-based approach

file_path = 'mb_train.txt'
# Read the corpus with LineSentence
sentences = word2vec.LineSentence(file_path, max_sentence_length=10000, limit=None)
print(sentences)

Output:
<gensim.models.word2vec.LineSentence object at 0x0000021AEDAD47C0>
To see the actual contents of sentences you need a for loop; the object yields documents lazily, much like a generator.

for document in sentences:
    print(document)
# The result is the same as with the in-memory approach

2-1 Training word2vec semantic vectors
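# For training, use the class Word2Vec (capitalized), not the module name word2vec
# class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,
# max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
# sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
# trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
# max_final_vocab=None)
# sentences (iterable of iterables): the input corpus, in the same form as the ones built above
# sg (int, {1, 0}): the training algorithm; 1 selects skip-gram, otherwise CBOW is used
# hs (int, {1, 0}): whether to use hierarchical softmax; 1 enables it, 0 disables it
# size (int): dimensionality of the word vectors
# alpha (float): the initial learning rate
# window (int): maximum distance between the current word and the predicted word within a sentence
# min_count (int): words whose total frequency is below this value are ignored
# Note: this is the gensim 3.x signature; gensim 4.x renamed size to vector_size and iter to epochs
# Running the line below trains the model, because the corpus is passed to the constructor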

model = word2vec.Word2Vec(sentences, hs=1,min_count=1,window=3,size=200)
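To get a feel for what min_count and window do to the training data, the following simplified sketch filters a toy corpus by frequency and then lists the (center word, context word) pairs that fall inside the window. The function names and the toy corpus are hypothetical, and this is not how gensim is implemented internally; it leaves out refinements such as the down-sampling of frequent words controlled by sample.

```python
from collections import Counter


def build_vocab(documents, min_count):
    """Keep only words whose total frequency is at least min_count."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(documents, vocab, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for doc in documents:
        # Words outside the vocabulary are dropped before windows are formed.
        kept = [w for w in doc if w in vocab]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]


# Made-up toy corpus in the [[word, ...], ...] format.
toy_docs = [["小米", "手机", "便宜"], ["苹果", "手机", "贵"], ["小米", "手机", "好用"]]
vocab = build_vocab(toy_docs, min_count=2)
pairs = list(context_pairs(toy_docs, vocab, window=1))
print(sorted(vocab))
print(pairs)
```

With min_count=2 only 小米 and 手机 survive, so a small corpus combined with a high min_count can leave very little to train on; that is presumably why the call above uses min_count=1.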

2-2 Saving the model
# model.save(file_name)  # file_name: name of the file the model is stored in

model.save('mb_word2vec.bin')

2-3 Loading the model
# word2vec.Word2Vec.load(file_name)  # file_name: name of the stored model file

model = word2vec.Word2Vec.load('mb_word2vec.bin')

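2-4 Using the model

# Get the vocabulary (index2word is the gensim 3.x name; gensim 4.x calls it index_to_key)
print(model.wv.index2word)
# Get the word2vec vector of a word; going through model.wv works in both gensim 3.x and 4.x
model.wv['Apple']
model.wv['sudo']
# Compute the semantic similarity of two words
print(model.wv.similarity('安卓', '苹果'))
print(model.wv.similarity('金立', '小米'))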
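The similarity score computed above is the cosine similarity of the two word vectors. A minimal sketch of that computation in plain Python; the three vectors are made up for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors.
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]   # same direction as vec_a
vec_c = [2.0, -1.0, 0.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their lengths.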

Notes on some of the code:
1. strip()
https://blog.51cto.com/u_15149862/2812172
2. print(sentences[:10])
https://blog.51cto.com/u_15149862/2704954

Some theoretical background:
What is Word2Vec?
https://blog.51cto.com/u_15149862/2897151