Introduction to NLP from Scratch_Task5_Deep-Learning-Based Text Classification 2_Word2vec

Task5 Deep-Learning-Based Text Classification 2

Author: 2tong
Unlike traditional machine learning, deep learning provides both feature extraction and classification.
Word2vec basics
The basic idea behind the word2vec model is to predict the words that appear in a surrounding context. For each piece of input text we choose a context window and a center word, and use that center word to predict the probability of the other words appearing in the window. Word2vec can therefore conveniently learn vector representations for newly added words from new corpora, making it an efficient online learning algorithm.
The main idea of word2vec is that words and their contexts predict each other; the two corresponding algorithms are:
  • Skip-grams (SG): predict the context words
  • Continuous Bag of Words (CBOW): predict the target word
    In addition, two more efficient training methods were proposed:
  • Hierarchical softmax
  • Negative sampling
1. Skip-grams (SG): Intuitively, Skip-Gram takes a given input word and predicts its context.
2. Continuous Bag of Words (CBOW): Intuitively, CBOW takes a given context and predicts the input word.
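To make the two objectives concrete, here is a minimal sketch of how (center, context) training pairs are generated from a tokenized sentence; the window size and the toy sentence are illustrative assumptions, not part of the course data:

# Minimal sketch of training-pair generation; window size and the toy
# sentence are illustrative assumptions.
def make_pairs(tokens, window=2):
    """Yield (center, context) pairs from one tokenized sentence."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = ['the', 'quick', 'brown', 'fox', 'jumps']
pairs = list(make_pairs(sentence))
print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]

Skip-Gram trains on each pair in the center → context direction, while CBOW groups the same window the other way around, averaging the context vectors to predict the center word.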
3. Hierarchical softmax: To avoid computing softmax probabilities over every word in the vocabulary, word2vec uses a Huffman tree to replace the mapping from the hidden layer to the output softmax layer.
Building the Huffman tree (a code sketch follows this list):
  • Build the tree from the labels and their frequencies (the more frequent a label, the shorter its path in the Huffman tree)
  • Each leaf node of the Huffman tree represents one label
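To make the construction concrete, here is a minimal sketch of Huffman-tree building with Python's heapq; the toy frequency table is an illustrative assumption:

import heapq
import itertools

# toy label frequencies (illustrative assumption)
freqs = {'the': 50, 'cat': 20, 'sat': 15, 'mat': 10, 'on': 5}

counter = itertools.count()  # tie-breaker so heapq never compares node dicts
heap = [(f, next(counter), {'word': w}) for w, f in freqs.items()]
heapq.heapify(heap)

# repeatedly merge the two least frequent nodes until one root remains
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), {'left': left, 'right': right}))

root = heap[0][2]

def codes(node, prefix=''):
    """Collect the binary root-to-leaf code of every leaf word."""
    if 'word' in node:
        yield node['word'], prefix
    else:
        yield from codes(node['left'], prefix + '0')
        yield from codes(node['right'], prefix + '1')

print(dict(codes(root)))  # the most frequent label 'the' gets the shortest code

With hierarchical softmax, predicting a word then costs one binary (sigmoid) decision per edge on its path, i.e. O(log V) operations instead of an O(V) softmax over the whole vocabulary.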
4. Negative sampling
Background: training a neural network means feeding in training samples and continually adjusting the neurons' weights so that predictions of the target keep improving. Every time the network is trained on one sample, all of its weights receive an update.
The size of the vocabulary therefore determines how large the weight matrices of our Skip-Gram network will be. All of these weights would have to be tuned on billions of training samples, which is very computationally expensive and makes training very slow in practice.
Negative Sampling was proposed to solve this problem. It is a method for speeding up training while also improving the quality of the resulting word vectors. Instead of having every training sample update all of the weights, negative sampling lets each training sample update only a small fraction of them, which reduces the computation in gradient descent (see the sketch below).
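Here is a minimal NumPy sketch of one skip-gram-with-negative-sampling update; the vocabulary size, vector dimensionality, learning rate, and number of negatives are illustrative assumptions. The point to notice is that only the row of the center word and the k+1 rows of the sampled output words are touched, not the full weight matrices:

import numpy as np

rng = np.random.default_rng(0)
V, D, k, lr = 1000, 100, 5, 0.025  # vocab size, dims, negatives, learning rate (assumed)
W_in = rng.normal(0, 0.1, (V, D))  # input (center-word) vectors
W_out = np.zeros((V, D))           # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center, context, neg_ids):
    """One gradient step: pull (center, context) together, push negatives apart."""
    v = W_in[center]
    ids = np.concatenate(([context], neg_ids))  # 1 positive + k negatives
    labels = np.zeros(len(ids))
    labels[0] = 1.0
    u = W_out[ids]                      # (k+1, D): the only output rows touched
    g = (sigmoid(u @ v) - labels) * lr  # prediction error per sampled word
    W_in[center] -= g @ u               # update one input row
    W_out[ids] -= np.outer(g, v)        # update k+1 output rows

# uniform negatives for brevity; real word2vec samples ∝ frequency^0.75
neg_ids = rng.integers(0, V, size=k)
sgns_update(center=3, context=17, neg_ids=neg_ids)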
Word2vec model training
1. Reference links
  • gensim
  • forum reference code
2. Implementation code
import logging

import numpy as np
import pandas as pd
from gensim.models.word2vec import Word2Vec

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

    # split data to 10 fold
    fold_num = 10
    data_file = '/home/2tong/data/train_set.csv'
    # all_data2fold comes from the forum reference code linked above:
    # it reads data_file and splits it into fold_num label-balanced folds
    fold_data = all_data2fold(fold_num)

    # build train data for word2vec
    fold_id = fold_num - 1

    train_texts = []
    for i in range(0, fold_id):
        data = fold_data[i]
        train_texts.extend(data['text'])

    logging.info('Total %d docs.' % len(train_texts))
    logging.info('Start training...')

    num_features = 100  # Word vector dimensionality
    num_workers = 8     # Number of threads to run in parallel

    train_texts = list(map(lambda x: list(x.split()), train_texts))
    model = Word2Vec(train_texts, workers=num_workers, size=num_features)
    model.init_sims(replace=True)

    # save model
    model.save("./word2vec_model/word2vec.bin")

    # convert format
    model.wv.save_word2vec_format('./word2vec_model/word2vec.txt', binary=False)
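After training, the saved model can be reloaded for downstream feature extraction. A minimal usage sketch, assuming the same gensim 3.x API and paths as the script above; '3750' is only an example token id, substitute any word from the training vocabulary:

from gensim.models.word2vec import Word2Vec

# reload the trained model saved by the script above
model = Word2Vec.load("./word2vec_model/word2vec.bin")

# look up a 100-dim vector and its nearest neighbours;
# '3750' is an example token id, not a value prescribed by the course
vec = model.wv['3750']
print(vec.shape)                              # (100,)
print(model.wv.most_similar('3750', topn=5))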

3. Log output
2020-07-31 21:28:57,521 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
2020-07-31 21:28:57,543 INFO: Total 9000 docs.
2020-07-31 21:28:57,543 INFO: Start training...
2020-07-31 21:28:58,221 INFO: collecting all words and their counts
2020-07-31 21:28:58,221 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 21:28:59,413 INFO: collected 5290 word types from a corpus of 8223375 raw words and 9000 sentences
2020-07-31 21:28:59,413 INFO: Loading a fresh vocabulary
2020-07-31 21:28:59,507 INFO: effective_min_count=5 retains 4324 unique words (81% of original 5290, drops 966)
2020-07-31 21:28:59,507 INFO: effective_min_count=5 leaves 8221380 word corpus (99% of original 8223375, drops 1995)
2020-07-31 21:28:59,519 INFO: deleting the raw counts dictionary of 5290 items
2020-07-31 21:28:59,520 INFO: sample=0.001 downsamples 61 most-common words
2020-07-31 21:28:59,520 INFO: downsampling leaves estimated 7098252 word corpus (86.3% of prior 8221380)
2020-07-31 21:28:59,528 INFO: estimated required memory for 4324 words and 100 dimensions: 5621200 bytes
2020-07-31 21:28:59,528 INFO: resetting layer weights
2020-07-31 21:29:00,375 INFO: training model with 8 workers on 4324 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 21:29:01,384 INFO: EPOCH 1 - PROGRESS: at 25.21% examples, 1784282 words/s, in_qsize 16, out_qsize 1
...
2020-07-31 21:29:04,402 INFO: EPOCH 1 - PROGRESS: at 91.77% examples, 1617927 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:04,699 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:04,732 INFO: EPOCH - 1 : training on 8223375 raw words (7061176 effective words) took 4.3s, 1623264 effective words/s
2020-07-31 21:29:05,743 INFO: EPOCH 2 - PROGRESS: at 21.42% examples, 1504059 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:08,767 INFO: EPOCH 2 - PROGRESS: at 84.64% examples, 1481574 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:09,398 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:09,419 INFO: EPOCH - 2 : training on 8223375 raw words (7062608 effective words) took 4.7s, 1508810 effective words/s
2020-07-31 21:29:10,430 INFO: EPOCH 3 - PROGRESS: at 23.41% examples, 1645736 words/s, in_qsize 16, out_qsize 0
...
2020-07-31 21:29:13,886 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:13,909 INFO: EPOCH - 3 : training on 8223375 raw words (7062543 effective words) took 4.5s, 1574839 effective words/s
2020-07-31 21:29:14,923 INFO: EPOCH 4 - PROGRESS: at 18.72% examples, 1321168 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:17,946 INFO: EPOCH 4 - PROGRESS: at 86.46% examples, 1514607 words/s, in_qsize 15, out_qsize 0
2020-07-31 21:29:18,515 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:18,544 INFO: EPOCH - 4 : training on 8223375 raw words (7060892 effective words) took 4.6s, 1524940 effective words/s
2020-07-31 21:29:19,559 INFO: EPOCH 5 - PROGRESS: at 21.06% examples, 1472794 words/s, in_qsize 13, out_qsize 2
...
2020-07-31 21:29:22,568 INFO: EPOCH 5 - PROGRESS: at 88.28% examples, 1552921 words/s, in_qsize 14, out_qsize 1
2020-07-31 21:29:23,043 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:23,063 INFO: EPOCH - 5 : training on 8223375 raw words (7061288 effective words) took 4.5s, 1564422 effective words/s
2020-07-31 21:29:23,064 INFO: training on a 41116875 raw words (35308507 effective words) took 22.7s, 1556223 effective words/s
2020-07-31 21:29:23,064 INFO: precomputing L2-norms of word weight vectors
2020-07-31 21:29:23,068 INFO: saving Word2Vec object under ./word2vec_model/word2vec.bin, separately None
2020-07-31 21:29:23,069 INFO: not storing attribute vectors_norm
2020-07-31 21:29:23,069 INFO: not storing attribute cum_table
2020-07-31 21:29:23,130 INFO: saved ./word2vec_model/word2vec.bin
2020-07-31 21:29:23,131 INFO: storing 4324x100 projection weights into ./word2vec_model/word2vec.txt

4. Generated directory
word2vec_model/
├── word2vec.bin
└── word2vec.txt

0 directories, 2 files
