Task 5: Text Classification Based on Deep Learning (Part 2)
Author: 2tong

Word2vec basics

The basic idea behind the word2vec model is to predict the words that appear in the surrounding context. For each input text, we select a context window and a center word, and use this center word to predict the probability of the other words in the window appearing. As a result, the word2vec model can conveniently learn vector representations for new words from newly added corpora, making it an efficient online learning algorithm.
Unlike traditional machine learning, deep learning provides both feature extraction and classification.
The main idea of word2vec is that a word and its context predict each other. The two corresponding algorithms are:
- Skip-grams (SG): predict the context words from the center word
- Continuous Bag of Words (CBOW): predict the target word from its context
In addition, two more efficient training methods are proposed:
- Hierarchical softmax
- Negative sampling
2. Continuous Bag of Words (CBOW): intuitively, CBOW predicts the input (center) word given its context.
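To make the two prediction directions concrete, here is a minimal Python sketch of how training pairs are generated from a sliding context window for CBOW and Skip-gram; the sentence and window size are illustrative, not taken from the competition data:

```python
# Toy example: generate (context, center) pairs for CBOW and (center, context)
# pairs for Skip-gram from a sliding window.
sentence = "we select a context window and a center word".split()
window = 2

cbow_pairs, sg_pairs = [], []
for i, center in enumerate(sentence):
    # words within `window` positions on either side of the center word
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_pairs.append((context, center))            # CBOW: context -> center word
    sg_pairs.extend((center, c) for c in context)   # Skip-gram: center -> each context word

print(cbow_pairs[3])   # (['select', 'a', 'window', 'and'], 'context')
print(sg_pairs[:3])    # [('we', 'select'), ('we', 'a'), ('select', 'we')]
```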
3. Hierarchical softmax: to avoid computing the softmax probability over every word in the vocabulary, word2vec uses a Huffman tree in place of the mapping from the hidden layer to the output softmax layer.
Building the Huffman tree (a small construction sketch follows this list):
- The Huffman tree is built from the labels and their frequencies (the more frequently a label occurs, the shorter its path in the Huffman tree)
- Each leaf node of the Huffman tree represents one label
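A minimal sketch of the Huffman-tree construction, assuming a toy frequency table (the words and counts are illustrative); it only demonstrates that more frequent items end up with shorter paths/codes:

```python
import heapq
from collections import namedtuple

# Leaf/internal nodes; `order` is a tie-breaker so heapq never compares words.
Node = namedtuple("Node", ["freq", "order", "word", "left", "right"])

def build_huffman(freqs):
    """Build a Huffman tree from a {word: frequency} dict."""
    heap = [Node(f, i, w, None, None) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(a.freq + b.freq, order, None, a, b))
        order += 1
    return heap[0]

def huffman_codes(node, prefix="", table=None):
    """Collect the binary path (code) of every leaf word."""
    table = {} if table is None else table
    if node.word is not None:          # leaf node: one word / label
        table[node.word] = prefix
    else:
        huffman_codes(node.left, prefix + "0", table)
        huffman_codes(node.right, prefix + "1", table)
    return table

tree = build_huffman({"the": 100, "cat": 20, "sat": 15, "mat": 10})
print(huffman_codes(tree))  # the most frequent word ("the") gets the shortest code
```

In hierarchical softmax, each internal node of such a tree carries its own parameter vector, so predicting a word only requires binary decisions along its (short) path instead of a softmax over the whole vocabulary.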
4. Negative sampling: the size of the vocabulary means the Skip-Gram network has a very large weight matrix, and all of these weights would need to be adjusted using hundreds of millions of training samples. This is very computationally expensive and makes training in practice very slow.
Negative Sampling is proposed to solve this problem. It is a technique for speeding up training and improving the quality of the resulting word vectors: instead of updating all of the weights for every training sample, each training sample updates only a small subset of the weights, which reduces the computation required in gradient descent.
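A minimal sketch of how negative samples can be drawn; the vocabulary, counts, and the `num_negative` parameter are illustrative, and the 3/4-power unigram noise distribution follows the original word2vec paper:

```python
import numpy as np

# Toy frequency table; in practice these are the corpus word counts.
word_counts = {"the": 100, "cat": 20, "sat": 15, "mat": 10, "on": 50}
words = list(word_counts)

# word2vec draws negatives from the unigram distribution raised to the 3/4 power
freqs = np.array([word_counts[w] for w in words], dtype=np.float64)
noise_dist = freqs ** 0.75
noise_dist /= noise_dist.sum()

def sample_negatives(positive_word, num_negative=5):
    """Draw `num_negative` words different from the positive (target) word."""
    negatives = []
    while len(negatives) < num_negative:
        w = np.random.choice(words, p=noise_dist)
        if w != positive_word:
            negatives.append(w)
    return negatives

print(sample_negatives("cat"))
```

For each training pair, the model then only updates the weights belonging to the positive word and these few sampled negative words, rather than the full output layer.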
Word2vec model training

1. Reference links
- gensim
- reference code from the forum
2. Training code

import logging
import numpy as np
import pandas as pd
from gensim.models.word2vec import Word2Vec

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

    # split the data into 10 folds
    fold_num = 10
    data_file = '/home/2tong/data/train_set.csv'
    fold_data = all_data2fold(fold_num)  # all_data2fold is defined in the forum reference code

    # build the training data for word2vec
    fold_id = fold_num - 1
    train_texts = []
    for i in range(0, fold_id):
        data = fold_data[i]
        train_texts.extend(data['text'])
    logging.info('Total %d docs.' % len(train_texts))
    logging.info('Start training...')

    num_features = 100  # word vector dimensionality
    num_workers = 8     # number of threads to run in parallel
    train_texts = list(map(lambda x: list(x.split()), train_texts))
    model = Word2Vec(train_texts, workers=num_workers, size=num_features)
    model.init_sims(replace=True)

    # save model
    model.save("./word2vec_model/word2vec.bin")
    # convert format
    model.wv.save_word2vec_format('./word2vec_model/word2vec.txt', binary=False)
3. Log output
2020-07-31 21:28:57,521 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
2020-07-31 21:28:57,543 INFO: Total 9000 docs.
2020-07-31 21:28:57,543 INFO: Start training...
2020-07-31 21:28:58,221 INFO: collecting all words and their counts
2020-07-31 21:28:58,221 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 21:28:59,413 INFO: collected 5290 word types from a corpus of 8223375 raw words and 9000 sentences
2020-07-31 21:28:59,413 INFO: Loading a fresh vocabulary
2020-07-31 21:28:59,507 INFO: effective_min_count=5 retains 4324 unique words (81% of original 5290, drops 966)
2020-07-31 21:28:59,507 INFO: effective_min_count=5 leaves 8221380 word corpus (99% of original 8223375, drops 1995)
2020-07-31 21:28:59,519 INFO: deleting the raw counts dictionary of 5290 items
2020-07-31 21:28:59,520 INFO: sample=0.001 downsamples 61 most-common words
2020-07-31 21:28:59,520 INFO: downsampling leaves estimated 7098252 word corpus (86.3% of prior 8221380)
2020-07-31 21:28:59,528 INFO: estimated required memory for 4324 words and 100 dimensions: 5621200 bytes
2020-07-31 21:28:59,528 INFO: resetting layer weights
2020-07-31 21:29:00,375 INFO: training model with 8 workers on 4324 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 21:29:01,384 INFO: EPOCH 1 - PROGRESS: at 25.21% examples, 1784282 words/s, in_qsize 16, out_qsize 1
...
2020-07-31 21:29:04,402 INFO: EPOCH 1 - PROGRESS: at 91.77% examples, 1617927 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:04,699 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:04,732 INFO: EPOCH - 1 : training on 8223375 raw words (7061176 effective words) took 4.3s, 1623264 effective words/s
2020-07-31 21:29:05,743 INFO: EPOCH 2 - PROGRESS: at 21.42% examples, 1504059 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:08,767 INFO: EPOCH 2 - PROGRESS: at 84.64% examples, 1481574 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:09,398 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:09,419 INFO: EPOCH - 2 : training on 8223375 raw words (7062608 effective words) took 4.7s, 1508810 effective words/s
2020-07-31 21:29:10,430 INFO: EPOCH 3 - PROGRESS: at 23.41% examples, 1645736 words/s, in_qsize 16, out_qsize 0
...
2020-07-31 21:29:13,886 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:13,909 INFO: EPOCH - 3 : training on 8223375 raw words (7062543 effective words) took 4.5s, 1574839 effective words/s
2020-07-31 21:29:14,923 INFO: EPOCH 4 - PROGRESS: at 18.72% examples, 1321168 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:17,946 INFO: EPOCH 4 - PROGRESS: at 86.46% examples, 1514607 words/s, in_qsize 15, out_qsize 0
2020-07-31 21:29:18,515 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:18,544 INFO: EPOCH - 4 : training on 8223375 raw words (7060892 effective words) took 4.6s, 1524940 effective words/s
2020-07-31 21:29:19,559 INFO: EPOCH 5 - PROGRESS: at 21.06% examples, 1472794 words/s, in_qsize 13, out_qsize 2
...
2020-07-31 21:29:22,568 INFO: EPOCH 5 - PROGRESS: at 88.28% examples, 1552921 words/s, in_qsize 14, out_qsize 1
2020-07-31 21:29:23,043 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:23,063 INFO: EPOCH - 5 : training on 8223375 raw words (7061288 effective words) took 4.5s, 1564422 effective words/s
2020-07-31 21:29:23,064 INFO: training on a 41116875 raw words (35308507 effective words) took 22.7s, 1556223 effective words/s
2020-07-31 21:29:23,064 INFO: precomputing L2-norms of word weight vectors
2020-07-31 21:29:23,068 INFO: saving Word2Vec object under ./word2vec_model/word2vec.bin, separately None
2020-07-31 21:29:23,069 INFO: not storing attribute vectors_norm
2020-07-31 21:29:23,069 INFO: not storing attribute cum_table
2020-07-31 21:29:23,130 INFO: saved ./word2vec_model/word2vec.bin
2020-07-31 21:29:23,131 INFO: storing 4324x100 projection weights into ./word2vec_model/word2vec.txt
4. Generated directory
word2vec_model/
├── word2vec.bin
└── word2vec.txt

0 directories, 2 files
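The saved files can then be loaded back for downstream use. A minimal sketch, assuming gensim 3.x (the same API used in the training script above); the word used for the similarity query is just the first entry in the learned vocabulary:

```python
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

# load the saved Word2Vec object and the exported plain-text vectors
model = Word2Vec.load("./word2vec_model/word2vec.bin")
wv = KeyedVectors.load_word2vec_format("./word2vec_model/word2vec.txt", binary=False)

print(wv.vector_size)            # 100, matching num_features
some_word = wv.index2word[0]     # an arbitrary word from the vocabulary
print(wv.most_similar(some_word, topn=5))
```

Note that because init_sims(replace=True) was called before saving, the stored vectors are L2-normalized.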