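This article is compiled from the NLP study group activity organized by DataWhale.
Competition name
Introduction to NLP from Scratch: News Text Classification (https://tianchi.aliyun.com/competition/entrance/531810/introduction)
Resources on Word2Vec
- Word2Vec Principles (1): CBOW and Skip-Gram Model Basics (https://www.cnblogs.com/pinard/p/7160330.html)
- gensim documentation: introduction to Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html)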
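The first resource covers CBOW and Skip-Gram. As a rough illustration of what Skip-Gram trains on, the sketch below enumerates (center, context) pairs inside a fixed window. The helper name `skipgram_pairs`, the window size, and the toy tokens are assumptions made for illustration; this is not gensim's implementation, which also applies subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence of made-up, whitespace-separated tokens
sentence = "2967 6758 339 2021".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

CBOW reverses the direction: the context tokens are combined to predict the center token. gensim's `Word2Vec` trains CBOW unless `sg=1` is passed.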
import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)

# split data to 10 fold
fold_num = 10
data_file = '../data/train_set.csv'
import pandas as pd


def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)

    # shuffle the samples once before grouping them
    index = list(range(total))
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    # group sample positions by label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)

    # deal each label's samples across the folds so every fold keeps the label distribution
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # print(label, len(data))
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        offset = 0
        for i in range(fold_num):
            cur_batch_size = batch_size + 1 if i < other else batch_size
            # print(cur_batch_size)
            # a running offset keeps the slices disjoint; indexing from i * batch_size
            # would place the same sample in two folds whenever other > 0
            batch_data = data[offset: offset + cur_batch_size]
            offset += cur_batch_size
            all_index[i].extend(batch_data)

    # even out the fold sizes: oversized folds hand their surplus to undersized ones
    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data


fold_data = all_data2fold(10)

''' Out
2020-07-30 23:30:04,912 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
'''

# build train data for word2vec
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])

logging.info('Total %d docs.' % len(train_texts))

''' Out
2020-07-18 23:30:04,929 INFO: Total 9000 docs.
'''

logging.info('Start training...')
from gensim.models.word2vec import Word2Vec

num_features = 100  # Word vector dimensionality
num_workers = 8  # Number of threads to run in parallel

train_texts = list(map(lambda x: list(x.split()), train_texts))
# gensim 3.x API; gensim 4.x renamed `size` to `vector_size` and deprecated init_sims()
model = Word2Vec(train_texts, workers=num_workers, size=num_features)
model.init_sims(replace=True)

# save model
model.save("./word2vec.bin")

''' Out
(log output is long; omitted)
'''

# load model
model = Word2Vec.load("./word2vec.bin")

# convert format
model.wv.save_word2vec_format('./word2vec.txt', binary=False)

''' Out
(log output is long; omitted)
'''
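The exported `./word2vec.txt` uses the plain word2vec text layout: a header line with the vocabulary size and the vector dimension, followed by one line per token holding the token and its space-separated values. The sketch below reads that layout without gensim, which can help when a downstream model only needs an embedding table. The function name `load_word2vec_text` and the sample tokens and values are made up for illustration; gensim's own loader remains the more robust route.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec text format: 'vocab_size dim' header, then 'token v1 ... v_dim' per line."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny in-memory stand-in for ./word2vec.txt (hypothetical tokens and values)
sample = io.StringIO("2 3\n3750 0.1 0.2 0.3\n648 -0.4 0.5 0.6\n")
vocab_size, dim, vectors = load_word2vec_text(sample)
print(vocab_size, dim, vectors["648"])
```

For the real file, the same function can be called on `open('./word2vec.txt', encoding='utf-8')`.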
[Introduction to NLP, Task05: Neural-Network-Based Text Classification - Word2Vec]