word2vec + LSTM for sentence classification: a simple example

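Data: 30,000 texts, split into train / val / test at a 6:2:2 ratio.
Tools and approach: PyTorch, sklearn, and gensim's word2vec.
Each sentence is represented as the sequence of its word2vec word vectors; after padding to a common length, an LSTM + linear layer classifies the sequence.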
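The 6:2:2 split is not shown in the listing, so here is one way it could be done with sklearn's `train_test_split`, called twice. This is only a sketch: the helper name `split_6_2_2`, the stratification, the seed, and the toy data are all assumptions for illustration, not the original code.

```python
from sklearn.model_selection import train_test_split


def split_6_2_2(samples, labels, seed=0):
    """Hypothetical helper: split into 60% train, 20% val, 20% test."""
    # First carve off 40% of the data, keeping the class ratios (stratify).
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.4, stratify=labels, random_state=seed)
    # Then cut that 40% in half: 20% validation, 20% test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


# Toy data standing in for the 30,000 texts (made up for this example).
samples = [f"sentence {i}" for i in range(100)]
labels = [i % 3 for i in range(100)]
train, val, test = split_6_2_2(samples, labels)
```

Stratifying keeps the three labels in roughly the same proportion in every subset, which makes the validation and test scores easier to compare.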
Code

import jieba
import xgboost as xgb
from sklearn.model_selection import train_test_split
import numpy as np
from gensim.models import Word2Vec

# reorganize data: one sentence per line, tokenized with jieba
def get_split_sentences(file_path):
    res_sen = []
    with open(file_path) as f:
        for line in f:
            split_query = jieba.lcut(line.strip())
            res_sen.append(split_query)
    return res_sen

label2_sentences = get_split_sentences('label2.csv')
label0_sentences = get_split_sentences('label0.csv')
label1_sentences = get_split_sentences('label1.csv')

all_sentences = []
all_sentences.extend(label0_sentences)
all_sentences.extend(label1_sentences)
all_sentences.extend(label2_sentences)

# set params
emb_size = 128
win = 3
model = Word2Vec(sentences=all_sentences, vector_size=emb_size, window=win, min_count=1)
# retrieve word embeddings (model.wv is a gensim KeyedVectors object, not a dict)
w2vec = model.wv

# assemble sentence embeddings: stack the word vectors, then zero-pad to max_len
def assemble_x(w2vec, sentences):
    sen_vs = []
    for sen in sentences:
        v = np.vstack([w2vec[w] for w in sen])
        v_len = v.shape[0]
        sen_v = np.concatenate((v, np.zeros((max_len - v_len, emb_size)))) if v_len
        # [the listing is cut off at this point]
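The pipeline described above ends with an LSTM followed by a linear layer. A minimal PyTorch sketch of such a classifier might look like the following. The class name `LSTMClassifier`, the hidden size of 64, and the random input batch are assumptions for illustration; only `emb_size=128` and the three labels come from the post.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Hypothetical LSTM + linear sentence classifier."""

    def __init__(self, emb_size=128, hidden_size=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, max_len, emb_size), the padded word2vec sequences
        _, (h_n, _) = self.lstm(x)
        # h_n[-1] is the last layer's final hidden state: (batch, hidden_size)
        return self.fc(h_n[-1])  # unnormalized class scores (logits)


clf = LSTMClassifier()
# Random stand-in for 4 padded sentences of length 20 (not real data).
# NumPy arrays built with np.zeros are float64, so real inputs would need
# torch.from_numpy(arr).float() before being passed to the LSTM.
batch = torch.randn(4, 20, 128)
logits = clf(batch)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))
```

Taking the final hidden state as the sentence representation is the simplest choice, but with plain zero padding that state is computed after the LSTM has stepped through all the padded positions.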

Results
ACC: 0.4303
macro:
Recall: 0.3333
F1-score: 0.2006
Precision: 0.1434
micro:
Recall: 0.4303
F1-score: 0.4303
Precision: 0.4303
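These numbers have a recognizable pattern: macro recall is exactly 1/3, macro precision equals ACC / 3 (0.4303 / 3 ≈ 0.1434), and macro F1 equals (2 × 0.4303 / 1.4303) / 3 ≈ 0.2006. That is what a 3-class model scores when it predicts the same class for every sample. The sketch below reproduces the pattern with sklearn on made-up labels. The helper name `report` and the 43/30/27 class counts are assumptions chosen only to resemble the reported accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def report(y_true, y_pred):
    """Hypothetical helper: accuracy plus macro/micro precision, recall, F1."""
    out = {"acc": accuracy_score(y_true, y_pred)}
    for avg in ("macro", "micro"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0)
        out[avg] = (p, r, f1)
    return out


# Made-up test set: 43 / 30 / 27 samples of classes 0 / 1 / 2.
y_true = np.array([0] * 43 + [1] * 30 + [2] * 27)
# A degenerate classifier that always answers class 0.
y_pred = np.zeros_like(y_true)
scores = report(y_true, y_pred)
```

In single-label classification the micro-averaged precision, recall, and F1 are all equal to accuracy, which is why the three micro values above match ACC.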
Summary
The results are very poor. The main reasons are probably:
  • far too many zero padding vectors, so most of what the model receives is zeros;
  • the LSTM hyperparameters were not tuned at all (laziness).
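One common way to address the padding problem is to pass each sentence's true length to the LSTM, so that it stops at the last real token instead of stepping through the zeros. The sketch below uses PyTorch's `pack_padded_sequence` for this. It is not the code from the post: the class name `PackedLSTMClassifier`, the hidden size, and the toy batch are assumptions, and whether this alone would fix the scores above has not been tested.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class PackedLSTMClassifier(nn.Module):
    """Hypothetical LSTM + linear classifier that skips padded positions."""

    def __init__(self, emb_size=128, hidden_size=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x, lengths):
        # x: (batch, max_len, emb_size); lengths: true token count per sentence
        packed = pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the hidden state at each sentence's last real token
        return self.fc(h_n[-1])


torch.manual_seed(0)
packed_clf = PackedLSTMClassifier()
# Toy batch: sentence 0 has 2 real tokens, sentence 1 has 5 (random stand-ins).
x = torch.zeros(2, 5, 128)
x[0, :2] = torch.randn(2, 128)
x[1, :5] = torch.randn(5, 128)
lengths = torch.tensor([2, 5])
packed_logits = packed_clf(x, lengths)
```

With packing, whatever sits in the padded positions no longer influences the output. Reducing `max_len` (for example, truncating very long sentences) would also lower the share of zeros per batch.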
