word2vec + LSTM for sentence classification: a simple example
Data
30k texts, split into train/val/test at a 6:2:2 ratio.
Tools and approach
PyTorch, scikit-learn, and gensim's word2vec.
Sentences are represented with word2vec embeddings; after padding to a fixed length, an LSTM + linear layer classifies the resulting sequence of word vectors.
Code
import jieba
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec

# reorganize data
def get_split_sentences(file_path):
    res_sen = []
    with open(file_path) as f:
        for line in f:
            split_query = jieba.lcut(line.strip())
            res_sen.append(split_query)
    return res_sen

label2_sentences = get_split_sentences('label2.csv')
label0_sentences = get_split_sentences('label0.csv')
label1_sentences = get_split_sentences('label1.csv')

all_sentences = []
all_sentences.extend(label0_sentences)
all_sentences.extend(label1_sentences)
all_sentences.extend(label2_sentences)

# set params
emb_size = 128
win = 3
model = Word2Vec(sentences=all_sentences, vector_size=emb_size, window=win, min_count=1)

# retrieve word embeddings
w2vec = model.wv

# assemble sentence embeddings
def assemble_x(w2vec, sentences):
    # max_len and emb_size are set in the surrounding (truncated) code
    sen_vs = []
    for sen in sentences:
        v = np.vstack([w2vec[w] for w in sen])
        v_len = v.shape[0]
        # pad with zero rows up to max_len, or truncate longer sentences
        sen_v = np.concatenate((v, np.zeros((max_len - v_len, emb_size)))) if v_len < max_len else v[:max_len]
        sen_vs.append(sen_v)
    return np.stack(sen_vs)
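The LSTM classifier itself is missing from the code excerpt above. A minimal sketch of the described LSTM + linear setup in PyTorch (hidden size and class count are assumptions, not from the original):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, emb_size=128, hidden_size=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, emb_size) padded sentence matrices
        _, (h_n, _) = self.lstm(x)
        # use the final hidden state as the sentence representation
        return self.fc(h_n[-1])

model = LSTMClassifier()
logits = model(torch.zeros(4, 10, 128))
print(logits.shape)  # torch.Size([4, 3])
```

The output of `assemble_x` can be converted with `torch.from_numpy(...).float()` and fed straight into this module.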
Results
ACC: 0.4303
macro:
Recall: 0.3333
F1-score: 0.2006
Precision: 0.1434
micro:
Recall: 0.4303
F1-score: 0.4303
Precision: 0.4303
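The macro/micro numbers above look like the output of scikit-learn's averaging metrics; a toy sketch of how such a report is produced (the labels here are made up, not the post's data):

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [0, 1, 2, 1, 0, 2]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]   # toy predictions
acc = accuracy_score(y_true, y_pred)
for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Note that for single-label multiclass data, micro precision, recall, and F1 all equal the accuracy, which is exactly the pattern in the results above.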
Summary
The results are very poor, mainly because:
- padding introduces too many zero vectors, so most of what the model sees is zeros;
- no hyperparameter tuning was done on the LSTM (out of laziness).
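The padding problem can be mitigated by telling the LSTM the true sentence lengths so it skips the padded steps; a sketch using PyTorch's `pack_padded_sequence` (sizes here are illustrative):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

lstm = torch.nn.LSTM(128, 64, batch_first=True)
x = torch.zeros(2, 10, 128)       # padded batch: 2 sentences, max length 10
lengths = torch.tensor([10, 4])   # true lengths before padding
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)        # the LSTM ignores the padded time steps
print(h_n.shape)  # torch.Size([1, 2, 64])
```

With packing, the final hidden state of each sentence is taken at its real last token rather than after a run of zero vectors.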