Python Machine Learning NLP: Modeling Movie Reviews with Word2vec
Contents
- Overview
- Word vectors
- Word vector dimensionality
- Code implementation
- Preprocessing
- Main program
Overview
Starting today, we begin a journey into natural language processing (NLP). NLP lets computers process, understand, and use human language, building a bridge between machine language and human language.
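Word vectors
Let's start with what a word vector actually is. When we hand text to an algorithm, the computer cannot understand the raw text we give it, and that is why word vectors exist. Put simply, a word vector is a word converted into a vector of numbers.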
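When we describe a person, we use measurements such as height and weight, and these measurements can be treated as a vector. Once we have vectors, we can compute the similarity between them in several different ways.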
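As a rough illustration of comparing such vectors, the sketch below treats three hypothetical people as (height, weight) vectors and compares them with two common measures, Euclidean distance and cosine similarity. The names and numbers are made up for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical people described as (height in cm, weight in kg)
alice = (165.0, 55.0)
bob = (180.0, 80.0)
carol = (166.0, 56.0)

# Carol is closer to Alice than Bob is, under both measures
print(euclidean_distance(alice, carol), euclidean_distance(alice, bob))
print(cosine_similarity(alice, carol), cosine_similarity(alice, bob))
```

A smaller distance, or a cosine closer to 1, means the two vectors are more alike.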
So how do we describe the features of language? We split the text into individual words and then build features at the word level.
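One simple way to build word-level features is to split each text into words and count them. The sketch below is a minimal pure-Python version of that idea. The `tokenize` and `bag_of_words` helpers and the two sample sentences are illustrative assumptions, not part of the article's code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    # The vocabulary is every distinct word, in sorted order
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One count per vocabulary word; absent words count as 0
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


docs = ["A great movie, great cast.", "A boring movie."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['a', 'boring', 'cast', 'great', 'movie']
print(vectors)  # [[1, 0, 1, 2, 1], [1, 1, 0, 0, 1]]
```

Each document becomes a vector with one entry per vocabulary word, which is the same kind of feature that `CountVectorizer` produces at a larger scale.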
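Word vector dimensionality
In general, the more dimensions a word vector has, the more information it can carry and the more trustworthy the results computed from it tend to be, though higher-dimensional vectors also need more data and computation to train.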
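A 50-dimensional word vector:
[Figure: a 50-dimensional word vector]
The same kind of vector shown as a heatmap:
[Figure: word vectors rendered as heatmaps]
The heatmaps show that similar words have similar feature representations, which suggests that the learned word features are meaningful.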
Code implementation
Preprocessing
import itertools
import re

import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Stop words, one per line. A set keeps the lookup in pre_process fast.
with open("data/stopwords.txt", encoding="utf-8") as f:
    stop_words = {line.strip() for line in f if line.strip()}


def load_train_data():
    """Read the labeled training data."""
    data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
    print(data[:5])
    print("Number of training reviews:", len(data))  # 25,000
    return data


def load_test_data():
    """Read the unlabeled reviews, used here as the test corpus."""
    data = pd.read_csv("data/unlabeledTrainData.tsv", sep="\t", escapechar="\\")
    print("Number of test reviews:", len(data))  # 50,000
    return data


def pre_process(text):
    """Clean one review: strip HTML, keep letters only, lowercase, drop stop words."""
    # Strip HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Replace everything that is not a letter (punctuation, digits) with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split into words
    words = text.lower().split()
    # Remove stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


def split_train_data():
    """Build bag-of-words features and split them into training and test sets."""
    data = pd.read_csv("data/train.csv")
    print(data.head())
    # Extract bag-of-words features
    vec = CountVectorizer(max_features=5000)
    # Fit the vocabulary
    vec.fit(data["review"])
    # Transform the reviews into count vectors
    train_data_features = vec.transform(data["review"]).toarray()
    print(train_data_features.shape)
    # The vocabulary (get_feature_names() was removed in scikit-learn 1.2)
    print(vec.get_feature_names_out())
    # Split the data set
    X_train, X_test, y_train, y_test = train_test_split(
        train_data_features, data["sentiment"], test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test


def test():
    """Split the first ten test reviews into sentences."""
    data = pd.read_csv("data/test.csv")
    print(data.head())
    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

    def split_sentences(review):
        # Sentence tokenization
        return tokenizer.tokenize(review.strip())

    # Concatenate the per-review sentence lists into one flat list
    sentences = sum(data["review"][:10].apply(split_sentences), [])
    return sentences


def visualize(cm, classes, title="Confusion matrix", cmap=plt.cm.Blues):
    """Plot a confusion matrix."""
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # White text on dark cells, black text on light cells
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()


if __name__ == '__main__':
    # # Process the training data
    # train_data = load_train_data()
    # train_data["review"] = train_data["review"].apply(pre_process)
    # print(train_data.head())
    #
    # # Save
    # train_data.to_csv("data/train.csv")

    # # Process the test data
    # test_data = load_test_data()
    # test_data["review"] = test_data["review"].apply(pre_process)
    # print(test_data.head())
    #
    # # Save
    # test_data.to_csv("data/test.csv")

    split_train_data()
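To see what the cleaning step does to a single review, here is a small self-contained sketch. It is not the article's `pre_process`: to avoid external dependencies it swaps BeautifulSoup for a crude regular expression that removes tags, and it uses a tiny hypothetical stop-word set instead of `data/stopwords.txt`.

```python
import re

# Hypothetical stop-word set, for illustration only
SAMPLE_STOP_WORDS = {"the", "a", "is", "it", "this", "and", "of", "s"}


def clean_review(text, stop_words=SAMPLE_STOP_WORDS):
    """Strip tags, keep letters only, lowercase, and drop stop words."""
    # Crude tag removal, a stand-in for BeautifulSoup(...).get_text()
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace every non-letter with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, split on whitespace, and filter
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)


raw = "This movie is GREAT!<br /><br />It's the best of 2009."
print(clean_review(raw))  # movie great best
```

Digits and punctuation disappear along with the markup, which is why the cleaned reviews in `train.csv` and `test.csv` contain only lowercase words.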
Main program
import nltk
import pandas as pd
from gensim.models.word2vec import Word2Vec


def pre_process():
    """Tokenize the preprocessed test reviews."""
    # Read the test data
    data = pd.read_csv("data/test.csv")
    print(data.head())
    # Holds one token list per review
    result = []
    # Tokenize; dropna() skips reviews left empty by cleaning, which pandas reads back as NaN
    for line in data["review"].dropna():
        result.append(nltk.word_tokenize(line))
    return result


def main():
    # Tokenized corpus
    word_list = pre_process()
    # Word2Vec training parameters
    num_features = 300   # Word vector dimensionality
    min_word_count = 40  # Minimum word count
    num_workers = 4      # Number of threads to run in parallel
    context = 10         # Context window size
    model_name = '{}features_{}minwords_{}context.model'.format(
        num_features, min_word_count, context)
    # Train the Word2Vec model (gensim 4 parameter names)
    model = Word2Vec(sentences=word_list, workers=num_workers,
                     vector_size=num_features, min_count=min_word_count,
                     window=context)
    # Save the model
    model.save(model_name)


def test():
    # Load the model saved by main()
    model = Word2Vec.load("300features_40minwords_10context.model")
    # The word that does not belong with the others
    match = model.wv.doesnt_match(['man', 'woman', 'child', 'kitchen'])
    print(match)
    # Most similar words
    print(model.wv.most_similar("boy"))
    print(model.wv.most_similar("bad"))


if __name__ == '__main__':
    # main() must have been run once so that the saved model exists
    test()
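The two queries in `test()` both come down to cosine similarity between word vectors. The sketch below shows the idea on made-up 3-dimensional vectors. It approximates what `doesnt_match` and `most_similar` compute and is not gensim's actual code; `odd_one_out`, `nearest`, and `toy_vectors` are hypothetical names.

```python
import math

# Made-up 3-dimensional vectors; a trained model would supply 300-dimensional ones
toy_vectors = {
    "man": [0.9, 0.8, 0.1],
    "woman": [0.8, 0.9, 0.1],
    "child": [0.6, 0.6, 0.4],
    "kitchen": [0.1, 0.2, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    sims = [dot(u, mean) for u in units]
    return words[sims.index(min(sims))]


def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = unit(vectors[word])
    scored = [(other, dot(unit(vec), query))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


print(odd_one_out(["man", "woman", "child", "kitchen"], toy_vectors))  # kitchen
print(nearest("man", toy_vectors))
```

With real embeddings the same comparison is made over the whole vocabulary, so the quality of the answers depends on how well the vectors were trained.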
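Output (the log below comes from a Keras LSTM classifier run on the preprocessed reviews, not from the Word2Vec script above):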
2021-09-16 20:36:40.791181: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
   Unnamed: 0      id  sentiment                                             review
0           0  5814_8          1  stuff moment mj ve started listening music wat...
1           1  2381_9          1  classic war worlds timothy hines entertaining ...
2           2  7759_3          0  film starts manager nicholas bell investors ro...
3           3  3630_4          0  assumed praised film filmed opera didn read do...
4           4  9495_8          1  superbly trashy wondrously unpretentious explo...
73423
[[...]] (five zero-padded sequences of integer word indices; the individual values are run together in the source)
[[0. 1.] [0. 1.] [0. 1.] [1. 0.] [0. 1.]]
2021-09-16 20:36:46.488438: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-16 20:36:46.489070: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu
2021-09-16 20:36:46.489097: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-16 20:36:46.489128: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (313c6f2d15e2): /proc/driver/nvidia/version does not exist
2021-09-16 20:36:46.489488: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-16 20:36:46.493241: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 200)         14684800
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800
_________________________________________________________________
dropout (Dropout)            (None, 200)               0
_________________________________________________________________
dense (Dense)                (None, 64)                12864
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130
=================================================================
Total params: 15,018,594
Trainable params: 15,018,594
Non-trainable params: 0
_________________________________________________________________
None
2021-09-16 20:36:46.792534: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-16 20:36:46.830442: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/2
313/313 [==============================] - 101s 315ms/step - loss: 0.5581 - accuracy: 0.7229 - val_loss: 0.3703 - val_accuracy: 0.8486
Epoch 2/2
313/313 [==============================] - 98s 312ms/step - loss: 0.2174 - accuracy: 0.9195 - val_loss: 0.3016 - val_accuracy: 0.8822
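The parameter counts in the model summary above can be checked by hand with the standard formulas for embedding, LSTM, and dense layers. In the sketch below, the extra embedding row (73423 + 1) is an assumption that makes the arithmetic match the log; the model's source code is not shown, so the reason for that row (for example a reserved padding index) is a guess.

```python
def embedding_params(rows, dim):
    """One dim-sized vector per vocabulary row."""
    return rows * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(inputs, outputs):
    """A weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs


counts = {
    # 73423 is the number printed in the log; the +1 row is an assumption
    "embedding": embedding_params(73423 + 1, 200),
    "lstm": lstm_params(200, 200),
    "dense": dense_params(200, 64),
    "dense_1": dense_params(64, 2),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print("total", total)  # total 15018594
```

The embedding layer accounts for almost all of the 15,018,594 parameters, since its size grows with both the vocabulary and the vector dimensionality.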