Python Machine Learning NLP Basics: Movie Review Analysis

Overview
RNN

Weight sharing
Computation

LSTM

Stages

Code

Preprocessing
Main function

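Overview
Starting today, we begin a journey into natural language processing (NLP). NLP lets machines process, understand, and use human language, building a bridge between machine language and human language.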

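RNN
RNN (Recurrent Neural Network) is a neural network with recurrent connections. Compared with a CNN, an RNN is better suited to sequence data because it can capture the relationship between earlier and later elements. In NLP tasks, the probability of a word depends heavily on its context. For example: P("the weather tomorrow is really nice") > P("the weather tomorrow basketball").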

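Weight sharing

[Figure: traditional neural network]

[Figure: RNN]

Weight sharing in an RNN is similar to weight sharing in a CNN: the same weights are reused at every time step, which greatly reduces the number of parameters.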
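
To make the saving concrete, here is a small sketch that counts parameters for a simple recurrent cell. The function names and the comparison network (one separate copy of the weights per time step) are assumptions made for illustration; they do not come from the article.

```python
def shared_rnn_params(input_dim, hidden_dim):
    """Parameters of one simple recurrent cell, reused at every time step."""
    w_xh = input_dim * hidden_dim   # input -> hidden weights
    w_hh = hidden_dim * hidden_dim  # hidden -> hidden weights
    bias = hidden_dim
    return w_xh + w_hh + bias


def unshared_params(input_dim, hidden_dim, seq_len):
    """Hypothetical network that keeps a separate weight copy per time step."""
    return seq_len * shared_rnn_params(input_dim, hidden_dim)


if __name__ == "__main__":
    print(shared_rnn_params(200, 200))     # 80200
    print(unshared_params(200, 200, 200))  # 16040000
```

With shared weights the count does not depend on the sequence length at all, whereas the unshared variant grows linearly with it.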

Computation

[Figure: computation process]

[Figure: computing the state]

[Figure: computing the output]
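
The two steps, computing the state and then the output, can be sketched in plain Python. This assumes the common formulation h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) and y_t = W_hy h_t + b_y; the article's own formulas may use different symbols or activations, so treat the names below as illustrative.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Compute the new state from the current input and the previous state."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]


def rnn_forward(xs, h0, W_xh, W_hh, b_h, W_hy, b_y):
    """Run the same cell over every time step; weights are shared across steps."""
    h = h0
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        y_t = [o + c for o, c in zip(matvec(W_hy, h), b_y)]
        outputs.append(y_t)
    return outputs, h


if __name__ == "__main__":
    outputs, h_last = rnn_forward(
        xs=[[1.0], [1.0]], h0=[0.0],
        W_xh=[[1.0]], W_hh=[[1.0]], b_h=[0.0],
        W_hy=[[2.0]], b_y=[0.0],
    )
    print(outputs, h_last)
```

Even though both inputs are identical, the second state differs from the first because it also depends on the previous state.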

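LSTM
LSTM (Long Short-Term Memory) is a special kind of RNN designed to mitigate the vanishing and exploding gradient problems that arise when training on long sequences. Compared with a plain RNN, an LSTM performs better on longer sequences. Whereas a plain RNN passes along a single state, h_t, an LSTM passes along two: c_t (cell state) and h_t (hidden state).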


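Stages
An LSTM uses gates to control how state is passed along.
An LSTM step has three stages:

Forget stage: selectively forget part of what was passed in from the previous node
Selective memory stage: selectively memorize the input of the current step. Important information is recorded in more detail, unimportant information in less
Output stage: decide what is emitted as the output of the current state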
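
The three stages map onto the gates of a standard LSTM cell. Below is a minimal scalar sketch (hidden size 1) under that assumption; the parameter dictionary `p` and the gate names are hypothetical and chosen only for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, b)."""
    def pre_activation(gate):
        w_x, w_h, b = p[gate]
        return w_x * x_t + w_h * h_prev + b

    # Forget stage: decide how much of the previous cell state to keep.
    f = sigmoid(pre_activation("forget"))
    # Selective memory stage: decide how much of the new candidate to store.
    i = sigmoid(pre_activation("input"))
    g = math.tanh(pre_activation("candidate"))
    c_t = f * c_prev + i * g
    # Output stage: decide how much of the cell state becomes the hidden state.
    o = sigmoid(pre_activation("output"))
    h_t = o * math.tanh(c_t)
    return h_t, c_t


if __name__ == "__main__":
    zero = (0.0, 0.0, 0.0)
    params = {"forget": zero, "input": zero, "candidate": zero, "output": zero}
    print(lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=params))
```

With all parameters at zero every gate is half open, so the cell state is simply halved; a large positive forget bias would instead carry c_prev through almost unchanged.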

Code
Preprocessing

import re

import pandas as pd
import tensorflow as tf
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split

# Stop words, one per line; a set makes the membership test in pre_process fast
with open("data/stopwords.txt", encoding="utf-8") as f:
    stop_words = {line.strip() for line in f if line.strip()}


def load_data():
    """Read the labeled training data with pandas."""
    data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
    print(data[:5])
    print("Number of reviews:", len(data))
    return data


def pre_process(text):
    # Strip HTML markup
    text = BeautifulSoup(text, "html.parser").get_text()
    # Replace everything that is not a letter (punctuation, digits) with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split into words
    words = text.lower().split()
    # Drop stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


def split_data():
    # Read the cleaned reviews
    data = pd.read_csv("data/train.csv")
    print(data.head())

    # Fit a tokenizer on the reviews
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(data["review"])

    # Vocabulary: word -> integer index
    word_index = tokenizer.word_index
    print(word_index)
    print(len(word_index))

    # Convert each review to a sequence of word indices
    sequence = tokenizer.texts_to_sequences(data["review"])

    # Pad (or truncate) every sequence to length 200
    character = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=200)

    # One-hot encode the labels
    labels = tf.keras.utils.to_categorical(data["sentiment"])

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        character, labels, test_size=0.2, random_state=0
    )
    return X_train, X_test, y_train, y_test


if __name__ == '__main__':
    # data = load_data()
    # data["review"] = data["review"].apply(pre_process)
    # print(data.head())
    #
    # # Save
    # data.to_csv("data.csv")
    split_data()
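
To show what the tokenizing and padding steps produce, here is a dependency-free sketch that approximates the behavior of the Keras `Tokenizer` and `pad_sequences` on already-cleaned text. The helper names are made up for this example. It assumes indices are assigned by descending word frequency starting at 1, and that padding and truncation both happen at the front, which matches the Keras defaults.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an index; more frequent words get smaller indices."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros (index 0 = padding)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


if __name__ == "__main__":
    reviews = ["good film", "bad film film"]
    index = build_word_index(reviews)
    sequences = texts_to_sequences(reviews, index)
    print(index)                                # {'film': 1, 'good': 2, 'bad': 3}
    print([pad_pre(s, 4) for s in sequences])   # [[0, 0, 2, 1], [0, 3, 1, 1]]
```

Because index 0 is reserved for padding, the embedding layer needs one more row than there are words in the vocabulary.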

Main function

import tensorflow as tf

from lstm_pre_processing import split_data


def main():
    # Load the data
    X_train, X_test, y_train, y_test = split_data()
    print(X_train[:5])
    print(y_train[:5])

    # Hyperparameters
    EMBEDDING_DIM = 200  # embedding dimension
    optimizer = tf.keras.optimizers.RMSprop()
    # The last layer already applies softmax, so the loss receives
    # probabilities rather than logits (from_logits=False)
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)

    # Model
    model = tf.keras.Sequential([
        # 73423 words in the vocabulary, plus index 0 for padding
        tf.keras.layers.Embedding(73424, EMBEDDING_DIM),
        tf.keras.layers.LSTM(200, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    # Sequences are padded to length 200 in split_data
    model.build(input_shape=[None, 200])
    print(model.summary())

    # Compile
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # Train
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=32)

    # Save the model
    model.save("movie_model.h5")


if __name__ == '__main__':
    main()
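
The model ends in a softmax layer and is trained with categorical cross-entropy. The sketch below, in plain Python with illustrative function names, shows what that loss computes for one sample and what happens when softmax is applied a second time to values that are already probabilities, which is the effect of setting `from_logits=True` on top of a softmax output layer.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


if __name__ == "__main__":
    logits = [4.0, -4.0]
    once = softmax(logits)   # what a softmax output layer produces
    twice = softmax(once)    # softmax applied again to probabilities
    print(once, twice)
    print(categorical_crossentropy([1.0, 0.0], once))
    print(categorical_crossentropy([1.0, 0.0], twice))
```

Since probabilities lie between 0 and 1, the second softmax can never push the larger class above roughly 0.73 for two classes, so the loss stays artificially high. This is why the `from_logits` setting should match the activation of the final layer.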

Output:

2021-09-14 22:20:56.974310: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
   Unnamed: 0      id  sentiment                                             review
0           0  5814_8          1  stuff moment mj ve started listening music wat...
1           1  2381_9          1  classic war worlds timothy hines entertaining ...
2           2  7759_3          0  film starts manager nicholas bell investors ro...
3           3  3630_4          0  assumed praised film filmed opera didn read do...
4           4  9495_8          1  superbly trashy wondrously unpretentious explo...
73423
[[...]]  (first five rows of X_train, each a zero-padded sequence of 200 word indices)
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]]
2021-09-14 22:21:02.212681: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-14 22:21:02.213245: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu
2021-09-14 22:21:02.213268: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-14 22:21:02.213305: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (5aa046a4f47b): /proc/driver/nvidia/version does not exist
2021-09-14 22:21:02.213624: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-14 22:21:02.216309: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 200)         14684800
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800
_________________________________________________________________
dropout (Dropout)            (None, 200)               0
_________________________________________________________________
dense (Dense)                (None, 64)                12864
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130
=================================================================
Total params: 15,018,594
Trainable params: 15,018,594
Non-trainable params: 0
_________________________________________________________________
None
2021-09-14 22:21:02.515404: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-14 22:21:02.547745: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/2
313/313 [==============================] - 97s 302ms/step - loss: 0.5112 - accuracy: 0.7510 - val_loss: 0.3607 - val_accuracy: 0.8628
Epoch 2/2
313/313 [==============================] - 94s 300ms/step - loss: 0.2090 - accuracy: 0.9236 - val_loss: 0.3078 - val_accuracy: 0.8790
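
The parameter counts in the model summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are illustrative and not part of the training script.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


if __name__ == "__main__":
    counts = [
        embedding_params(73424, 200),  # 14684800
        lstm_params(200, 200),         # 320800
        dense_params(200, 64),         # 12864
        dense_params(64, 2),           # 130
    ]
    print(counts, sum(counts))         # total: 15018594
```

Almost all of the roughly 15 million parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.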
