Python Machine Learning NLP Basics: Movie Review Sentiment Analysis

Contents

  • Overview
  • RNN
    • Weight sharing
    • Computation
  • LSTM
    • Stages
  • Code
    • Preprocessing
    • Main function

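Overview
This article starts a series on natural language processing (NLP). NLP enables computers to process, understand, and use human language, serving as a bridge between machine language and human language.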


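RNN
RNN stands for Recurrent Neural Network. Compared with a CNN, an RNN is better suited to sequential data because it can capture the relationship between earlier and later elements of a sequence. In NLP tasks, the probability of a word depends heavily on its context. For example, the sentence "The weather tomorrow is really nice" (明天天气真好) is far more probable than "The weather tomorrow basketball" (明天天气篮球).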


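Weight sharing
Traditional neural network: [figure not preserved]

RNN: [figure not preserved]

Weight sharing in an RNN is similar to weight sharing in a CNN: every time step uses the same weights, which greatly reduces the number of parameters.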
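To make the saving concrete, here is a minimal sketch that counts the parameters of a simple RNN layer and compares them with a hypothetical layer that keeps a separate copy of the weights for every time step. The function names and the sizes (200-dimensional inputs, 200 hidden units) are assumptions chosen for illustration.

```python
def rnn_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of a simple RNN layer: input weights, recurrent weights, bias.

    The same parameters are reused at every time step, so the count
    does not depend on the sequence length.
    """
    return input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim


def unshared_param_count(input_dim: int, hidden_dim: int, seq_len: int) -> int:
    """Hypothetical layer with an independent copy of the weights per time step."""
    return seq_len * rnn_param_count(input_dim, hidden_dim)


if __name__ == "__main__":
    shared = rnn_param_count(input_dim=200, hidden_dim=200)
    for seq_len in (10, 100, 200):
        unshared = unshared_param_count(200, 200, seq_len)
        print(f"seq_len={seq_len}: shared={shared:,} unshared={unshared:,}")
```

With weight sharing the count stays at 80,200 regardless of sequence length, while the unshared variant grows linearly with it.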

Computation
[figure not preserved]

Computing the state:
[formula not preserved]

Computing the output:
[formula not preserved]
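The formulas of this section are not reproduced in the text, so the following is only a sketch based on the common textbook (Elman) form of an RNN: the state is h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) and the output is y_t = softmax(W_hy h_t + b_y). The symbol names, the tanh and softmax choices, and the toy dimensions are assumptions and may differ from the notation the article originally used.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a simple RNN over xs of shape (T, input_dim).

    Returns the states, shape (T, hidden_dim), and outputs, shape (T, output_dim).
    """
    h = np.zeros(W_hh.shape[0])  # initial state h_0
    states, outputs = [], []
    for x_t in xs:  # the same weights are reused at every time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # compute the state
        y = softmax(W_hy @ h + b_y)               # compute the output
        states.append(h)
        outputs.append(y)
    return np.stack(states), np.stack(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
    xs = rng.normal(size=(T, input_dim))
    W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
    W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
    W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1
    b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)
    states, outputs = rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y)
    print(states.shape, outputs.shape)  # (5, 4) (5, 2)
```

Each state depends on the current input and on the previous state, which is how information from earlier in the sequence reaches later steps.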


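LSTM
LSTM stands for Long Short-Term Memory. It is a special kind of RNN designed to address the vanishing and exploding gradient problems that arise when training on long sequences. Compared with a plain RNN, an LSTM performs better on longer sequences. Whereas an RNN passes along a single state h_t, an LSTM passes along two: c_t (cell state) and h_t (hidden state).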


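Stages
An LSTM uses gates to control how state is passed along.
Each LSTM step consists of three stages:
  • Forget stage: selectively forget part of what was passed in from the previous node.
  • Selective memory stage: selectively memorize the input of the current step. Important information is recorded in detail; less important information is recorded more lightly.
  • Output stage: decide which information becomes the output of the current state.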
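The three stages can be sketched as a single LSTM step in NumPy. This follows the standard LSTM equations rather than anything specific to this article; the packing of the four parameter blocks into W, U, and b, and their order (forget, input, candidate, output), are assumptions made for this illustration and differ from the internal layout of some libraries.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4 * hidden, input_dim), U: (4 * hidden, hidden), b: (4 * hidden,).
    Row blocks, in order: forget gate, input gate, candidate, output gate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: how much of c_prev to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much new content to store
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate content for the cell state
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g               # forget stage + selective memory stage
    h_t = o * np.tanh(c_t)                 # output stage
    return h_t, c_t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    input_dim, hidden = 3, 2
    W = rng.normal(size=(4 * hidden, input_dim)) * 0.1
    U = rng.normal(size=(4 * hidden, hidden)) * 0.1
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in rng.normal(size=(5, input_dim)):
        h, c = lstm_step(x_t, h, c, W, U, b)
    print(h.shape, c.shape)  # (2,) (2,)
```

The cell state c_t is updated additively (old content scaled by the forget gate plus gated new content), which is what lets information survive across many steps.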

Code
Preprocessing
    import re

    import pandas as pd
    import tensorflow as tf
    from bs4 import BeautifulSoup
    from sklearn.model_selection import train_test_split

    # Stop words, one per line. Read with plain file I/O (recent pandas
    # versions reject sep="\n"); a set makes membership tests fast.
    with open("data/stopwords.txt", encoding="utf-8") as f:
        stop_words = {line.strip() for line in f if line.strip()}


    def load_data():
        """Read the labeled training corpus with pandas."""
        data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
        print(data[:5])
        print("Number of reviews:", len(data))
        return data


    def pre_process(text):
        """Clean one review and return it as a space-separated string of words."""
        # Strip HTML markup
        text = BeautifulSoup(text, "html.parser").get_text()
        # Replace everything except letters (punctuation, digits) with spaces
        text = re.sub("[^a-zA-Z]", " ", text)
        # Lowercase and split into words
        words = text.lower().split()
        # Remove stop words
        words = [w for w in words if w not in stop_words]
        return " ".join(words)


    def split_data():
        """Tokenize the cleaned reviews and split them into training and test sets."""
        # Read the cleaned file
        data = pd.read_csv("data/train.csv")
        print(data.head())

        # Create the tokenizer and fit it on the reviews
        tokenizer = tf.keras.preprocessing.text.Tokenizer()
        tokenizer.fit_on_texts(data["review"])

        # Vocabulary: word -> integer index
        word_index = tokenizer.word_index
        print(word_index)
        print(len(word_index))

        # Convert each review to a sequence of word indices
        sequence = tokenizer.texts_to_sequences(data["review"])

        # Pad (or truncate) every sequence to 200 tokens
        character = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=200)

        # One-hot encode the labels
        labels = tf.keras.utils.to_categorical(data["sentiment"])

        # Split into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(
            character, labels, test_size=0.2, random_state=0
        )
        return X_train, X_test, y_train, y_test


    if __name__ == '__main__':
        # data = load_data()
        # data["review"] = data["review"].apply(pre_process)
        # print(data.head())
        #
        # # Save
        # data.to_csv("data.csv")

        split_data()
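To show what the tokenizing and padding steps do to the text, here is a simplified pure-Python stand-in. It is not the Keras implementation: the helper names are hypothetical, punctuation filtering is omitted, and it only mirrors the behaviour relied on above (indices assigned by descending word frequency starting at 1, with zeros padded, and excess tokens cut, at the front of each sequence).

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


def pad_sequences(sequences, maxlen):
    """Left-pad with zeros, or keep only the last maxlen tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                             # truncate from the front
        padded.append([0] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


if __name__ == "__main__":
    reviews = ["great film great cast", "boring film"]
    word_index = build_word_index(reviews)
    sequences = texts_to_sequences(reviews, word_index)
    print(word_index)
    print(pad_sequences(sequences, maxlen=5))
```

Padding at the front keeps the real tokens next to the end of the sequence, which is the part an LSTM reads last before producing its final state.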


Main function
    import tensorflow as tf

    from lstm_pre_processing import split_data


    def main():
        # Load the data
        X_train, X_test, y_train, y_test = split_data()
        print(X_train[:5])
        print(y_train[:5])

        # Hyperparameters
        EMBEDDING_DIM = 200  # embedding dimension
        optimizer = tf.keras.optimizers.RMSprop()  # optimizer
        # The last layer applies softmax, so the loss receives probabilities, not logits
        loss = tf.losses.CategoricalCrossentropy(from_logits=False)  # loss

        # Model
        model = tf.keras.Sequential([
            # 73424 = vocabulary size (73423 words) + 1 for the padding index 0
            tf.keras.layers.Embedding(73424, EMBEDDING_DIM),
            tf.keras.layers.LSTM(200, dropout=0.2, recurrent_dropout=0.2),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])
        # Sequence length matches maxlen=200 used in split_data
        model.build(input_shape=[None, 200])
        print(model.summary())

        # Compile
        model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

        # Train
        model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=32)

        # Save the model
        model.save("movie_model.h5")


    if __name__ == '__main__':
        # Entry point
        main()

Output:
    2021-09-14 22:20:56.974310: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
       Unnamed: 0      id  sentiment                                             review
    0           0  5814_8          1  stuff moment mj ve started listening music wat...
    1           1  2381_9          1  classic war worlds timothy hines entertaining ...
    2           2  7759_3          0  film starts manager nicholas bell investors ro...
    3           3  3630_4          0  assumed praised film filmed opera didn read do...
    4           4  9495_8          1  superbly trashy wondrously unpretentious explo...
    73423
    [X_train[:5]: five integer sequences of length 200, left-padded with zeros; the individual token ids are run together in the source and cannot be recovered]
    [[0. 1.]
     [0. 1.]
     [0. 1.]
     [1. 0.]
     [0. 1.]]
    2021-09-14 22:21:02.212681: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
    2021-09-14 22:21:02.213245: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu
    2021-09-14 22:21:02.213268: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
    2021-09-14 22:21:02.213305: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (5aa046a4f47b): /proc/driver/nvidia/version does not exist
    2021-09-14 22:21:02.213624: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-09-14 22:21:02.216309: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding (Embedding)        (None, None, 200)         14684800
    _________________________________________________________________
    lstm (LSTM)                  (None, 200)               320800
    _________________________________________________________________
    dropout (Dropout)            (None, 200)               0
    _________________________________________________________________
    dense (Dense)                (None, 64)                12864
    _________________________________________________________________
    dense_1 (Dense)              (None, 2)                 130
    =================================================================
    Total params: 15,018,594
    Trainable params: 15,018,594
    Non-trainable params: 0
    _________________________________________________________________
    None
    2021-09-14 22:21:02.515404: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
    2021-09-14 22:21:02.547745: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
    Epoch 1/2
    313/313 [==============================] - 97s 302ms/step - loss: 0.5112 - accuracy: 0.7510 - val_loss: 0.3607 - val_accuracy: 0.8628
    Epoch 2/2
    313/313 [==============================] - 94s 300ms/step - loss: 0.2090 - accuracy: 0.9236 - val_loss: 0.3078 - val_accuracy: 0.8790
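The parameter counts reported in the model summary can be checked by hand. The sketch below uses the usual formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical, while the layer sizes are taken from the model definition above.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four blocks (three gates plus the candidate), each with
    input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus bias."""
    return input_dim * units + units


if __name__ == "__main__":
    counts = {
        "embedding": embedding_params(73424, 200),
        "lstm": lstm_params(200, 200),
        "dense": dense_params(200, 64),
        "dense_1": dense_params(64, 2),
    }
    for name, n in counts.items():
        print(f"{name}: {n:,}")
    print(f"total: {sum(counts.values()):,}")  # total: 15,018,594
```

Almost all of the roughly 15 million parameters sit in the embedding layer, which is a direct consequence of the 73,424-entry vocabulary.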
