How to Build a Text Generator Using TensorFlow 2 and Keras in Python

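Build a deep learning model that generates human-readable text using recurrent neural networks (RNNs) and LSTMs with the TensorFlow and Keras frameworks in Python.
Recurrent neural networks (RNNs) are very powerful sequence models for classification problems. In this tutorial, however, we will do something different: we will use RNNs as generative models, which means they can learn the sequences of a problem and then generate entirely new sequences for the problem domain.
After reading this tutorial, you will know how to build an LSTM model in Python with TensorFlow and Keras that generates text character by character.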
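In text generation, we show the model many training examples so that it can learn the pattern between the input and the output. Each input is a sequence of characters, and the output is the single character that follows it. For example, say we want to train on the sentence "python is a great language". The input of the first sample is "python is a great langua" and the output is "g". The input of the second sample is "ython is a great languag" and the output is "e", and so on, until we have covered the whole dataset. We need to show the model as many examples as we can so that it can make reasonable predictions.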
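The sliding-window idea in the paragraph above can be sketched in plain Python. This is only an illustration: the helper name `make_pairs` and the window length of 24 are assumptions chosen to match the example sentence, and the tutorial's real pipeline uses `tf.data` instead.

```python
def make_pairs(text, window):
    """Return (input, next_char) pairs from a sliding window of `window` characters."""
    pairs = []
    for start in range(len(text) - window):
        # the input is `window` characters, the target is the character right after it
        pairs.append((text[start:start + window], text[start + window]))
    return pairs


sentence = "python is a great language"
pairs = make_pairs(sentence, window=24)
for input_text, target in pairs:
    print(repr(input_text), "->", repr(target))
```

With a 26-character sentence and a 24-character window, only two pairs exist; a full book yields a pair for almost every character position.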
First, install the required dependencies:
pip3 install tensorflow==2.0.1 numpy requests tqdm

import tensorflow as tf
import numpy as np
import os
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from string import punctuation

Preparing the Dataset
We will use a freely downloadable book as the dataset for this tutorial: Alice's Adventures in Wonderland by Lewis Carroll. You can, however, use any book or corpus you want.
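import os
import requests

# create the data folder first, otherwise open() fails when it does not exist
os.makedirs("data", exist_ok=True)
content = requests.get("http://www.gutenberg.org/cache/epub/11/pg11.txt").text
open("data/wonderland.txt", "w", encoding="utf-8").write(content)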

sequence_length = 100
BATCH_SIZE = 128
EPOCHS = 30
# dataset file path
FILE_PATH = "data/wonderland.txt"
BASENAME = os.path.basename(FILE_PATH)
# read the data
text = open(FILE_PATH, encoding="utf-8").read()
# remove caps, comment this code if you want uppercase characters as well
text = text.lower()
# remove punctuation
text = text.translate(str.maketrans("", "", punctuation))

# print some stats
n_chars = len(text)
vocab = ''.join(sorted(set(text)))
print("unique_chars:", vocab)
n_unique_chars = len(vocab)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)

Output:
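unique_chars: 0123456789abcdefghijklmnopqrstuvwxyz
Number of characters: 154207
Number of unique characters: 39

Whitespace and other non-printing characters in the vocabulary do not show up in the printed line, which is why fewer than 39 characters are visible.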
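To see what the cleaning step does to a single line, here is a small sketch that applies the same lowercasing and punctuation removal to a made-up string. The helper name `clean_text` and the sample string are assumptions for illustration only.

```python
from string import punctuation


def clean_text(raw):
    """Lowercase the text and delete every character listed in string.punctuation."""
    return raw.lower().translate(str.maketrans("", "", punctuation))


sample = "Alice's Adventures, in Wonderland!"
cleaned = clean_text(sample)
# the vocabulary is the sorted set of characters that remain after cleaning
vocab_sample = "".join(sorted(set(cleaned)))
print(cleaned)
print(repr(vocab_sample))
```

Note that spaces survive the cleaning, so they become part of the vocabulary alongside the letters.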

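Now that we have loaded and cleaned the dataset, we need a way to convert these characters into integers. There are many Keras and Scikit-Learn utilities for that, but we will do it manually in Python.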
# dictionary that converts characters to integers
char2int = {c: i for i, c in enumerate(vocab)}
# dictionary that converts integers to characters
int2char = {i: c for i, c in enumerate(vocab)}

# save these dictionaries for later generation
pickle.dump(char2int, open(f"{BASENAME}-char2int.pickle", "wb"))
pickle.dump(int2char, open(f"{BASENAME}-int2char.pickle", "wb"))

Now let's encode our dataset; in other words, we convert each character into its corresponding integer:
# convert all text into integers
encoded_text = np.array([char2int[c] for c in text])

# construct tf.data.Dataset object
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

# print first 8 characters
for char in char_dataset.take(8):
    print(char.numpy(), int2char[char.numpy()])

This takes the first 8 characters and prints them along with their integer representation:
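38
27 p
29 r
26 o
21 j
16 e
14 c

The first character (index 38) is not visible when printed; it is most likely the byte-order mark at the very start of the downloaded file.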
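A quick way to convince yourself that the two dictionaries are inverses of each other is a round trip on a short string. The sketch below rebuilds the mappings on a made-up sample (the `*_sample` names are assumptions, so they do not clash with the tutorial's variables) and uses plain lists rather than NumPy.

```python
text_sample = "alice was beginning"

# same construction as in the tutorial, applied to the small sample
vocab_sample = "".join(sorted(set(text_sample)))
char2int_sample = {c: i for i, c in enumerate(vocab_sample)}
int2char_sample = {i: c for i, c in enumerate(vocab_sample)}

# encode every character, then decode the integers back into text
encoded = [char2int_sample[c] for c in text_sample]
decoded = "".join(int2char_sample[i] for i in encoded)

print(encoded)
print(decoded)
```

If the decoded string ever differs from the original, the vocabulary was built from a different text than the one being encoded.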

# build sequences by batching
sequences = char_dataset.batch(2*sequence_length + 1, drop_remainder=True)

# print sequences
for sequence in sequences.take(2):
    print(''.join([int2char[i] for i in sequence.numpy()]))

You may notice that I used a size of 2*sequence_length + 1 for each sample; you will see why shortly. Check the output:
project gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoeveryou may copy it give it away orreuse it under the terms of the project gutenberg license included with this ebook or online at wwwgutenbergorg
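The effect of batching characters with `drop_remainder=True` can be mimicked on an ordinary string. The following is a rough sketch, assuming a hypothetical helper `batch_chars`; it also works out how many 201-character sequences the 154207 characters reported earlier would produce.

```python
def batch_chars(text, size):
    """Split text into consecutive chunks of `size` characters, dropping a shorter tail."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]


# 'gh' is shorter than 3 characters, so it is dropped
chunks = batch_chars("abcdefgh", 3)
print(chunks)

# number of full sequences for the figures printed earlier in the tutorial
sequence_length = 100
n_sequences = 154207 // (2 * sequence_length + 1)
print(n_sequences)  # 767
```

The 40 characters left over after the last full sequence are simply discarded.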

def split_sample(sample):
    # example :
    # sequence_length is 10
    # sample is "python is a great pro" (21 length)
    # ds will equal to ('python is ', 'a') encoded as integers
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(1, (len(sample)-1) // 2):
        # first (input_, target) will be ('ython is a', ' ')
        # second (input_, target) will be ('thon is a ', 'g')
        # third (input_, target) will be ('hon is a g', 'r')
        # and so on
        input_ = sample[i: i+sequence_length]
        target = sample[i+sequence_length]
        # extend the dataset with these samples by concatenate() method
        other_ds = tf.data.Dataset.from_tensors((input_, target))
        ds = ds.concatenate(other_ds)
    return ds

# prepare inputs and targets
dataset = sequences.flat_map(split_sample)

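To better understand how the code above works, let's take an example. Say we have a sequence length of 10 (too small, but good for explanation). The sample argument is a sequence of 21 characters (remember the 2*sequence_length + 1) encoded as integers; for convenience, let's imagine it is not encoded, say it is "python is a great pro".
The first data sample we generate is the (input, target) tuple ('python is ', 'a'), the second is ('ython is a', ' '), the third is ('thon is a ', 'g'), and so on. We do this for every sample, and in the end we significantly increase the number of training samples. We use the ds.concatenate() method to add these samples together.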
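The walkthrough above can be reproduced without TensorFlow. The sketch below mirrors the indexing of `split_sample` on a plain string; the name `split_sample_py` is an assumption, and it returns a list of tuples instead of a `tf.data.Dataset`.

```python
def split_sample_py(sample, sequence_length):
    """Plain-Python mirror of split_sample: return a list of (input, target) pairs."""
    pairs = [(sample[:sequence_length], sample[sequence_length])]
    for i in range(1, (len(sample) - 1) // 2):
        input_ = sample[i:i + sequence_length]
        target = sample[i + sequence_length]
        pairs.append((input_, target))
    return pairs


pairs = split_sample_py("python is a great pro", sequence_length=10)
for input_, target in pairs:
    print(repr(input_), "->", repr(target))
print(len(pairs), "pairs from one 21-character sample")
```

One sample of 2*sequence_length + 1 characters therefore yields sequence_length training pairs, which is where the increase in training data comes from.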
Now that we have constructed our samples, let's one-hot encode the inputs and the labels (targets):
def one_hot_samples(input_, target):
    # one-hot encode the inputs and the targets
    # Example:
    # if character 'd' is encoded as 3 and n_unique_chars = 5
    # result should be the vector: [0, 0, 0, 1, 0], since 'd' is the 4th character
    return tf.one_hot(input_, n_unique_chars), tf.one_hot(target, n_unique_chars)

# apply the encoding to every (input, target) pair
dataset = dataset.map(one_hot_samples)

# print first 2 samples
for element in dataset.take(2):
    print("Input:", ''.join([int2char[np.argmax(char_vector)] for char_vector in element[0].numpy()]))
    print("Target:", int2char[np.argmax(element[1].numpy())])
    print("Input shape:", element[0].shape)
    print("Target shape:", element[1].shape)
    print("="*50, "\n")

Input: project gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of an
Target: y
Input shape: (100, 39)
Target shape: (39,)

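So each input element has the shape (sequence length, vocabulary size); in this case there are 39 unique characters and the sequence length is 100. The output is a one-dimensional, one-hot encoded vector.
Note: If you use a different dataset and/or another character-filtering mechanism, you will see a different vocabulary size; each problem has its own domain. For example, I also used this to generate Python code, which has 92 unique characters, because I had to allow some of the punctuation that Python code needs.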
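The one-hot encoding and the resulting shapes described above can be illustrated with plain lists. This is a minimal sketch; `one_hot` and `one_hot_sequence` are hypothetical helpers, not the `tf.one_hot` call the tutorial uses.

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1 at position `index`."""
    vector = [0] * depth
    vector[index] = 1
    return vector


def one_hot_sequence(indices, depth):
    """One-hot encode every integer in a sequence."""
    return [one_hot(i, depth) for i in indices]


# 'd' encoded as 3 with a vocabulary of 5 characters
d_vector = one_hot(3, 5)
print(d_vector)  # [0, 0, 0, 1, 0]

# a sequence of 3 characters becomes a 3 x 5 grid: (sequence length, vocabulary size)
encoded_seq = one_hot_sequence([0, 3, 1], 5)
shape = (len(encoded_seq), len(encoded_seq[0]))
print(shape)  # (3, 5)
```

With the tutorial's settings the same logic gives an input of shape (100, 39) and a target of shape (39,).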
# repeat, shuffle and batch the dataset
ds = dataset.repeat().shuffle(1024).batch(BATCH_SIZE, drop_remainder=True)

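We set the optional drop_remainder argument to True so that leftover samples forming a batch smaller than BATCH_SIZE are dropped.
Building the Model
Now let's build the model. It has two LSTM layers, each with an arbitrarily chosen 256 LSTM units. Feel free to experiment with different model architectures.
The output layer is a fully connected layer with 39 units, one neuron per character, which gives the probability of each character being the next one.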
model = Sequential([
    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_unique_chars, activation="softmax"),
])
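The softmax activation on the output layer is what turns the raw scores of the final Dense layer into a probability per character. As a rough illustration, here is a pure-Python softmax applied to three made-up scores; the function and the numbers are assumptions, not values taken from the trained model.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
predicted_index = probs.index(max(probs))
print(probs)
print("most likely index:", predicted_index)
```

In the tutorial's model the vector has one entry per vocabulary character, and the index with the highest probability is mapped back to a character through int2char.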

We are using the Adam optimizer here; I suggest you experiment with different optimizers.
# define the model path
model_weights_path = f"results/{BASENAME}-{sequence_length}.h5"
model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
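The categorical cross-entropy loss passed to compile() compares the one-hot target with the predicted probabilities. For a one-hot target it reduces to the negative log of the probability assigned to the correct character. The sketch below computes it by hand for two made-up predictions; the helper and the numbers are illustrative assumptions.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a probability vector."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )


target = [0, 0, 1, 0]
# the model is fairly sure about the correct character
confident = categorical_crossentropy(target, [0.05, 0.05, 0.85, 0.05])
# the model spreads its probability evenly over all four characters
unsure = categorical_crossentropy(target, [0.25, 0.25, 0.25, 0.25])
print(confident, unsure)
```

A lower loss means the model puts more probability on the character that actually comes next.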

Training the Model
Now let's train the model:
# make results folder if it does not exist yet
if not os.path.isdir("results"):
    os.mkdir("results")
# train the model
model.fit(ds, steps_per_epoch=(len(encoded_text) - sequence_length) // BATCH_SIZE, epochs=EPOCHS)
# save the model
model.save(model_weights_path)

Train for 6473 steps
...
<SNIPPED>

Epoch 29/30
6473/6473 [==============================] - 486s 75ms/step - loss: 0.8728 - accuracy: 0.7509
Epoch 30/30
2576/6473 [==========>...................] - ETA: 4:56 - loss: 0.8063 - accuracy: 0.7678
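Because the dataset is repeated indefinitely, steps_per_epoch is what defines the length of an epoch. The expression used in the training call can be evaluated by hand with the figures printed earlier (154207 characters, sequence length 100, batch size 128). This is plain arithmetic, shown only as a sanity check.

```python
n_chars = 154207
sequence_length = 100
BATCH_SIZE = 128

# same expression as in the model.fit() call
steps_per_epoch = (n_chars - sequence_length) // BATCH_SIZE
samples_per_epoch = steps_per_epoch * BATCH_SIZE

print("steps per epoch:", steps_per_epoch)      # 1203
print("samples per epoch:", samples_per_epoch)  # 153984
```

The log above reports 6473 steps, so it presumably comes from a run with different settings (for example another dataset or batch size); expect a different step count with the parameters used here.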

After training, text can be generated from a separate script that reloads the saved vocabulary and weights:
import numpy as np
import pickle
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Activation
import os

sequence_length = 100
# dataset file path
FILE_PATH = "data/wonderland.txt"
# FILE_PATH = "data/python_code.py"
BASENAME = os.path.basename(FILE_PATH)

seed = "chapter xiii"

# load vocab dictionaries
char2int = pickle.load(open(f"{BASENAME}-char2int.pickle", "rb"))
int2char = pickle.load(open(f"{BASENAME}-int2char.pickle", "rb"))
vocab_size = len(char2int)

# building the model
model = Sequential([
    LSTM(256, input_shape=(sequence_length, vocab_size), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(vocab_size, activation="softmax"),
])

# load the optimal weights
model.load_weights(f"results/{BASENAME}-{sequence_length}.h5")

s = seed
n_chars = 400
# generate 400 characters
generated = ""
for i in tqdm.tqdm(range(n_chars), "Generating text"):
    # make the input sequence
    X = np.zeros((1, sequence_length, vocab_size))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char2int[char]] = 1
    # predict the next character
    predicted = model.predict(X, verbose=0)[0]
    # converting the vector to an integer
    next_index = np.argmax(predicted)
    # converting the integer to a character
    next_char = int2char[next_index]
    # add the character to results
    generated += next_char
    # shift seed and the predicted character
    seed = seed[1:] + next_char

print("Seed:", s)
print("Generated text:")
print(generated)

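We then feed this updated input sequence into the model to predict another character. Repeating this process N times generates a text of N characters.
Here is an interesting piece of generated text: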
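The generation loop can be exercised without a trained network by swapping the model for a stand-in. In the sketch below, `toy_predict` is a hypothetical replacement for model.predict that always favours the next letter of a three-character vocabulary; `build_input` and `generate` are illustrative helpers that follow the same left-padding and seed-shifting logic as the script above, using plain lists instead of NumPy arrays.

```python
def build_input(seed, sequence_length, char2int):
    """Left-pad with all-zero rows, then one-hot encode the seed in the last rows."""
    vocab_size = len(char2int)
    X = [[0] * vocab_size for _ in range(sequence_length)]
    offset = sequence_length - len(seed)
    for t, char in enumerate(seed):
        X[offset + t][char2int[char]] = 1
    return X


def generate(predict, seed, n_chars, sequence_length, char2int, int2char):
    """Repeatedly predict the most likely next character and shift the seed."""
    generated = ""
    for _ in range(n_chars):
        probs = predict(build_input(seed, sequence_length, char2int))
        next_char = int2char[probs.index(max(probs))]
        generated += next_char
        # drop the oldest character and append the new one
        seed = seed[1:] + next_char
    return generated


vocab = "abc"
char2int = {c: i for i, c in enumerate(vocab)}
int2char = {i: c for i, c in enumerate(vocab)}


def toy_predict(X):
    """Hypothetical stand-in for the model: favours the letter after the last input letter."""
    last = X[-1].index(1)
    probs = [0.1] * len(vocab)
    probs[(last + 1) % len(vocab)] = 0.8
    return probs


generated = generate(toy_predict, "ab", 5, sequence_length=4,
                     char2int=char2int, int2char=int2char)
print(generated)
```

Note that the seed keeps its original length throughout the loop, so the model only ever sees as many real characters as the seed contained; the rest of the input stays zero.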
Seed: chapter xiii
Generated Text:
ded of and alice as it go on and the court well you wont you wouldncopy thing there was not a long to growing anxiously any only a low every cant go on a litter which was proves of any only here and the things and the mort meding and the mort and alice was the things said to herself i cant remeran as if i can repeat eften to alice any of great offf its archive of and alice and a cancur as the mo

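Note that this is not limited to English text; you can use whatever type of text you want. In fact, you can even generate Python code once you have enough lines of code.
Conclusion
Great, we are done. Now you know how to:
  • Build RNNs in TensorFlow and Keras as generative models.
  • Clean text and build TensorFlow input pipelines using the tf.data API.
  • Train an LSTM network on text sequences.
  • Tune the performance of the model.
To improve the model further, you can:
  • Reduce the vocabulary size by removing rare characters.
  • Add more LSTM and Dropout layers with more LSTM units, or even add Bidirectional layers.
  • Tweak some hyperparameters, such as the batch size, the optimizer, and even sequence_length, and see which works best.
  • Train for more epochs.
  • Use a larger dataset.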
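As one possible take on the first suggestion, rare characters can be counted and filtered out before the vocabulary is built. The sketch below is an assumption about how that might be done, not part of the original tutorial; the helper name `drop_rare_chars` and the threshold are hypothetical.

```python
from collections import Counter


def drop_rare_chars(text, min_count):
    """Remove every character that occurs fewer than `min_count` times in the text."""
    counts = Counter(text)
    keep = {c for c, n in counts.items() if n >= min_count}
    return "".join(c for c in text if c in keep)


sample = "aaaa bbbb cccc z"
filtered = drop_rare_chars(sample, min_count=2)
print(repr(filtered))
print("vocabulary size:", len(set(sample)), "->", len(set(filtered)))
```

A smaller vocabulary shrinks both the one-hot input vectors and the output layer, at the cost of never generating the removed characters.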
