Deep Learning & Neural Networks | Transformer Explained, with a Walkthrough of the TensorFlow Code

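This article is my reading of Google's Transformer model, written down in the order in which I came to understand it.
It also walks through Kyubyong's TensorFlow implementation; the code is at https://github.com/Kyubyong/transformer
I will not describe the mechanics of the Transformer in detail here. If you are not familiar with it, read the paper "Attention Is All You Need" first, along with the reference blog posts listed at the end, which are all good explanations.
Layer Normalization
First, Layer Normalization (LN). It differs somewhat from Batch Normalization (BN). BN helps a model converge faster, but its drawbacks are also fairly obvious.
Drawbacks of BN:
1. BN depends heavily on the batch size. When the batch size is very small, BN performs poorly, and in many cases the batch size cannot be made large because GPU memory is limited. Working around this usually takes more involved techniques, such as CGBN in MegDet.
2. BN is a poor fit for networks that process sequential data, such as RNNs, which rules it out for a large class of models.
3. BN behaves differently in training and inference. Batch statistics are computed only during training; at inference, where inputs do not necessarily arrive in batches, the moving averages accumulated during training are used instead. This is not necessarily a drawback, but it is a characteristic of BN.
BN computes the mean and variance along the batch dimension, whereas LN computes them along the feature dimension of each individual sample. In other words, LN is roughly a "transposed" BN: it standardizes the outputs of one layer for a single sample. The figure below makes this clear: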

[Figure: Batch Normalization vs. Layer Normalization]

    import tensorflow as tf  # TensorFlow 1.x API

    def ln(inputs, epsilon=1e-8, scope="ln"):
        '''Applies layer normalization. See https://arxiv.org/abs/1607.06450.

        TensorFlow's batch normalization is usually built from tf.nn.moments and
        tf.nn.batch_normalization: moments returns the mean (first moment) and
        the variance (second central moment), which are then passed on. Here
        tf.nn.moments is used too, but over the last axis.

        inputs: A tensor with 2 or more dimensions, where the first dimension has `batch_size`.
        epsilon: A floating number. A very small number for preventing ZeroDivision Error.
        scope: Optional scope for `variable_scope`.

        Returns:
          A tensor with the same shape and data dtype as `inputs`.
        '''
        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
            inputs_shape = inputs.get_shape()
            params_shape = inputs_shape[-1:]

            # Statistics over the last (feature) axis of every sample
            mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
            beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
            gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
            normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
            outputs = gamma * normalized + beta

        return outputs
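To make the difference in normalization axes concrete, here is a minimal sketch in plain Python (no TensorFlow), assuming gamma = 1 and beta = 0. The helper `normalize` and the toy `batch` are made up for illustration and are not part of Kyubyong's code.

```python
import math

def normalize(values, epsilon=1e-8):
    """Shift to zero mean and scale to unit variance (gamma=1, beta=0)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

# A toy batch: 2 samples (rows), 3 features (columns).
batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]

# Layer normalization: statistics are taken over each row (one sample at a time).
layer_normed = [normalize(row) for row in batch]

# Batch normalization: statistics are taken over each column (across the batch).
columns = [list(col) for col in zip(*batch)]
batch_normed_columns = [normalize(col) for col in columns]
batch_normed = [list(row) for row in zip(*batch_normed_columns)]
```

Because the second row is just ten times the first, layer normalization maps both rows to almost the same values, whereas batch normalization compares the two samples within each column.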

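Mask
This part is important. The author's original masking code had some problems and was later revised. Many readers find this code confusing at first; stepping through it in a debugger helps.
A mask hides certain values so that they have no effect when the parameters are updated. The Transformer uses two kinds of mask: the padding mask and the sequence mask.
The padding mask is needed in every scaled dot-product attention, while the sequence mask is only used in the decoder's self-attention.
Padding Mask
Input sequences generally have to be padded to a common length N: shorter sequences are filled with zeros at the end up to length N, and sequences longer than N are truncated, keeping the first N items and discarding the rest. The attention mechanism should not attend to the zero-padded positions, so they need special handling. Concretely, a very large negative number (effectively negative infinity) is added to the scores at those positions, so that after softmax their weights are close to 0. The Transformer's padding mask is a tensor of Booleans, where false marks a position that must be masked.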
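The effect of the large negative number on softmax can be checked with a small plain-Python sketch. The helpers `softmax` and `mask_padded_keys` and the toy scores are assumptions made for illustration; only the constant -2 ** 32 + 1 is taken from the repository code.

```python
import math

PADDING_NUM = -2 ** 32 + 1  # the large negative constant used in the mask code

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_padded_keys(scores, key_is_padding):
    """Replace the score of every padded key position with PADDING_NUM."""
    return [PADDING_NUM if is_pad else s
            for s, is_pad in zip(scores, key_is_padding)]

# Attention scores of one query against 4 keys; the last 2 keys are padding.
scores = [2.0, 1.0, 0.0, 0.0]
key_is_padding = [False, False, True, True]

weights = softmax(mask_padded_keys(scores, key_is_padding))
```

The two padded positions end up with weight 0, and the remaining weights still sum to 1.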
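Sequence Mask
The sequence mask prevents the decoder from seeing future information. Because the Transformer is not an RNN, at time step t everything after t has to be hidden explicitly. This is done by building a triangular matrix in which every entry above the diagonal is 0 (so position t can only see positions up to and including t) and applying it to every sequence.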
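A minimal plain-Python sketch of such a mask follows. The function names `causal_mask` and `apply_mask` and the toy scores are hypothetical; the repository builds the same kind of matrix with TensorFlow ops.

```python
def causal_mask(length):
    """Lower-triangular matrix: row t has 1s for positions 0..t and 0s after."""
    return [[1 if key <= query else 0 for key in range(length)]
            for query in range(length)]

def apply_mask(scores, mask, padding_num=-2 ** 32 + 1):
    """Keep scores where mask is 1; overwrite the rest with a large negative."""
    return [[s if m == 1 else padding_num for s, m in zip(score_row, mask_row)]
            for score_row, mask_row in zip(scores, mask)]

mask = causal_mask(3)
scores = [[0.5, 0.1, 0.3],
          [0.2, 0.9, 0.4],
          [0.7, 0.6, 0.8]]
masked = apply_mask(scores, mask)
```

Row 0 keeps only its first score, while the last row keeps all three, so each position can attend to itself and to earlier positions only.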
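For the decoder's self-attention, the scaled dot-product attention needs both the padding mask and the sequence mask as attn_mask. Conceptually the two masks are added together to form attn_mask; in this code they are simply applied one after the other.
In all other cases, attn_mask is just the padding mask.
The code uses a few tf functions. tf.where() is a particularly useful one; its usage is explained here: https://blog.csdn.net/ustbbsy/article/details/79564828
Note that the type in ("f", "future", "right"): branch of this code builds the sequence mask from a lower-triangular matrix.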

    def mask(inputs, queries=None, keys=None, type=None):
        """Masks paddings on keys or queries to inputs
        inputs: 3d tensor. (N, T_q, T_k)
        queries: 3d tensor. (N, T_q, d)
        keys: 3d tensor. (N, T_k, d)

        e.g.,
        >> queries = tf.constant([[[1.],
                                   [2.],
                                   [0.]]], tf.float32)  # (1, 3, 1)
        >> keys = tf.constant([[[4.],
                                [0.]]], tf.float32)  # (1, 2, 1)
        >> inputs = tf.constant([[[4., 0.],
                                  [8., 0.],
                                  [0., 0.]]], tf.float32)
        >> mask(inputs, queries, keys, "key")
        array([[[ 4.0000000e+00, -4.2949673e+09],
                [ 8.0000000e+00, -4.2949673e+09],
                [ 0.0000000e+00, -4.2949673e+09]]], dtype=float32)
        >> inputs = tf.constant([[[1., 0.],
                                  [1., 0.],
                                  [1., 0.]]], tf.float32)
        >> mask(inputs, queries, keys, "query")
        array([[[1., 0.],
                [1., 0.],
                [0., 0.]]], dtype=float32)
        """
        padding_num = -2 ** 32 + 1
        if type in ("k", "key", "keys"):
            # Generate masks: 1 where the key vector is non-zero, 0 where it is padding
            masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))  # (N, T_k)
            masks = tf.expand_dims(masks, 1)  # (N, 1, T_k)
            masks = tf.tile(masks, [1, tf.shape(queries)[1], 1])  # (N, T_q, T_k)

            # Apply masks to inputs
            paddings = tf.ones_like(inputs) * padding_num
            outputs = tf.where(tf.equal(masks, 0), paddings, inputs)  # (N, T_q, T_k)
        elif type in ("q", "query", "queries"):
            # Generate masks
            masks = tf.sign(tf.reduce_sum(tf.abs(queries), axis=-1))  # (N, T_q)
            masks = tf.expand_dims(masks, -1)  # (N, T_q, 1)
            masks = tf.tile(masks, [1, 1, tf.shape(keys)[1]])  # (N, T_q, T_k)

            # Apply masks to inputs
            outputs = inputs*masks
        elif type in ("f", "future", "right"):
            diag_vals = tf.ones_like(inputs[0, :, :])  # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(inputs)[0], 1, 1])  # (N, T_q, T_k)

            paddings = tf.ones_like(masks) * padding_num
            outputs = tf.where(tf.equal(masks, 0), paddings, inputs)
        else:
            # Note: outputs is never assigned on this path, so the return below fails
            print("Check if you entered type correctly!")

        return outputs

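A few notes on the code. The if type in ("k", "key", "keys"): branch is the padding mask. Q is multiplied by the transpose of K, and the tail of the key sequence may be a long run of all-zero vectors (the embedding of our padding symbol, which we define as all zeros). For those positions the attention score is replaced by a very large negative number, -4.2949673e+09, so that the weight after softmax is practically zero.
The elif type in ("q", "query", "queries"): branch is similar: the end of the query sequence may also be a run of padding. For queries, however, there is no need to add a large negative number. Setting the padded rows to zero is enough, because the outputs go through key masking first, then softmax, and only then query masking.
The elif type in ("f", "future", "right"): branch is the sequence mask used in the decoder's self-attention. At each step, the i-th token can only attend to tokens at or before position i, because it is not supposed to see later words. The author does this with a lower-triangular matrix, which is rather neat. I briefly describe each variable here: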

[Figure: the variables in the future-mask branch]
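Context-Attention
This is what the paper calls Encoder-Decoder Attention: attention between two different sequences, as opposed to self-attention, where everything comes from the same sequence. There are many ways to compute context attention; the one used here is scaled dot-product, in which the similarity between query and key determines the weight distribution over the values.
In fact, this code is the core of the QKV formula used in self-attention: both Encoder-Decoder Attention and Self Attention use the scaled dot-product method shown here.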

    def scaled_dot_product_attention(Q, K, V,
                                     causality=False, dropout_rate=0.,
                                     training=True,
                                     scope="scaled_dot_product_attention"):
        '''See 3.2.1: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        Q: Packed queries. 3d tensor. [N, T_q, d_k].
        K: Packed keys. 3d tensor. [N, T_k, d_k].
        V: Packed values. 3d tensor. [N, T_k, d_v].
        causality: If True, applies masking for future blinding
        dropout_rate: A floating point number of [0, 1].
        training: boolean for controlling dropout
        scope: Optional scope for `variable_scope`.
        '''
        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
            d_k = Q.get_shape().as_list()[-1]

            # dot product
            outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))  # (N, T_q, T_k)

            # scale
            outputs /= d_k ** 0.5

            # key masking
            outputs = mask(outputs, Q, K, type="key")

            # causality or future blinding masking
            if causality:
                outputs = mask(outputs, type="future")

            # softmax
            outputs = tf.nn.softmax(outputs)
            attention = tf.transpose(outputs, [0, 2, 1])
            tf.summary.image("attention", tf.expand_dims(attention[:1], -1))

            # query masking
            outputs = mask(outputs, Q, K, type="query")

            # dropout
            outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=training)

            # weighted sum (context vectors)
            outputs = tf.matmul(outputs, V)  # (N, T_q, d_v)

        return outputs
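Stripped of masking and dropout, the formula softmax(Q K^T / sqrt(d_k)) V can be written in a few lines of plain Python. This is only a sketch for a single sequence stored as lists of lists; the toy Q, K and V are invented for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: T_q x d_k, K: T_k x d_k, V: T_k x d_v (one sequence, no batch axis)."""
    d_k = len(Q[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        context = [sum(w * v[j] for w, v in zip(weights, V))
                   for j in range(len(V[0]))]
        outputs.append(context)
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
context = scaled_dot_product_attention(Q, K, V)
```

The query is closer to the first key, so the resulting context vector leans towards the first value vector.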

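One question about this code:

    outputs = tf.nn.softmax(outputs)
    attention = tf.transpose(outputs, [0, 2, 1])
    tf.summary.image("attention", tf.expand_dims(attention[:1], -1))

What is this for, and why the transpose, which turns (N, T_q, T_k) into (N, T_k, T_q)? Note that attention is not used anywhere else in the computation. It is only passed to tf.summary.image, so the transpose just changes the orientation of the attention map that is logged for the first sample.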
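Multi-head attention
Multi-head self-attention is the core of the Transformer: the QKV formula above is computed h times and the h results are combined into one representation. In the paper, h is 8.
The code first produces the Q, K and V vectors, then splits them into h heads, and then calls the scaled dot-product function above.
The code also shows that the 8 attention heads are computed independently and then concatenated, and that the attention layer is followed by a residual connection and layer normalization.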

    def multihead_attention(queries, keys, values,
                            num_heads=8,
                            dropout_rate=0,
                            training=True,
                            causality=False,
                            scope="multihead_attention"):
        '''Applies multihead attention. See 3.2.2
        queries: A 3d tensor with shape of [N, T_q, d_model].
        keys: A 3d tensor with shape of [N, T_k, d_model].
        values: A 3d tensor with shape of [N, T_k, d_model].
        num_heads: An int. Number of heads.
        dropout_rate: A floating point number.
        training: Boolean. Controller of mechanism for dropout.
        causality: Boolean. If true, units that reference the future are masked.
        scope: Optional scope for `variable_scope`.

        Section 3.2.2 of the paper finds it beneficial to linearly project the
        queries, keys and values h times, to d_k, d_k and d_v dimensions.
        Attention is run in parallel on each projected version, yielding
        d_v-dimensional outputs. In the paper these are concatenated and
        projected once more to give the final values.

        Returns
          A 3d tensor with shape of (N, T_q, C)
        '''
        d_model = queries.get_shape().as_list()[-1]
        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
            # Linear projections
            Q = tf.layers.dense(queries, d_model, use_bias=False)  # (N, T_q, d_model)
            K = tf.layers.dense(keys, d_model, use_bias=False)  # (N, T_k, d_model)
            V = tf.layers.dense(values, d_model, use_bias=False)  # (N, T_k, d_model)

            # Split and concat
            Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, d_model/h)
            K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, d_model/h)
            V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, d_model/h)

            # Attention
            outputs = scaled_dot_product_attention(Q_, K_, V_, causality, dropout_rate, training)

            # Restore shape
            outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, d_model)

            # Residual connection
            outputs += queries

            # Normalize
            outputs = ln(outputs)

        return outputs
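The "split and concat" and "restore shape" steps are easy to misread, so here is a plain-Python sketch of the same reshuffling on nested lists. `split_heads` and `merge_heads` are hypothetical helper names; they are meant to mirror what tf.split followed by tf.concat does in the code above, not to replace it.

```python
def split_heads(x, num_heads):
    """x: N x T x d_model  ->  (num_heads*N) x T x (d_model/num_heads).

    The feature axis is cut into num_heads chunks, and the chunks are
    stacked along the batch axis (all samples of head 0, then head 1, ...).
    """
    d_model = len(x[0][0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    size = d_model // num_heads
    return [[token[h * size:(h + 1) * size] for token in sample]
            for h in range(num_heads)
            for sample in x]

def merge_heads(x, num_heads):
    """Inverse of split_heads: (num_heads*N) x T x d_head -> N x T x d_model."""
    n = len(x) // num_heads
    return [[sum((x[h * n + i][t] for h in range(num_heads)), [])
             for t in range(len(x[0]))]
            for i in range(n)]

# One sample, two tokens, d_model = 4, split into 2 heads of size 2.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8]]]
heads = split_heads(x, num_heads=2)
restored = merge_heads(heads, num_heads=2)
```

Stacking the heads on the batch axis lets a single call to the attention function process all heads at once, since the heads never interact inside it.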

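One remark: all attention here is computed with the scaled dot-product method. For self-attention, Q = K = V; for decoder-encoder attention, Q = decoder_input and K = V = memory.
Positional Embedding
So far, the Transformer architecture has no way of capturing the order of the sequence. This information matters a great deal: without it, we might get all the right words but be unable to arrange them into a meaningful sentence. The model therefore encodes the position of each token in the sequence. The paper uses a sine encoding on the even dimensions of the encoding vector and a cosine encoding on the odd dimensions.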

[Figure: positional encoding]
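For reference, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A dependency-free sketch of that table, with a hypothetical function name and toy sizes, could look like this:

```python
import math

def positional_encoding(maxlen, d_model):
    """Return a maxlen x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(maxlen):
        row = []
        for i in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = positional_encoding(maxlen=4, d_model=6)
```

Position 0 is encoded as alternating 0 and 1 (sin 0 and cos 0), and every later position gets a distinct pattern.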
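One point about the code:

    N, T = tf.shape(inputs)[0], tf.shape(inputs)[1]
    position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])  # (N, T)
    outputs = tf.nn.embedding_lookup(position_enc, position_ind)

Why can the code simply call tf.range() and then look up the position embedding table once it is built? Because the tokens of an input sentence already arrive in the order 0, 1, 2, ..., T-1, so the position indices are exactly that range.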

    import numpy as np

    def positional_encoding(inputs,
                            maxlen,
                            masking=True,
                            scope="positional_encoding"):
        '''Sinusoidal Positional_Encoding. See 3.5
        inputs: 3d tensor. (N, T, E)
        maxlen: scalar. Must be >= T
        masking: Boolean. If True, padding positions are set to zeros.
        scope: Optional scope for `variable_scope`.

        Since the model has no recurrence and no convolution, some information
        about absolute position has to be added so that the model knows the
        order of the tokens. The positional encoding has the same dimension as
        the embedding, so the two can be summed.

        returns
          3d tensor that has the same shape as inputs.
        '''
        E = inputs.get_shape().as_list()[-1]  # static
        N, T = tf.shape(inputs)[0], tf.shape(inputs)[1]  # dynamic
        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
            # position indices
            position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])  # (N, T)

            # First part of the PE function: sin and cos argument
            position_enc = np.array([
                [pos / np.power(10000, (i-i%2)/E) for i in range(E)]
                for pos in range(maxlen)])

            # Second part, apply the sine to even columns and the cosine to odd ones.
            position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
            position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
            position_enc = tf.convert_to_tensor(position_enc, tf.float32)  # (maxlen, E)

            # lookup
            outputs = tf.nn.embedding_lookup(position_enc, position_ind)

            # masks
            if masking:
                outputs = tf.where(tf.equal(inputs, 0), inputs, outputs)

            return tf.to_float(outputs)

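Other small modules
The remaining modules are simple. The feed-forward network, for example, is two fully connected layers followed by a residual connection and layer normalization.
Label smoothing is also used. Put simply, a ground-truth label that was 1 becomes, say, 0.9333, and a label that was 0 becomes 0.0333. This is a classic smoothing technique.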
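The numbers 0.9333 and 0.0333 come from epsilon = 0.1 and a vocabulary of V = 3 classes: (1 - 0.1) * 1 + 0.1 / 3 and (1 - 0.1) * 0 + 0.1 / 3. A plain-Python sketch of the same arithmetic on a single one-hot vector:

```python
def label_smoothing(one_hot, epsilon=0.1):
    """Blend a one-hot vector with a uniform distribution over its V classes."""
    V = len(one_hot)
    return [(1 - epsilon) * p + epsilon / V for p in one_hot]

smoothed = label_smoothing([0.0, 0.0, 1.0])
```

The smoothed vector still sums to 1, so it remains a valid target distribution for the cross-entropy loss.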
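Also worth noting is the Noam scheme used for the learning rate, which warms up and then decays. I had not come across it before and there is not much material about it online, so I wrote out the formula myself:

[Figure: Noam learning-rate formula]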
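Read off the noam_scheme function in the repository, the schedule is lr = init_lr * warmup_steps^0.5 * min(step * warmup_steps^-1.5, step^-0.5). The sketch below evaluates it in plain Python; the name `noam_lr` and the value of `init_lr` are assumptions chosen for illustration.

```python
def noam_lr(init_lr, step, warmup_steps=4000):
    """Learning rate at a 1-based step under the Noam schedule."""
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5,
                                               step ** -0.5)

init_lr = 0.0003  # hypothetical value, for illustration only
for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(init_lr, step))
```

The rate grows linearly until step equals warmup_steps, where it reaches init_lr, and then decays with the inverse square root of the step.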

    def ff(inputs, num_units, scope="positionwise_feedforward"):
        '''position-wise feed forward net. See 3.3

        inputs: A 3d tensor with shape of [N, T, C].
        num_units: A list of two integers.
        scope: Optional scope for `variable_scope`.

        Returns:
          A 3d tensor with the same shape and dtype as inputs
        '''
        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
            # Inner layer
            outputs = tf.layers.dense(inputs, num_units[0], activation=tf.nn.relu)

            # Outer layer
            outputs = tf.layers.dense(outputs, num_units[1])

            # Residual connection
            outputs += inputs

            # Normalize
            outputs = ln(outputs)

        return outputs

    def label_smoothing(inputs, epsilon=0.1):
        '''Applies label smoothing. See 5.4 and https://arxiv.org/abs/1512.00567.
        inputs: 3d tensor. [N, T, V], where V is the number of vocabulary.
        epsilon: Smoothing rate.

        For example,

        ```
        import tensorflow as tf
        inputs = tf.convert_to_tensor([[[0, 0, 1],
                                        [0, 1, 0],
                                        [1, 0, 0]],

                                       [[1, 0, 0],
                                        [1, 0, 0],
                                        [0, 1, 0]]], tf.float32)

        outputs = label_smoothing(inputs)

        with tf.Session() as sess:
            print(sess.run([outputs]))

        >>
        [array([[[ 0.03333334,  0.03333334,  0.93333334],
                 [ 0.03333334,  0.93333334,  0.03333334],
                 [ 0.93333334,  0.03333334,  0.03333334]],

                [[ 0.93333334,  0.03333334,  0.03333334],
                 [ 0.93333334,  0.03333334,  0.03333334],
                 [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]
        ```
        '''
        V = inputs.get_shape().as_list()[-1]  # number of channels
        return ((1-epsilon) * inputs) + (epsilon / V)

    def noam_scheme(init_lr, global_step, warmup_steps=4000.):
        '''Noam scheme learning rate decay
        init_lr: initial learning rate. scalar.
        global_step: scalar.
        warmup_steps: scalar. During warmup_steps, learning rate increases
            until it reaches init_lr.
        '''
        step = tf.cast(global_step + 1, dtype=tf.float32)
        return init_lr * warmup_steps ** 0.5 * tf.minimum(step * warmup_steps ** -1.5, step ** -0.5)

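That concludes the modules written by the author. Next come the utils code, the data_loader code, and the model code that ties the modules together.
utils code
1. Compute num_batch: total_num divided by batch_size, rounded up (floor division, plus 1 if there is a remainder).
2. Convert an int32 tensor to a string tensor.
The point worth describing here is tf.py_func. It wraps an ordinary Python function as an op in the graph: the function body runs as plain Python on the values fed to it at run time, so it can do things that graph ops cannot, such as looking tokens up in a Python dictionary.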

    def convert_idx_to_token_tensor(inputs, idx2token):
        '''Converts int32 tensor to string tensor.
        inputs: 1d int32 tensor. indices.
        idx2token: dictionary

        Returns
          1d string tensor.
        '''
        def my_func(inputs):
            return " ".join(idx2token[elem] for elem in inputs)

        return tf.py_func(my_func, [inputs], tf.string)

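3. postprocess handles the output after translation. It takes the list of predicted id sequences and the id2token table, and uses table lookup to turn each id sequence into a token sequence, producing a readable sentence. Note that the implementation uses BPE (byte pair encoding) to shrink the vocabulary, so the function contains a replacement step specifically for undoing the BPE segmentation. For Chinese data this step would have to be changed, since in my view BPE-style word-piece segmentation does not carry over directly to Chinese.
4. Save the hyperparameters.
5. Load the hyperparameters and overwrite the parser object.
6. save_variable_specs saves information about the variables, including their names, shapes and the total number of parameters.
7. get_hypotheses produces the predicted sequences. Together with postprocess above, it generates num_samples readable natural-language outputs.
8. calc_bleu computes the BLEU score.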
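Items 1 and 3 of the list above can be sketched in a few lines of plain Python. This is a rough illustration rather than the repository's exact code: the function name `ids_to_sentence`, the toy vocabulary, and the "[END]" and "_" markers are all hypothetical, and the real end token and subword marker depend on how the vocabulary was built.

```python
def calc_num_batches(total_num, batch_size):
    """Number of batches needed to cover total_num samples (ceiling division)."""
    return total_num // batch_size + int(total_num % batch_size != 0)

def ids_to_sentence(ids, idx2token, end_token, subword_marker):
    """Look up each id, cut at the end token, and undo subword segmentation."""
    text = "".join(idx2token[i] for i in ids)
    text = text.split(end_token)[0]
    return text.replace(subword_marker, " ").strip()

# Hypothetical toy vocabulary, for illustration only.
idx2token = {0: "[PAD]", 1: "[END]", 2: "_hel", 3: "lo", 4: "_world"}
sentence = ids_to_sentence([2, 3, 4, 1, 0, 0], idx2token,
                           end_token="[END]", subword_marker="_")
```

With 900 samples and a batch size of 128, `calc_num_batches` gives 8, which is the figure used in the BLEU discussion later in this article.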
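Data loading code
1. Load the vocabulary. param vocab_fpath: string, path to the vocabulary file. Ids 0, 1, 2 and 3 are the special tokens. return: two dictionaries, one mapping id->token and one mapping token->id.
2. Load the data with load_data.

It loads the source and target data and filters out overly long sentences. Note that they are filtered out: any sentence longer than maxlen is simply dropped and never loaded.

    :param fpath1: path to the source-language file
    :param fpath2: path to the target-language file
    :param maxlen1: maximum length of a source sentence
    :param maxlen2: maximum length of a target sentence

3. The encode function converts a string into numbers. The input is a string, which is split on whitespace. For the source language, an end-of-sentence token is appended to each sentence; for the target language, a start-of-sentence token is prepended and an end-of-sentence token is appended. The tokens are then mapped to a sequence of ids. For Chinese this obviously has to change, depending on whether the input is character-level or word-level.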

    def encode(inp, type, dict):
        '''Converts string to number. Used for `generator_fn`.
        inp: 1d byte array.
        type: "x" (source side) or "y" (target side)
        dict: token2idx dictionary

        Returns
          list of numbers
        '''
        inp_str = inp.decode("utf-8")
        # Each "" below stood for a special-token string that is missing from this listing.
        if type == "x":
            tokens = inp_str.split() + [""]  # source: append the end-of-sentence token
        else:
            tokens = [""] + inp_str.split() + [""]  # target: start token, sentence, end token

        x = [dict.get(t, dict[""]) for t in tokens]  # tokens not in the vocabulary get a fallback id
        return x

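4. generator_fn produces the training and evaluation data. Briefly: for each pair sent1, sent2 (source sentence, target sentence), sent1 is turned into x and sent2 into y by the encode function above. The decoder input decoder_input is then y[:-1] and the expected output y is y[1:]. What does that mean? It is the same as with an RNN decoder: the decoder is fed the first N-1 tokens and is expected to output the 2nd through the N-th token, which is also N-1 tokens.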

    def generator_fn(sents1, sents2, vocab_fpath):
        '''Generates training / evaluation data
        sents1: list of source sents
        sents2: list of target sents
        vocab_fpath: string. vocabulary file path.

        yields
        xs: tuple of
            x: list of source token ids in a sent
            x_seqlen: int. sequence length of x
            sent1: str. raw source (=input) sentence
        labels: tuple of
            decoder_input: list of encoded decoder inputs
            y: list of target token ids in a sent
            y_seqlen: int. sequence length of y
            sent2: str. target sentence
        '''
        token2idx, _ = load_vocab(vocab_fpath)
        for sent1, sent2 in zip(sents1, sents2):
            x = encode(sent1, "x", token2idx)
            y = encode(sent2, "y", token2idx)
            decoder_input, y = y[:-1], y[1:]

            x_seqlen, y_seqlen = len(x), len(y)
            yield (x, x_seqlen, sent1), (decoder_input, y, y_seqlen, sent2)

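5. input_fn produces the batched data. This code is worth studying: it reads the data with tf.data.Dataset.from_generator, which keeps the input pipeline separate from the model's computation graph. Dataset, as the newer API, is also more efficient than the older feed_dict approach. For a simple introduction to Dataset, and to some of the APIs this code uses, see these blog posts:
https://blog.csdn.net/googler_offer/article/details/89929657
https://blog.csdn.net/qq_16234613/article/details/81703228
https://blog.csdn.net/Eartha1995/article/details/84930492
One thing needs particular attention here. The function calls repeat() first and only then forms the batches. As a result, when the number of samples is not a multiple of batch_size, the batch at the end of an epoch is filled up with samples that already appeared in earlier batches. This may affect how the loss is evaluated. So what does the author do? Look at his loss formula:

    loss = tf.reduce_sum(ce * nonpadding) / (tf.reduce_sum(nonpadding) + 1e-7)

The loss keeps the cross-entropy of all non-padding positions, sums it up, and divides by the number of non-padding tokens in the batch. It is not divided by batch_size; the normalization is over tokens, across the whole batch. This goes together with calling repeat() before batching, which makes every batch contain the same number of samples. The consequence is:
The training and validation losses are slightly off (only slightly), because a few samples are counted twice. The test set, however, is not measured by loss but by BLEU. Suppose the test set has 900 samples and batch size = 128: the pipeline then produces 1024 samples, but the code keeps only the first 900, writes out the generated results, and then computes BLEU, so that is fine.
However, if you move repeat() so that it comes after batching, the last batch of each epoch will be smaller than the others. A loss that is summed per batch would then be naturally smaller for that batch, so make sure the loss is normalized (per token as above, or by the actual batch size) before comparing batches.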

    def input_fn(sents1, sents2, vocab_fpath, batch_size, shuffle=False):
        '''Batchify data
        sents1: list of source sents
        sents2: list of target sents
        vocab_fpath: string. vocabulary file path.
        batch_size: scalar
        shuffle: boolean

        Returns
        xs: tuple of
            x: int32 tensor. (N, T1)
            x_seqlens: int32 tensor. (N,)
            sents1: str tensor. (N,)
        ys: tuple of
            decoder_input: int32 tensor. (N, T2)
            y: int32 tensor. (N, T2)
            y_seqlen: int32 tensor. (N, )
            sents2: str tensor. (N,)
        '''
        shapes = (([None], (), ()),
                  ([None], [None], (), ()))
        types = ((tf.int32, tf.int32, tf.string),
                 (tf.int32, tf.int32, tf.int32, tf.string))
        paddings = ((0, 0, ''),
                    (0, 0, 0, ''))

        dataset = tf.data.Dataset.from_generator(
            generator_fn,
            output_shapes=shapes,
            output_types=types,
            args=(sents1, sents2, vocab_fpath))  # <- arguments for generator_fn. converted to np string arrays

        if shuffle:  # for training
            dataset = dataset.shuffle(128*batch_size)

        dataset = dataset.repeat()  # iterate forever
        dataset = dataset.padded_batch(batch_size, shapes, paddings).prefetch(1)

        return dataset

6. get_batch fetches the batched data.
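The batching behaviour discussed under item 5 can be imitated without TensorFlow. The sketch below uses itertools as a stand-in for the Dataset API; both helper names are hypothetical, and the emulation covers only the order of repeat and batch, not padding or shuffling.

```python
from itertools import cycle, islice

def batches_repeat_then_batch(data, batch_size, num_batches):
    """Emulate dataset.repeat().batch(batch_size): batches may cross epochs."""
    stream = cycle(data)
    return [list(islice(stream, batch_size)) for _ in range(num_batches)]

def batches_batch_only(data, batch_size):
    """Emulate dataset.batch(batch_size) on one epoch: last batch may be short."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

data = list(range(10))  # 10 samples with ids 0..9
wrapped = batches_repeat_then_batch(data, batch_size=4, num_batches=3)
plain = batches_batch_only(data, batch_size=4)
```

With repeat first, the third batch is topped up with samples 0 and 1 from the next pass; without it, the third batch simply holds the two remaining samples.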
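Putting the model together
model.py is the model code. It is fairly short, because the modules it needs are already defined in modules.py.
Note the difference between tf.nn.dropout and tf.layers.dropout: https://blog.csdn.net/Bruce_Wang02/article/details/81036796
Another point is that all the embedded input vectors are scaled by a constant factor, see

    dec *= self.hp.d_model ** 0.5

Every dimension of the vectors is multiplied by the square root of d_model. This matches section 3.4 of the paper, which says the embedding weights are multiplied by sqrt(d_model), but I do not yet know the reasoning behind it, so I leave a placeholder here.
One more point: logits = tf.einsum('ntd,dk->ntk', dec, weights). For the usage of tf.einsum, there is a short description here: https://blog.csdn.net/qq_35203425/article/details/81560118
And a detailed one here: https://www.jqr.com/article/000481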
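For readers who find the einsum notation opaque: 'ntd,dk->ntk' multiplies every d-dimensional vector at batch index n and time step t by a d x k matrix. A plain-Python equivalent, with a hypothetical function name and toy inputs, is:

```python
def project_to_vocab(dec, weights):
    """Plain-Python equivalent of einsum('ntd,dk->ntk', dec, weights).

    dec: N x T x D, weights: D x K. For every (n, t) the D-dimensional
    vector is multiplied by the D x K matrix, giving K scores.
    """
    K = len(weights[0])
    return [[[sum(vec[d] * weights[d][k] for d in range(len(vec)))
              for k in range(K)]
             for vec in sample]
            for sample in dec]

dec = [[[1.0, 2.0], [0.0, 1.0]]]   # N=1, T=2, D=2
weights = [[1.0, 0.0, 1.0],        # D=2, K=3
           [0.0, 1.0, 1.0]]
logits = project_to_vocab(dec, weights)
```

In the model, K is the vocabulary size, so each time step ends up with one score per vocabulary entry.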
Loss function: loss = tf.reduce_sum(ce * nonpadding) / (tf.reduce_sum(nonpadding) + 1e-7)
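The same normalization can be spelled out in plain Python. This sketch assumes that id 0 marks padding, and the function name `masked_mean_loss` and the toy numbers are invented for illustration.

```python
def masked_mean_loss(ce, targets, pad_id=0):
    """Average per-token cross-entropy over non-padding positions only.

    ce: N x T per-token losses, targets: N x T token ids.
    pad_id=0 is an assumption made for this sketch.
    """
    total, count = 0.0, 0.0
    for ce_row, target_row in zip(ce, targets):
        for loss, token in zip(ce_row, target_row):
            nonpadding = 1.0 if token != pad_id else 0.0
            total += loss * nonpadding
            count += nonpadding
    return total / (count + 1e-7)

targets = [[5, 7, 0], [3, 0, 0]]          # 0 marks padding
ce = [[1.0, 3.0, 9.0], [2.0, 9.0, 9.0]]   # losses at padded slots are ignored
loss = masked_mean_loss(ce, targets)
```

Only the three real tokens contribute, so the result is (1 + 3 + 2) / 3, no matter how much padding the batch contains.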

    import logging

    import tensorflow as tf
    from tqdm import tqdm
    # load_vocab, get_token_embeddings, positional_encoding, multihead_attention,
    # ff, label_smoothing, noam_scheme and convert_idx_to_token_tensor come from
    # the repository's other modules.

    class Transformer:
        '''
        xs: tuple of
            x: int32 tensor. (N, T1)
            x_seqlens: int32 tensor. (N,)
            sents1: str tensor. (N,)
        ys: tuple of
            decoder_input: int32 tensor. (N, T2)
            y: int32 tensor. (N, T2)
            y_seqlen: int32 tensor. (N, )
            sents2: str tensor. (N,)
        training: boolean.
        '''
        def __init__(self, hp):
            self.hp = hp
            self.token2idx, self.idx2token = load_vocab(hp.vocab)
            self.embeddings = get_token_embeddings(self.hp.vocab_size, self.hp.d_model, zero_pad=True)

        def encode(self, xs, training=True):
            '''
            Returns
            memory: encoder outputs. (N, T1, d_model)
            '''
            with tf.variable_scope("encoder", reuse=tf.AUTO_REUSE):
                x, seqlens, sents1 = xs

                # embedding
                enc = tf.nn.embedding_lookup(self.embeddings, x)  # (N, T1, d_model)
                enc *= self.hp.d_model**0.5  # scale

                enc += positional_encoding(enc, self.hp.maxlen1)
                enc = tf.layers.dropout(enc, self.hp.dropout_rate, training=training)

                ## Blocks
                for i in range(self.hp.num_blocks):
                    with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                        # self-attention
                        enc = multihead_attention(queries=enc,
                                                  keys=enc,
                                                  values=enc,
                                                  num_heads=self.hp.num_heads,
                                                  dropout_rate=self.hp.dropout_rate,
                                                  training=training,
                                                  causality=False)
                        # feed forward
                        enc = ff(enc, num_units=[self.hp.d_ff, self.hp.d_model])
            memory = enc
            return memory, sents1

        def decode(self, ys, memory, training=True):
            '''
            memory: encoder outputs. (N, T1, d_model)

            Returns
            logits: (N, T2, V). float32.
            y_hat: (N, T2). int32
            y: (N, T2). int32
            sents2: (N,). string.
            '''
            with tf.variable_scope("decoder", reuse=tf.AUTO_REUSE):
                decoder_inputs, y, seqlens, sents2 = ys

                # embedding
                dec = tf.nn.embedding_lookup(self.embeddings, decoder_inputs)  # (N, T2, d_model)
                dec *= self.hp.d_model ** 0.5  # scale

                dec += positional_encoding(dec, self.hp.maxlen2)
                dec = tf.layers.dropout(dec, self.hp.dropout_rate, training=training)

                # Blocks
                for i in range(self.hp.num_blocks):
                    with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                        # Masked self-attention (Note that causality is True at this time)
                        dec = multihead_attention(queries=dec,
                                                  keys=dec,
                                                  values=dec,
                                                  num_heads=self.hp.num_heads,
                                                  dropout_rate=self.hp.dropout_rate,
                                                  training=training,
                                                  causality=True,
                                                  scope="self_attention")

                        # Vanilla attention
                        dec = multihead_attention(queries=dec,
                                                  keys=memory,
                                                  values=memory,
                                                  num_heads=self.hp.num_heads,
                                                  dropout_rate=self.hp.dropout_rate,
                                                  training=training,
                                                  causality=False,
                                                  scope="vanilla_attention")
                        ### Feed Forward
                        dec = ff(dec, num_units=[self.hp.d_ff, self.hp.d_model])

            # Final linear projection (embedding weights are shared)
            weights = tf.transpose(self.embeddings)  # (d_model, vocab_size)
            logits = tf.einsum('ntd,dk->ntk', dec, weights)  # (N, T2, vocab_size)
            y_hat = tf.to_int32(tf.argmax(logits, axis=-1))

            return logits, y_hat, y, sents2

        def train(self, xs, ys):
            '''
            Returns
            loss: scalar.
            train_op: training operation
            global_step: scalar.
            summaries: training summary node
            '''
            # forward
            memory, sents1 = self.encode(xs)
            logits, preds, y, sents2 = self.decode(ys, memory)

            # train scheme
            y_ = label_smoothing(tf.one_hot(y, depth=self.hp.vocab_size))
            ce = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=y_)
            # The "" key stood for the padding token (id 0); its literal string is missing from this listing.
            nonpadding = tf.to_float(tf.not_equal(y, self.token2idx[""]))  # 0:
            # quick test ********************************************
            print(tf.reduce_sum(nonpadding))
            # ********************************************************
            loss = tf.reduce_sum(ce * nonpadding) / (tf.reduce_sum(nonpadding) + 1e-7)

            global_step = tf.train.get_or_create_global_step()
            lr = noam_scheme(self.hp.lr, global_step, self.hp.warmup_steps)
            optimizer = tf.train.AdamOptimizer(lr)
            train_op = optimizer.minimize(loss, global_step=global_step)

            tf.summary.scalar('lr', lr)
            tf.summary.scalar("loss", loss)
            tf.summary.scalar("global_step", global_step)

            summaries = tf.summary.merge_all()

            return loss, train_op, global_step, summaries

        def eval(self, xs, ys):
            '''Predicts autoregressively
            At inference, input ys is ignored.
            Returns
            y_hat: (N, T2)
            '''
            decoder_inputs, y, y_seqlen, sents2 = ys

            # The "" keys below stood for special tokens whose literal strings are missing from this listing.
            decoder_inputs = tf.ones((tf.shape(xs[0])[0], 1), tf.int32) * self.token2idx[""]
            ys = (decoder_inputs, y, y_seqlen, sents2)

            memory, sents1 = self.encode(xs, False)

            logging.info("Inference graph is being built. Please be patient.")
            for _ in tqdm(range(self.hp.maxlen2)):
                logits, y_hat, y, sents2 = self.decode(ys, memory, False)
                if tf.reduce_sum(y_hat, 1) == self.token2idx[""]: break

                _decoder_inputs = tf.concat((decoder_inputs, y_hat), 1)
                ys = (_decoder_inputs, y, y_seqlen, sents2)

            # monitor a random sample
            n = tf.random_uniform((), 0, tf.shape(y_hat)[0]-1, tf.int32)
            sent1 = sents1[n]
            pred = convert_idx_to_token_tensor(y_hat[n], self.idx2token)
            sent2 = sents2[n]

            tf.summary.text("sent1", sent1)
            tf.summary.text("pred", pred)
            tf.summary.text("sent2", sent2)
            summaries = tf.summary.merge_all()

            return y_hat, summaries

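A few more points are worth noting. In train(), the decoder is fed its whole input once, and the lower-triangular matrix takes care of the sequence mask for every position. In eval(), by contrast, the input is fed step by step: if the sequence length is 100, the decoder runs 100 times. At first decoder_inputs contains only the start symbol, and each iteration adds one more token. This is done so that eval() can also be used for inference.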
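The contrast between the two modes can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the repository's eval(): `next_token_fn` is a stand-in for a full decoder pass, the ids 2 and 3 for the start and end tokens are made up, and the dummy model merely replays a fixed sentence.

```python
def teacher_forcing_pair(y):
    """Split an encoded target into decoder input and expected output."""
    return y[:-1], y[1:]

def greedy_decode(next_token_fn, start_id, end_id, maxlen):
    """Grow the decoder input one token at a time.

    next_token_fn receives the tokens generated so far and returns the
    id of the next token.
    """
    decoder_input = [start_id]
    for _ in range(maxlen):
        token = next_token_fn(decoder_input)
        decoder_input.append(token)
        if token == end_id:
            break
    return decoder_input[1:]

# Hypothetical ids: 2 = start, 3 = end.
decoder_input, expected = teacher_forcing_pair([2, 11, 12, 13, 3])

# A dummy "model" that replays a fixed sentence, to make the loop runnable.
reference = [11, 12, 13, 3]
generated = greedy_decode(lambda prefix: reference[len(prefix) - 1],
                          start_id=2, end_id=3, maxlen=10)
```

During training the whole shifted pair is available up front, so one masked pass is enough; at inference each new token depends on the ones generated before it, which is why the decoder has to be run repeatedly.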
Open questions (unresolved for now):
1. Why do the dense layers that produce Q, K and V not use a bias (use_bias=False)?
References
https://blog.csdn.net/u012526436/article/details/86295971
https://www.jianshu.com/p/6670f775625f
https://blog.csdn.net/ustbbsy/article/details/79564828
https://blog.csdn.net/googler_offer/article/details/89929657
https://blog.csdn.net/qq_16234613/article/details/81703228
https://blog.csdn.net/Eartha1995/article/details/84930492
https://blog.csdn.net/Bruce_Wang02/article/details/81036796
https://blog.csdn.net/qq_35203425/article/details/81560118
https://www.jqr.com/article/000481