[CNN Basics] A Walkthrough of the Attention Mechanism (1): What is Attention in NLP?

Attention Mechanism Walkthrough (1): What is Attention in NLP?
Attention Mechanism Walkthrough (2): How does Attention derive BERT?
Attention Mechanism Walkthrough (3): What is Attention in CV?
Attention Mechanism Walkthrough (4): How to combine Attention in both NLP and CV?

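Contents

Part 0. Setting Out with Questions

Basic NLP concepts, such as tokenization and stemming
Deriving Attention

1. Understanding query, key, and value, and how they relate to self-attention
2. The physical meaning of the mask inside Attention
3. How is Multi-Head Attention introduced?
4. What exactly is Attention?

Representing The Order of The Sequence Using Positional Encoding

5. Building and understanding src_mask
What does "shifted right" mean for the decoder?
How exactly is the decoder trained? Can it run in parallel? Is its input fed word by word, or does a whole matrix go in at once?
6. In `class Batch:`, the shapes of src and tgt differ from the shape of src_mask. This is hard to trace from the ipynb code alone; stepping through it in the PyCharm debugger makes it clear
7. Understanding Positional Encoding
8. The meaning of self.trg and self.trg_y in `class Batch:`
9. Why does Attention matter, and why does it solve the parallelization problem?
10. "The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies." What does this mean?

Part 1. Understanding the Network
Part 2. Understanding the Dataset

2.1 Data Loading

2.1.1 Installing the data:
2.1.2 Reading the data
2.1.3 Data format

2.2 Make Batch
2.3 Synthetic Data

Part 3. Understanding Regularization (or the Loss)

3.1 Residual Dropout
3.2 Label Smoothing

3.2.1 Parameter settings
3.2.2 Understanding Label Smoothing
3.2.3 Example
3.2.4 Code

Part 4. Understanding the Learning Rate in the Optimizer

4.1 Original text:
4.2 Explanation
4.3 Derivation
4.4 Example
4.5 Code

Part 5. Results
Part 6. Examples

6.1 A First Example
6.2 A Real World Example

Part 7. Additional Components: BPE, Search, Averaging

Attention Visualization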

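References used in this article:

The Annotated Transformer: a code walkthrough of Attention; many of the translations found online are based on this article
Full translation of The Annotated Transformer (The Annotated Transformer全文翻译): a translated version of The Annotated Transformer
Models of The Annotated Transformer: the model code used in The Annotated Transformer
The Illustrated Transformer: shows the details of the transformer model with detailed figures, which helps deepen the understanding of the model
tensorflow-attention is all you need: A TensorFlow Implementation of Attention Is All You Need
Full translation of Attention Is All You Need (Attention Is All You Need全文翻译): a Chinese translation of Attention Is All You Need
Attention Mechanisms in Deep Learning (深度学习中的注意力机制): an accessible explanation of what Attention is, very well written
[Paper Notes] Attention is All You Need ([论文笔记]Attention is All You Need): a summary written by a colleague, with some interpretation of their own
Deep Learning: The Transformer Model (深度学习：transformer模型): adds the author's own reading to the figures and tables of the paper, explains things in plain language, and adds code for the key parts to aid understanding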

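Part 0. Setting Out with Questions

Here I list the questions I ran into while reading the paper and working through the code. I hope they help NLP beginners like me: setting out with questions in mind makes it easier to think things through.

Basic NLP concepts, such as tokenization and stemming: see "Basic Concepts of Natural Language Processing (NLP)" (《自然语言处理(NLP)的基本概念》)
Deriving Attention
1. Understanding query, key, and value, and how they relate to self-attention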

[Figure]
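Attention can also be viewed from another angle: key-value lookup. A key-value lookup has three basic elements: the query, the key, and the value. Think of it as looking something up in a dictionary: the key-value pairs form the dictionary, the user supplies a query, and the system finds the key equal to it and returns the corresponding value. But what if no key in the dictionary equals the query? The answer is to compute the similarity $w$ between the query and every existing key, assign these similarities as weights to all the values, and return their weighted sum. In the machine translation example of that article, the local information of the output sequence is the query, the local information of the input sequence is the key, $w$ is their similarity, and the value can simply be set to 1. From this analysis, Attention can also be understood as a kind of similarity measure. (Quoted from the "Key-Value Lookup" part of the "Attention Mechanism" section of Attention Mechanisms in Deep Learning, which is accessible and well worth studying.)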
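The "soft dictionary lookup" described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function name `soft_lookup`, the dot product as the similarity, and the toy keys and values are all made up for the example and do not come from the quoted article.

```python
import math

def soft_lookup(query, keys, values):
    """Return (weights, weighted sum of values) for one query vector."""
    # Similarity between the query and every key (dot product, an assumption).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the similarities into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical toy dictionary: no key equals the query exactly.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
weights, output = soft_lookup([2.0, 0.0], keys, values)
print(weights, output)
```

The first and third keys are equally similar to the query, so they receive equal weights, and the second key receives the smallest one.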

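[Figure]

[Figure]
Now look at the two figures above, together with the introduction to query, key, and value in "2. Details: Multi-Head Attention and Scaled Dot-Product Attention" of Transformer Model Notes (《Transformer模型笔记》). In the right figure, the input consists of n tokens, which a linear transformation maps to n embedding vectors. These vectors are multiplied by $W^{Q}$, $W^{K}$, and $W^{V}$ to obtain the Q, K, and V of the left figure. Q is multiplied by K, and the Scale, Mask, and SoftMax operations then give the similarity scores $w$, which are assigned as weights to the corresponding V, and the weighted sum is returned. As quoted earlier, "the local information of the output sequence is the query, the local information of the input sequence is the key, and $w$ is their similarity". Here, however, Q and K both come from the local information of the input sequence, so this kind of Attention is Self-Attention: every encoder position can attend to all positions in the output of the previous encoder layer, and what is ultimately learned is the relationships inside each sentence (grammatical structure and so on). The benefit of Self-Attention is that it connects two long-range dependent features of a sequence at O(1) cost, whereas an RNN may need to accumulate many time steps before it can relate them. Because no position has to wait for the result of the previous time step, Self-Attention also makes the network more parallelizable.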
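2. The physical meaning of the mask inside Attention
The first question, "Understanding query, key, and value", cleared up Q, K, and V. So what does the Mask in the Scaled Dot-Product Attention structure do? To answer this, I follow the "attention map" discussion in Transformer Model Notes.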

[Figure]

[Figure]
Note the formula $\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$: $Q K^{T}$ actually forms a word-to-word attention map (after the softmax, each row becomes a set of weights that sum to 1). For example, if the input is the sentence "i have a dream", four words in total, this produces a 4x4 attention map (in general an NxN attention map, where N is the sequence length, i.e. the number of tokens). Note that the encoder uses self-attention (presumably without this mask over future positions), while the decoder uses masked self-attention. "Masked" here means that in language modelling (or in a task like translation) the model is not allowed to see future information.
Concretely, "I", the first word, can only attend to itself. "have", the second word, attends to "I" and "have". "a", the third word, attends to the first three words "I", "have", and "a". Only the last word, "dream", attends to all four words of the sentence.
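To make the "i have a dream" example concrete, here is a minimal pure-Python sketch of a masked softmax over a 4x4 score matrix. The raw scores are an assumption (all zeros, so that the surviving weights come out uniform); in a real model they would be $QK^{T}/\sqrt{d_k}$. The helper names are mine and are not the ones used in The Annotated Transformer.

```python
import math

def future_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax; masked-out scores are set to -inf beforehand."""
    rows = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if keep else float("-inf")
                  for s, keep in zip(score_row, mask_row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

tokens = ["i", "have", "a", "dream"]
# Assumed raw scores: all zeros, standing in for QK^T / sqrt(d_k).
scores = [[0.0] * len(tokens) for _ in tokens]
attn = masked_softmax(scores, future_mask(len(tokens)))
for tok, row in zip(tokens, attn):
    print(tok, [round(w, 2) for w in row])
```

Each row still sums to 1, but every entry above the diagonal is exactly zero, which is the lower-triangular pattern described in the paragraph above.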
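3. How is Multi-Head Attention introduced?

[Figure]

[Figure]
The first two questions analyzed Q, K, V, and the Mask of Scaled Dot-Product Attention in detail. Next, let us look at how Multi-Head Attention is introduced.
a. In self-attention, a very long input sentence produces an NxN attention map, which can blow up memory.
The paper proposes the Multi-Head Attention mechanism to improve the performance of Attention (it does not shrink the NxN map itself). The improvement shows in two respects:

It extends the model's ability to attend to different positions. In the left figure above, for example, $\mathbf{z}_{1}$ contains some encoding of the words other than its own word "Thinking", but it is still dominated by the information of its own word. With Multi-Head, different heads can attend to different positions, which extends the model's ability to attend to different positions. For example, in the sentence "The animal didn't cross the street because it was too tired", we usually want to know what "it" refers to.
Multi-Head Attention gives the attention layer multiple representation subspaces. For example, the Transformer uses 8 attention heads, that is, 8 sets of Query/Key/Value weight matrices, to process the same input, and each set is randomly initialized. After training, each set of weight matrices projects the input embedding into a different representation subspace.

b. Another problem arises: the feed-forward layer does not expect 8 matrices as input, so how do we compress {$\mathbf{z}_{0}$, …, $\mathbf{z}_{7}$} into one?
As the right figure above shows, concatenate {$\mathbf{z}_{0}$, …, $\mathbf{z}_{7}$} and multiply the concatenated matrix by $W^{O}$ to obtain $\mathbf{z}$.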
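The concatenate-then-project step can be sketched with tiny hand-written matrices. Everything here is an assumption for illustration: two heads instead of eight, made-up head outputs `z0` and `z1`, and an identity matrix standing in for the learned $W^{O}$, so that the result is easy to check by eye.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def concat_heads(heads):
    """Concatenate the per-head outputs along the feature axis, token by token."""
    n_tokens = len(heads[0])
    return [sum((head[i] for head in heads), []) for i in range(n_tokens)]

n_heads, d_v = 2, 2            # assumed toy sizes (the paper uses 8 heads)
d_model = n_heads * d_v

# Hypothetical outputs of two heads for a 2-token input: each is (n_tokens x d_v).
z0 = [[1.0, 2.0], [3.0, 4.0]]
z1 = [[5.0, 6.0], [7.0, 8.0]]

# Identity matrix as a stand-in for the learned W^O (d_model x d_model).
w_o = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]

z = matmul(concat_heads([z0, z1]), w_o)
print(z)
```

The feed-forward layer then receives a single (n_tokens x d_model) matrix, regardless of how many heads produced it.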
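c. Next, let us look at the complete Multi-Head Attention process:

[Figure]
d. The paper uses Multi-Head Attention in three places.

In the encoder-decoder attention layers of the decoder, the query comes from the output of the previous decoder layer, and the key and value come from the encoder output. This lets every decoder position attend to all positions of the input.
The encoder contains self-attention layers, in which all keys, values, and queries come from the previous encoder layer. Every encoder position can thus attend to all positions in the output of the previous encoder layer.
The decoder contains self-attention layers whose function is similar to self-attention in the encoder. The Masked Multi-Head Attention layer in the decoder uses the information at all positions of the preceding decoder block to obtain the query. To make the query depend only on the words already known, completely unaffected by later words, a mask is added: the scores at the corresponding (future) positions of the matrix are set to $-\infty$ before the softmax. For example, given the input "我爱喝可乐" ("I love drinking cola"), after "I like" has been translated, the attention layer learns the query (next-token probability) from "I like" (for example, that a noun should come next), and learns the key and value from the encoder output (for example, "drink"). Here the mask prevents the later "drink cola" from interfering with this step. (This example is not very good: it does not make the role of the Mask in Masked Multi-Head Attention clear. It is better understood together with the "I have a dream" example in "2. The physical meaning of the mask inside Attention".) (Quoted from [Paper Notes] Attention is All You Need.)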

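4. What exactly is Attention?
Attention Mechanisms in Deep Learning explains attention so well that I cannot resist quoting it here for the record. For more details, please go to the original author's blog.

First, in terms of the mathematical formula and the code, Attention can be understood as a weighted sum.
Second, in terms of form, Attention can be understood as key-value lookup.
Finally, in terms of physical meaning, Attention can be understood as a similarity measure.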

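Representing The Order of The Sequence Using Positional Encoding
References: Why the "Transformer" Is So Powerful: A Full Analysis of Google's Tensor2Tensor System, from Model to Code (“变形金刚”为何强大：从模型到代码全面解析Google Tensor2Tensor系统)
Deep Learning: The Transformer Model (深度学习：transformer模型)
a. What problem does Positional Encoding solve?
So far, our Transformer model cannot capture the order of the words in the input sequence. The way Self-Attention models a sequence is neither the temporal view of an RNN nor the structured view of a CNN, but a bag-of-words view. More precisely, the mechanism treats a sequence as a flat structure: however far apart two words appear to be, their distance in self-attention is always 1. This way of modelling loses the relative distances between words. For example, the three sentences "牛吃了草" ("the cow ate the grass"), "草吃了牛" ("the grass ate the cow"), and "吃了牛草" ("ate the cow grass") yield the same representation for each word. In other words, however the structure of the sentence is shuffled, the Transformer gets similar results.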
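The bag-of-words behaviour can be checked with a small pure-Python sketch. It is a simplification under stated assumptions: unmasked self-attention with Q = K = V = the raw embeddings (no learned projections), and three made-up 2-dimensional word vectors. Shuffling the sentence only shuffles the rows of the output; each word keeps the same representation.

```python
import math

def self_attention(x):
    """Unmasked self-attention with Q = K = V = x (identity projections assumed)."""
    d_k = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(d_k)])
    return out

# Hypothetical embeddings; no positional information is added.
emb = {"cow": [1.0, 0.0], "ate": [0.0, 1.0], "grass": [1.0, 1.0]}
s1 = ["cow", "ate", "grass"]
s2 = ["grass", "ate", "cow"]

out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))
for word in emb:
    print(word, out1[word], out2[word])
```

Up to floating-point rounding, the two printed vectors for each word agree, so the model cannot tell the two word orders apart without some position signal.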
b. How can a Position Vector represent the order of the words in a sequence?
To address the problem that the Transformer cannot capture sequence order, the Transformer takes the input embedding $\mathbf{w}=\left(w_{1}, \dots, w_{m}\right)$ as input and has the model learn a special representation, the Position Vector $\mathbf{p}=\left(p_{1}, \dots, p_{m}\right)$. Intuitively, the simplest approach is to add the two to obtain the representation of each input element, $\mathbf{e}=\left(w_{1}+p_{1}, \ldots, w_{m}+p_{m}\right)$, as shown below:

[Figure]

[Figure]
What exactly does the `Position Vector` mentioned above represent?
- the position of each input element or word
- the distance between different words in the sequence
At this point we are familiar with the general definition of positional embedding. For more details, see the paper Convolutional Sequence to Sequence Learning.
c. How does the Positional Encoding in the paper obtain the Position Vector?
First, what does the Position Vector actually look like?
The figure below shows the Position Vector $\mathbf{p}=\left(p_{1}, \dots, p_{m}\right)$ obtained by positional encoding for 20 words. It is a 20x512 matrix whose 20 rows correspond to the 20 words: each row is the Position Vector of the word at that position, with embedding size 512 and values in [-1, 1].

[Figure]
Second, why does the Position Vector matrix of these 20 words look split down the middle?
This is because the left half is produced by the sine function and the right half by the cosine function. The point that is hard to grasp is that the horizontal axis only says there are 512 slots; the slots do not correspond one-to-one to the dimension index in the formula. The left half holds the values computed with sin for the even dimension indices $2i$, and the right half holds the values computed with cos for the odd dimension indices $2i+1$. The two halves are then concatenated, rather than being arranged in the order $1, 2, \dots, 512$. Another explanation is: "It should also be pointed out that the paper's description of alternating sin and cos according to the parity of the dimension index is not how the code implements it. Instead, the first half of the dimensions uses sin and the second half uses cos, without considering parity." (Quoted from the Tensor2Tensor system analysis.)
Finally, the formula for computing the Position Vector:
$$\begin{array}{l}{P E_{(pos, 2 i)}=\sin \left(pos / 10000^{2 i / d_{model}}\right)} \\ {P E_{(pos, 2 i+1)}=\cos \left(pos / 10000^{2 i / d_{model}}\right)}\end{array}$$
Here $pos$ is the position of the word, $i$ indexes the dimension, and $d_{model}$ is the embedding dimension, 512. And of course, do not forget the final addition $\mathbf{e}=\left(w_{1}+p_{1}, \ldots, w_{m}+p_{m}\right)$.
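The formula can be written out directly in plain Python. This is a minimal sketch that follows the paper's interleaved layout (sin in even slots, cos in odd slots), not the concatenated layout of the plotted matrix; the function names and the zero "word embeddings" used for the final addition are my own placeholders.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table following the PE formula."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):       # two_i plays the role of 2i
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)      # PE(pos, 2i)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return pe

def add_position(word_vecs, pe):
    """Element-wise sum e = w + p for each position."""
    return [[w + p for w, p in zip(wv, pv)] for wv, pv in zip(word_vecs, pe)]

pe = positional_encoding(20, 512)                 # the 20 x 512 case from the text
word_vecs = [[0.0] * 512 for _ in range(20)]      # placeholder embeddings
e = add_position(word_vecs, pe)
print(len(pe), len(pe[0]), pe[0][:4])
```

Position 0 always encodes to the pattern 0, 1, 0, 1, ... because sin(0) = 0 and cos(0) = 1.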
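d. Why can sin and cos represent position information?

The design of this formula relies heavily on prior assumptions, especially the denominator, and is not easy to explain. In the author's personal view, on the one hand, trigonometric functions are periodic, so the value of the dependent variable repeats at fixed intervals, a property that can be used to model relative distance; on the other hand, the range of trigonometric functions is [-1, 1], which provides well-behaved values for the embedding elements. (Quoted from the Tensor2Tensor system analysis.)
For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. In NLP tasks, besides the absolute position of a word, the relative position of words also matters a great deal. By the identities $\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$ and $\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$, the position vector of position $pos+k$ can be expressed as a linear transformation of the position vector of position $pos$. This makes it much easier for the model to capture the relative positions of words, which is how the encoding represents position information.
A learned positional embedding may be limited by the size of its lookup table, just as word vectors are limited by the vocabulary size. That is, it can only learn representations such as "the vector for position 2 is (1,1,1,2)". The trigonometric formula is clearly not limited by sequence length, so it can represent sequences longer than any encountered. (Quoted from Deep Learning: The Transformer Model.)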
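The linearity claim in the second point can be written out for one sin/cos pair. The shorthand $\omega_i = 1/10000^{2i/d_{model}}$ is my own notation, introduced only for this derivation; substituting $\alpha = \omega_i\, pos$ and $\beta = \omega_i k$ into the two angle-addition identities gives:

```latex
\begin{aligned}
PE_{(pos+k,\,2i)}   &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
PE_{(pos+k,\,2i+1)} &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k) \\[4pt]
\begin{pmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{pmatrix}
&=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{pmatrix}
\end{aligned}
```

The 2x2 matrix depends only on the offset $k$ and the dimension index $i$, not on $pos$, which is exactly what "a linear function of $PE_{pos}$" means here.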

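On positional encoding, the authors also tried learned positional embeddings and obtained nearly identical results. They finally chose the sinusoidal encoding because it may also work when test sentences are longer than those seen in training. (BERT uses the learned approach.) (Quoted from [Paper Notes] Attention is All You Need.)

Here I record a question I have not fully understood: why is a learned positional embedding limited by the dictionary size, while the trigonometric formula is clearly not limited by sequence length and can represent sequences longer than any encountered?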
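One way to look at this question, offered only as a sketch: a learned positional embedding is a lookup table with one trained row per position, so a position beyond the table has no row at all, whereas the sinusoidal encoding is a closed-form function that can be evaluated at any position. The table size, the zero rows standing in for trained values, and the function names below are all hypothetical. Whether a trained model actually generalizes well to such longer sequences is a separate, empirical matter that this sketch does not address.

```python
import math

MAX_LEN, D_MODEL = 4, 4   # hypothetical limits fixed at training time

# Learned variant: one row per position seen in training
# (zeros stand in for the trained values).
learned_table = [[0.0] * D_MODEL for _ in range(MAX_LEN)]

def learned_encoding(pos):
    """Look the position up in the table; fails once pos >= MAX_LEN."""
    return learned_table[pos]

def sinusoid_encoding(pos, d_model=D_MODEL):
    """Closed-form encoding, defined for any non-negative position."""
    vec = []
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec[:d_model]

try:
    learned_encoding(10)          # position 10 was never in the table
    learned_ok = True
except IndexError:
    learned_ok = False

long_vec = sinusoid_encoding(10)  # still computable
print(learned_ok, long_vec)
```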

e. The concrete code implementation of Position Encoding (I will come back to this in a few days)
The implementation of positional encoding can be found in the get_timing_signal_1d() function of Google's open-source code and in The Annotated Transformer. The code is excerpted below, with some comments removed:

```python
# tensorflow version (written against the TensorFlow 1.x API: tf.to_float, tf.mod)
import math
import tensorflow as tf

def get_timing_signal_1d(length, channels, min_timescale=1.0,
                         max_timescale=1.0e4, start_index=0):
    position = tf.to_float(tf.range(length) + start_index)
    num_timescales = channels // 2
    log_timescale_increment = (
        math.log(float(max_timescale) / float(min_timescale)) /
        (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(
        tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    # First half of the channels is sin, second half is cos (concatenated, not interleaved).
    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    signal = tf.pad(signal, [[0, 0], [0, tf.mod(channels, 2)]])
    signal = tf.reshape(signal, [1, length, channels])
    return signal
```

```python
# pytorch version
import math
import torch
import torch.nn as nn
from torch.autograd import Variable

class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even slots: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd slots: cos
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the (non-trainable) encoding for the first x.size(1) positions.
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)
```

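5. Building and understanding src_mask
What does "shifted right" mean for the decoder?
How exactly is the decoder trained? Can it run in parallel? Is its input fed word by word, or does a whole matrix go in at once?
6. In class Batch:, the shapes of src and tgt differ from the shape of src_mask. This is hard to trace from the ipynb code alone; stepping through it in the PyCharm debugger makes it clear
7. Understanding Positional Encoding
8. The meaning of self.trg and self.trg_y in class Batch:
9. Why does Attention matter, and why does it solve the parallelization problem?
10. "The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies." What does this mean?
Part 1. Understanding the Network

[Figure]

[Figure]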

Part 2. Understanding the Dataset
2.1 Data Loading
2.1.1 Installing the data:

```python
#!pip install torchtext spacy
#!python -m spacy download en
#!python -m spacy download de
```

2.1.2 Reading the data

```python
# For data loading.
# Uses the legacy torchtext `data.Field` API and the old spaCy shortcut names 'de'/'en'.
from torchtext import data, datasets

if True:
    import spacy
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]

    BOS_WORD = '<s>'
    EOS_WORD = '</s>'
    BLANK_WORD = "<blank>"
    SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
    TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD,
                     eos_token = EOS_WORD, pad_token=BLANK_WORD)

    MAX_LEN = 100
    # Keep only sentence pairs where both sides have at most MAX_LEN tokens.
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT),
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
```

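2.1.3 Data format
While reading the code I was long puzzled about what the data actually looks like; stepping through it in debug mode finally cleared things up.

[Figure]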

2.2 Make Batch

```python
import torch
from torch.autograd import Variable
# subsequent_mask(size) is defined elsewhere in The Annotated Transformer.

class Batch:
    "Object for holding a batch of data with mask during training."
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]    # decoder input: drop the last token
            self.trg_y = trg[:, 1:]   # prediction target: drop the first token
            self.trg_mask = \
                self.make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask


def rebatch(pad_idx, batch):
    "Fix order in torchtext to match ours"
    src, trg = batch.src.transpose(0, 1), batch.trg.transpose(0, 1)
    return Batch(src, trg, pad_idx)  # build the batch by calling Batch
```
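The split into `self.trg` and `self.trg_y` is easier to see on a single toy sentence. The sketch below mirrors the slicing and masking logic of `Batch` with plain Python lists instead of tensors; the token ids (`PAD`, `BOS`, `EOS`, and the word ids 5 and 6) and the helper names are hypothetical. Because the mask already hides future positions, the whole decoder input can go in as one matrix during training rather than word by word.

```python
PAD, BOS, EOS = 0, 1, 2   # hypothetical token ids

def make_targets(trg_row):
    """Split one target sentence into decoder input and prediction labels."""
    trg_in = trg_row[:-1]   # what the decoder sees (the "shifted right" input)
    trg_y = trg_row[1:]     # what the decoder must predict at each step
    return trg_in, trg_y

def make_std_mask(trg_in, pad=PAD):
    """mask[i][j] is True when step i may look at input token j:
    token j must not be padding and must not lie in the future (j <= i)."""
    n = len(trg_in)
    return [[(trg_in[j] != pad) and (j <= i) for j in range(n)]
            for i in range(n)]

sentence = [BOS, 5, 6, EOS, PAD, PAD]
trg_in, trg_y = make_targets(sentence)
mask = make_std_mask(trg_in)
ntokens = sum(1 for tok in trg_y if tok != PAD)   # tokens that count in the loss

print("decoder input :", trg_in)
print("labels        :", trg_y)
print("ntokens       :", ntokens)
for row in mask:
    print([int(flag) for flag in row])
```

At step i the decoder input ends with token i and the label is token i+1, so the model always predicts the next token from what came before.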

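2.3 Synthetic Data
Synthesizing fake data here helps deepen the understanding of the NLP data format.

```python
import numpy as np
import torch
from torch.autograd import Variable

def data_gen(V, batch, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        # Random token ids in [1, V); 0 is reserved for padding.
        data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
        data[:, 0] = 1  # every sequence starts with token 1
        src = Variable(data, requires_grad=False)
        tgt = Variable(data, requires_grad=False)
        yield Batch(src, tgt, 0)
```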

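Part 3. Understanding Regularization (or the Loss)
We employ three types of regularization during training.
3.1 Residual Dropout
We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.
3.2 Label Smoothing

Here LabelSmoothing computes a KL divergence.

3.2.1 Parameter settings
During training, we employed label smoothing of value $\epsilon_{ls}=0.1$ (cite). This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.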

We implement label smoothing using the KL div loss. Instead of using a one-hot target distribution, we create a distribution that has confidence of the correct word and the rest of the smoothing mass distributed throughout the vocabulary.
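The target distribution described above can be built by hand for one token. This is a plain-Python sketch that follows the same rule as the `LabelSmoothing` class in The Annotated Transformer (confidence on the correct word, the smoothing mass spread over `size - 2` other entries, nothing on the padding index); the function name `smoothed_target` is my own.

```python
def smoothed_target(size, target, padding_idx=0, smoothing=0.1):
    """Return the smoothed target distribution for one token.

    size: vocabulary size; target: index of the correct word.
    """
    if target == padding_idx:
        # A padding position contributes nothing to the loss.
        return [0.0] * size
    # Spread the smoothing mass over every entry except the target and the padding slot.
    dist = [smoothing / (size - 2)] * size
    dist[target] = 1.0 - smoothing      # confidence on the correct word
    dist[padding_idx] = 0.0             # never assign mass to padding
    return dist

dist = smoothed_target(size=5, target=2, smoothing=0.4)
print(dist, sum(dist))
```

The divisor is `size - 2` because two entries are excluded from the smoothing mass: the correct word and the padding index.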

3.2.2 Understanding Label Smoothing

Label smoothing actually starts to penalize the model if it gets very confident about a given choice.

Understanding label_smoothing
Understanding the label function plot
The label_smoothing paper

3.2.3 Example
I do not fully understand the figures here.

```python
# Example of label smoothing.
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.autograd import Variable

crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
         Variable(torch.LongTensor([2, 1, 0])))

# Show the target distributions expected by the system.
plt.imshow(crit.true_dist)
None
```

[Figure]

```python
crit = LabelSmoothing(5, 0, 0.1)
def loss(x):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d],])
    #print(predict)
    # .item() reads the scalar loss (the original .data[0] fails on 0-dim tensors).
    return crit(Variable(predict.log()),
                Variable(torch.LongTensor([1]))).item()
plt.plot(np.arange(1, 100), [loss(x) for x in range(1, 100)])
None
```

[Figure]

3.2.4 Code

```python
# pytorch version
import torch
import torch.nn as nn
from torch.autograd import Variable

class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        # Spread the smoothing mass, then put the confidence on the correct word.
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        # Zero out the rows whose target is padding.
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))

# tensorflow version
# please refer to https://github.com/tensorflow/cleverhans/blob/f70ca7e000dadd6ace5aeff15bba0e960e8c1384/cleverhans_tutorials/mnist_tutorial_tf.py#L126
```

Part 4. Understanding the Learning Rate in the Optimizer
4.1 Original text:
We used the Adam optimizer (cite) with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training, according to the formula:
$$\text{lrate}=d_{\text{model}}^{-0.5} \cdot \min \left(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right)$$
This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps=4000$.
4.2 Explanation
In other words: the optimizer is Adam with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$, and the learning rate changes over the course of training according to the formula above. It increases linearly over the first warmup_steps training steps, and then decreases in proportion to the inverse square root of the step number.
4.3 Derivation
Let $\text{step\_num}^{-0.5} > \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}$. This gives $\text{warmup\_steps}^{3/2} > \text{step\_num}^{3/2}$, hence $\text{warmup\_steps} > \text{step\_num}$, that is,
$$\text{lrate}= \left\{\begin{array}{ll}{d_{\text{model}}^{-0.5} \cdot \text{step\_num} \cdot \text{warmup\_steps}^{-1.5},} & {\text{if } \text{step\_num} < \text{warmup\_steps}} \\ {d_{\text{model}}^{-0.5} \cdot \text{step\_num}^{-0.5},} & {\text{if } \text{step\_num} \geq \text{warmup\_steps}}\end{array}\right.$$
4.4 Example
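As a quick numeric check of the schedule, here is a minimal sketch that evaluates the formula directly. The function name `noam_rate` is my own; the default values $d_{model}=512$ and $warmup\_steps=4000$ are the ones given in the text, and the step number must be at least 1.

```python
def noam_rate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given step (step >= 1), following the lrate formula."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear growth during warmup: doubling the step doubles the rate.
print(noam_rate(100), noam_rate(200))
# The peak sits at step == warmup_steps.
print(noam_rate(3999), noam_rate(4000), noam_rate(4001))
# Inverse-square-root decay afterwards: 4x the steps halves the rate.
print(noam_rate(4000), noam_rate(16000))
```

With these defaults the peak learning rate is roughly 7e-4, reached exactly where the two branches of the piecewise formula meet.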

[Figure]

4.5 Code

```python
# pytorch version
# refer to "Optimizer" part of https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb
# (a method of the NoamOpt optimizer wrapper class)
def rate(self, step = None):
    "Implement `lrate` above"
    if step is None:
        step = self._step
    return self.factor * \
        (self.model_size ** (-0.5) *
        min(step ** (-0.5), step * self.warmup ** (-1.5)))
```

```python
# tensorflow version
# refer to https://github.com/Kyubyong/transformer/blob/6715edcb79022b1a92ba7b9edd1b3c6b53cebf28/modules.py#L303
import tensorflow as tf

def noam_scheme(init_lr, global_step, warmup_steps=4000.):
    '''Noam scheme learning rate decay
    init_lr: initial learning rate. scalar.
    global_step: scalar.
    warmup_steps: scalar. During warmup_steps, learning rate increases
        until it reaches init_lr.
    '''
    step = tf.cast(global_step + 1, dtype=tf.float32)
    return init_lr * warmup_steps ** 0.5 * tf.minimum(step * warmup_steps ** -1.5, step ** -0.5)
```

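Part 5. Results

The Transformer (big) model trained for English-to-French used dropout rate $P_{drop} = 0.1$, instead of 0.3.

On the WMT 2014 English-to-German task, it improves on the previous best results by over 2.0 BLEU, reaching a state-of-the-art 28.4
Even the base model surpasses all previous methods
On the WMT 2014 English-to-French task, the big model reaches 41.8, outperforming all previously published single models, at less than 1/4 of the training cost of the previous state-of-the-art method

[Figure]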

Part 6. Examples
6.1 A First Example
We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.
This part covers:

Synthetic Data: Introduce how to generate fake data.
Loss Computation: Introduce how to calculate the loss function.
Greedy Decoding: This code predicts a translation using greedy decoding for simplicity.
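The greedy decoding loop for the copy task can be sketched without a trained model. In the code below, `toy_scores` is a hypothetical stand-in for the Transformer: it simply scores the next source symbol highest, and the end symbol once the source is exhausted. The token ids and function names are my own assumptions; the real notebook runs the trained model at each step instead.

```python
BOS, EOS = 1, 2   # hypothetical start and end token ids

def toy_scores(src, prefix, vocab_size):
    """Stand-in for a trained model: score every candidate next token."""
    pos = len(prefix) - 1                 # tokens generated so far (prefix starts with BOS)
    wanted = src[pos] if pos < len(src) else EOS
    return [1.0 if tok == wanted else 0.0 for tok in range(vocab_size)]

def greedy_decode(src, max_len, vocab_size, score_fn=toy_scores):
    """Repeatedly append the single highest-scoring next token."""
    ys = [BOS]
    for _ in range(max_len):
        scores = score_fn(src, ys, vocab_size)
        next_tok = max(range(vocab_size), key=scores.__getitem__)
        ys.append(next_tok)
        if next_tok == EOS:
            break
    return ys

decoded = greedy_decode(src=[5, 3, 7, 4], max_len=10, vocab_size=8)
print(decoded)
```

Greedy decoding keeps only one hypothesis per step, which is why it is simple; beam search, listed among the additional components later, keeps several.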

For more information, please refer to https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb
6.2 A Real World Example

Data Loading: We will load the dataset using torchtext and spacy for tokenization.
Iterators
Multi-GPU Training
Training the System
For more information, please refer to https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb

Part 7. Additional Components: BPE, Search, Averaging
This part introduces other features of the OpenNMT-based transformer implementation:

BPE/ Word-piece
Shared Embeddings
Beam Search
Model Averaging
For more information, please refer to https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb

Attention Visualization