Transformer模型详解及代码实现

一，Transformer 输入

Transformer 中单词的输入表明 x 由单词 Embedding 和方位 Embedding （Positional Encoding）相加得到，通常界说为 TransformerEmbedding 层，其代码完结如下所示:

1.1，单词 Embedding

单词的 Embedding 有很多种方式能够获取，例如能够选用 Word2Vec、Glove 等算法预练习得到，也能够在 Transformer 中练习得到。

1.2，方位 Embedding

Transformer 中除了单词的 Embedding，还需要运用方位 Embedding 表明单词出现在语句中的方位。由于 Transformer 不选用 RNN 的结构，而是运用全局信息，不能运用单词的顺序信息，而这部分信息对于 NLP 来说非常重要。所以 Transformer 中运用方位 Embedding 保存单词在序列中的相对或绝对方位。

方位 Embedding 用 PE 表明，PE 的维度与单词 Embedding 是相同的。PE 能够经过练习得到，也能够运用某种公式核算得到。在 Transformer 中选用了后者，核算公式如下：

PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}})

其间，pos 表明单词在语句中的方位，d 表明 PE的维度 (与词 Embedding 相同)，2i 表明偶数的维度，2i+1 表明奇数维度 (即 2i≤d, 2i+1≤d)。

1.1，TransformerEmbedding 层完结

class PositionalEncoding(nn.Module):
    """
    compute sinusoid encoding.
    """
    def __init__(self, d_model, max_len, device):
        """
        constructor of sinusoid encoding class
        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()
        # same size with input matrix (for adding with input matrix)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradient
        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent word's position
        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # 'i' means index of d_model (e.g. embedding size = 50, 'i' = [0,50])
        # "step=2" means 'i' multiplied with two (same with 2 * i)
        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute positional encoding to consider positional information of words
    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]
        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]
        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will add with tok_emb : [128, 30, 512]         
class TokenEmbedding(nn.Embedding):
    """
    Token Embedding using torch.nn
    they will dense representation of word using weighted matrix
    """
    def __init__(self, vocab_size, d_model):
        """
        class for token embedding that included positional information
        :param vocab_size: size of vocabulary
        :param d_model: dimensions of model
        """
        super(TokenEmbedding, self).__init__(vocab_size, d_model, padding_idx=1)
class TransformerEmbedding(nn.Module):
    """
    token embedding + positional encoding (sinusoid)
    positional encoding can give positional information to network
    """
    def __init__(self, vocab_size, max_len, d_model, drop_prob, device):
        """
        class for word embedding that included positional information
        :param vocab_size: size of vocabulary
        :param d_model: dimensions of model
        """
        super(TransformerEmbedding, self).__init__()
        self.tok_emb = TokenEmbedding(vocab_size, d_model)
        self.pos_emb = PositionalEncoding(d_model, max_len, device)
        self.drop_out = nn.Dropout(p=drop_prob)
    def forward(self, x):
        tok_emb = self.tok_emb(x)
        pos_emb = self.pos_emb(x)
        return self.drop_out(tok_emb + pos_emb)

二，Transformer 整体结构

论文中给出用于中英文翻译使命的 Transformer 整体架构如下图所示：

能够看出 Transformer 架构由 Encoder 和 Decoder 两个部分组成：其间 Encoder 和 Decoder 都是由 N=6 个相同的层堆叠而成。Multi-Head Attention 结构是 Transformer 架构的中心结构，其由多个 Self-Attention 组成的。

Transformer 架构更具体的可视化图如下所示:

2.1，Transformer 发展史

以下是 Transformer 模型（简短）历史中的一些关键节点：

Transformer 架构于 2017 年 6 月推出。原本研讨的要点是翻译使命。随后推出了几个有影响力的模型，包括

2018 年 6 月: GPT, 第一个预练习的 Transformer 模型，用于各种 NLP 使命并获得极好的结果
2018 年 10 月: BERT, 另一个大型预练习模型，该模型旨在生成更好的语句摘要（下一章将具体介绍！）
2019 年 2 月: GPT-2, GPT 的改善（而且更大）版别，由于品德问题没有当即公开发布
2019 年 10 月: DistilBERT, BERT 的提炼版别，速度进步 60%，内存减轻 40%，但仍保留 BERT 97% 的功能
2019 年 10 月: BART 和 T5, 两个运用与原始 Transformer 模型相同架构的大型预练习模型（第一个这样做）
2020 年 5 月: GPT-3, GPT-2 的更大版别，无需微调即可在各种使命上表现杰出（称为零样本学习）

这个列表并不全面，仅仅为了突出一些不同类型的 Transformer 模型。大体上，它们能够分为三类：

GPT-like (也被称作自回归 Transformer 模型)
BERT-like (也被称作自动编码 Transformer 模型)
BART/T5-like (也被称作序列到序列的 Transformer 模型)

Transformer 是大模型，除了一些特例（如 DistilBERT）外，完结更好功能的一般策略是添加模型的大小以及预练习的数据量。其间，GPT-2 是运用「transformer 解码器模块」构建的，而 BERT 则是经过「transformer 编码器」模块构建的。

模型	发布时刻	参数量	预练习数据量
GPT	2018 年 6 月	1.17 亿	约 5GB
GPT-2	2019 年 2 月	15 亿	40GB
GPT-3	2020 年 5 月	1,750 亿	45TB

三，Multi-Head Attention 结构

Encoder 和 Decoder 结构中公共的 layer 之一是 Multi-Head Attention，其是由多个 Self-Attention 并行组成的。Encoder block 只包括一个 Multi-Head Attention，而 Decoder block 包括两个 Multi-Head Attention (其间有一个用到 Masked)。

3.1，Self-Attention 结构

Self-Attention 中文翻译为自留意力机制，论文中叫作 Scale Dot Product Attention，它是 Transformer 架构的中心，其结构如下图所示：

那么要点来了，第一个问题：Self-Attention 结构的最初输入 Q(查询), K(键值), V(值) 这三个矩阵怎样了解呢？其代表什么，经过什么核算而来？

在 Self-Attention 中，Q、K、V 是在同一个输入（比方序列中的一个单词）上核算得到的三个向量。具体来说，咱们能够经过对原始输入词的 embedding 进行线性变换（比方运用一个全衔接层），来得到 Q、K、V。这三个向量的维度通常都是相同的，取决于模型设计时的决议计划。

第二个问题：Self-Attention 结构怎样了解，Q、K、V的作用是什么？这三个矩阵又怎样核算得到终究的输出？

在核算 Self-Attention 时，Q、K、V 被用来核算留意力分数，即用于表明当时方位和其他方位之间的联系。留意力分数能够经过 Q 和 K 的点积来核算，然后将分数除以 8，再经过一个 softmax 归一化处理，得到每个方位的权重。然后用这些权重来加权核算 V 的加权和，即得到当时方位的输出。

将分数除以 8 的操作，对应图中的 Scale 层，这个参数 8 是 K 向量维度 64 的平方根结果。

3.2，Self-Attention 完结

文章的 Self-Attention 层和论文中的 ScaleDotProductAttention 层含义是相同的。

输入序列单词的 Embedding Vector 经过线性变换（Linear 层）得到 Q、K、V 三个向量，并将它们作为 Self-Attention 层的输入。假定输入序列的长度为 seq_len，则 Q、K 和 V 的形状为（seq_len，d_k），其间， $dk\text{d}_{\text{k}}$ 表明每个词或向量的维度，也是 $Q$ 、 $K$ 矩阵的列数。在论文中，输入给 Self-Attention 层的 Q、K、V 的向量维度是 64， Embedding Vector 和 Encoder-Decoder 模块输入输出的维度都是 512。

Embedding Vector 的大小是咱们能够设置的超参数—基本上它便是咱们练习数据会集最长语句的长度。

Self-Attention 层的核算进程用数学公式可表达为:

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax} (\frac{QK^T}{\sqrt{d_k}})V

以下是一个示例代码，它创立了一个 ScaleDotProductAttention 层，并将 Q、K、V 三个张量传递给它进行核算：

class ScaleDotProductAttention(nn.Module):
    def __init__(self, ):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim = -1)
    def forward(self, Q, K, V, mask=None):
        K_T = K.transpose(-1, -2) # 核算矩阵 K 的转置  
        d_k = Q.size(-1)
        # 1, 核算 Q, K^T 矩阵的点积，再除以 sqrt(d_k) 得到留意力分数矩阵
        scores = torch.matmul(Q, K_T) / math.sqrt(d_k)
        # 2, 如果有掩码，则将留意力分数矩阵中对应掩码方位的值设为负无穷大
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # 3, 对留意力分数矩阵按照终究一个维度进行 softmax 操作，得到留意力权重矩阵，值规模为 [0, 1]
        attn_weights = self.softmax(scores)
        # 4, 将留意力权重矩阵乘以 V，得到终究的输出矩阵
        output = torch.matmul(attn_weights, V)
        return output, attn_weights
# 创立 Q、K、V 三个张量
Q = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)
K = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)
V = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)
# 创立 ScaleDotProductAttention 层
attention = ScaleDotProductAttention()
# 将 Q、K、V 三个张量传递给 ScaleDotProductAttention 层进行核算
output, attn_weights = attention(Q, K, V)
# 打印输出矩阵和留意力权重矩阵的形状
print(f"ScaleDotProductAttention output shape: {output.shape}") # torch.Size([5, 10, 64])
print(f"attn_weights shape: {attn_weights.shape}") # torch.Size([5, 10, 10])

3.3，Multi-Head Attention

Multi-Head Attention (MHA) 是根据 Self-Attention (SA) 的一种变体。MHA 在 SA 的基础上引入了“多头”机制，将输入拆分为多个子空间，每个子空间别离履行 SA，终究将多个子空间的输出拼接在一起并进行线性变换，然后得到终究的输出。

对于 MHA，之所以需要对 Q、K、V 进行多头（head）区分，其意图是为了增强模型对不同信息的重视。具体来说，多组 Q、K、V 别离核算 Self-Attention，每个头天然就会有独立的 Q、K、V 参数，然后让模型一起重视多个不同的信息，这有些相似 CNN 架构模型的多通道机制。

下图是论文中 Multi-Head Attention 的结构图。

从图中能够看出， MHA 结构的核算进程可总结为下述进程:

将输入 Q、K、V 张量进行线性变换（Linear 层），输出张量尺度为 [batch_size, seq_len, d_model]；
将前面进程输出的张量，按照头的数量（n_head）拆分为 n_head 子张量，其尺度为 [batch_size, n_head, seq_len, d_model//n_head]；
每个子张量并行核算留意力分数，即履行 dot-product attention 层，输出张量尺度为 [batch_size, n_head, seq_len, d_model//n_head]；
将这些子张量进行拼接 concat ，并经过线性变换得到终究的输出张量，尺度为 [batch_size, seq_len, d_model]。

总结：由于 GPU 的并行核算特性，进程2中的张量拆分和进程4中的张量拼接，其实都是经过 review 算子来完结的。一起，也能发现SA 和 MHA 模块的输入输出矩阵维度都是相同的。

3.4，Multi-Head Attention 完结

Multi-Head Attention 层的输入同样也是三个张量：查询（Query）、键（Key）和值（Value），其核算进程用数学公式可表达为:

MultiHead(Q,K,V)=Concat(head1,…,headh)WOwhereheadi=Attention(QWiQ,KWiK,VWiV)\text{MultiHead(Q, K, V )} = \text{Concat}(\text{head}_{1}, …, \text{head}_{\text{h}})W^O \\ \text{where head}_{\text{i}} = \text{Attention}(QW_i^Q , KW_i^K , VW_i^V )

一般用 d_model 表明输入嵌入向量的维度， n_head 表明分割成多少个头，因而，d_model//n_head 天然表明每个头的输入和输出维度，在论文中 d_model = 512，n_head = 8，d_model//n_head = 64。值得留意的是，由于每个头的维数削减，总核算成本与具有全维的单头留意力是相似的。

Multi-Head Attention 层的 Pytorch 完结代码如下所示：

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention Layer
    Args:
        d_model: Dimensions of the input embedding vector, equal to input and output dimensions of each head
        n_head: number of heads, which is also the number of parallel attention layers
    """
    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)  # Q 线性变换层
        self.w_k = nn.Linear(d_model, d_model)  # K 线性变换层
        self.w_v = nn.Linear(d_model, d_model)  # V 线性变换层
        self.fc = nn.Linear(d_model, d_model)   # 输出线性变换层
    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v) # size is [batch_size, seq_len, d_model]
        # 2, split by number of heads(n_head) # size is [batch_size, n_head, seq_len, d_model//n_head]
        q, k, v = self.split(q), self.split(k), self.split(v)
        # 3, compute attention
        sa_output, attn_weights = self.attention(q, k, v, mask)
        # 4, concat attention and linear transformation
        concat_tensor = self.concat(sa_output)
        mha_output = self.fc(concat_tensor)
        return mha_output
    def split(self, tensor):
        """
        split tensor by number of head(n_head)
        :param tensor: [batch_size, seq_len, d_model]
        :return: [batch_size, n_head, seq_len, d_model//n_head], 输出矩阵是四维的，第二个维度是 head 维度
        # 将 Q、K、V 经过 reshape 函数拆分为 n_head 个头
        batch_size, seq_len, _ = q.shape
        q = q.reshape(batch_size, seq_len, n_head, d_model // n_head)
        k = k.reshape(batch_size, seq_len, n_head, d_model // n_head)
        v = v.reshape(batch_size, seq_len, n_head, d_model // n_head)
        """
        batch_size, seq_len, d_model = tensor.size()
        d_tensor = d_model // self.n_head
        split_tensor = tensor.view(batch_size, seq_len, self.n_head, d_tensor).transpose(1, 2)
        # it is similar with group convolution (split by number of heads)
        return split_tensor
    def concat(self, sa_output):
        """ merge multiple heads back together
        Args:
            sa_output: [batch_size, n_head, seq_len, d_tensor]
            return: [batch_size, seq_len, d_model]
        """
        batch_size, n_head, seq_len, d_tensor = sa_output.size()
        d_model = n_head * d_tensor
        concat_tensor = sa_output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return concat_tensor

四，Encoder 结构

Encoder 结构由 $N=6\text{N} = 6$ 个相同的 encoder block 堆叠而成，每一层（ layer）主要有两个子层（sub-layers）: 第一个子层是多头留意力机制（Multi-Head Attention），第二个是简单的方位全衔接前馈网络（Positionwise Feed Forward）。

上图赤色框框出的部分是 Encoder block，很明显其是 Multi-Head Attention、Add&Norm、Feed Forward、Add & Norm 层组成的。别的在论文中 Encoder 组件由 $N=6\text{N} = 6$ 个相同的 encoder block 堆叠而成，且 encoder block 输入矩阵和输出矩阵维度是相同的。

4.1，Add & Norm

Add & Norm 层由 Add 和 Norm 两部分组成。这儿的 Add 指 X + MultiHeadAttention(X)，是一种残差衔接。Norm 是 Layer Normalization。Add & Norm 层核算进程用数学公式可表达为:

LayerNorm(X+MultiHeadAttention(X))\text{Layer Norm}(X + \text{MultiHeadAttention}(X))

Add 比较简单，这儿要点讲下 Layer Norm 层。Layer Norm 是一种常用的神经网络归一化技能，能够使得模型练习愈加稳定，收敛更快。它的主要作用是对每个样本在特征维度上进行归一化，削减了不同特征之间的依靠联系，进步了模型的泛化才能。Layer Norm 层的核算可视化如下图所示:

Layer Norm 层的 Pytorch 完结代码如下所示:

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-12):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
    def forward(self, x):
        mean = x.mean(-1, keepdim=True) # '-1' means last dimension. 
        var = x.var(-1, keepdim=True)
        out = (x - mean) / torch.sqrt(var + self.eps)
        out = self.gamma * out + self.beta
        return out
# NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
# 1，Activate nn.LayerNorm module
layer_norm1 = nn.LayerNorm(embedding_dim)
pytorch_ln_out = layer_norm1(embedding)
# 2，Activate my nn.LayerNorm module
layer_norm2 = LayerNorm(embedding_dim)
my_ln_out = layer_norm2(embedding)
# 比较结果
print(torch.allclose(pytorch_ln_out, my_ln_out, rtol=0.1,atol=0.01))  # 输出 True

4.2，Feed Forward

Feed Forward 层全称是 Position-wise Feed-Forward Networks，其本质是一个两层的全衔接层，第一层的激活函数为 Relu，第二层不运用激活函数，核算进程用数学公式可表达为：

FFN(X)=max(0,XW1+b1)W2+b2\text{FFN}(X) = \text{max}(0, XW_1 + b_1 )W_2 + b_2

除了运用两个全衔接层来完结线性变换，别的一种方式是运用 kernal_size = 1 的两个 $111\times 1$ 卷积层，输入输出维度不变，都是 512，中心维度是 2048。

PositionwiseFeedForward 层的 Pytorch 完结代码如下所示:

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_diff, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_diff)
        self.fc2 = nn.Linear(d_diff, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(drop_prob)
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

4.3，Encoder 结构的完结

根据前面 Multi-Head Attention, Feed Forward, Add & Norm 的内容咱们能够完好的完结 Encoder 结构。

Encoder 组件的 Pytorch 完结代码如下所示:

class EncoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, n_head, drop_prob=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, n_head)
        self.ffn = PositionwiseFeedForward(d_model, ffn_hidden)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(drop_prob)
        self.dropout2 = nn.Dropout(drop_prob)
    def forward(self, x, mask=None):
        x_residual1 = x
        # 1, compute multi-head attention
        x = self.mha(q=x, k=x, v=x, mask=mask)
        # 2, add residual connection and apply layer norm
        x = self.ln1( x_residual1 + self.dropout1(x) )
        x_residual2 = x
        # 3, compute position-wise feed forward
        x = self.ffn(x)
        # 4, add residual connection and apply layer norm
        x = self.ln2( x_residual2 + self.dropout2(x) )
        return x
class Encoder(nn.Module):
    def __init__(self, enc_voc_size, seq_len, d_model, ffn_hidden, n_head, n_layers, drop_prob=0.1, device='cpu'):
        super().__init__()
        self.emb = TransformerEmbedding(vocab_size = enc_voc_size,
                                        max_len = seq_len,
                                        d_model = d_model,
                                        drop_prob = drop_prob,
                                        device=device)
        self.layers = nn.ModuleList([EncoderLayer(d_model, ffn_hidden, n_head, drop_prob) 
                                     for _ in range(n_layers)])
    def forward(self, x, mask=None):
        x = self.emb(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

五，Decoder 结构

上图右边红框框出来的是 Decoder block，Decoder 组件也是由 $N=6\text{N} = 6$ 个相同的 Decoder block 堆叠而成。Decoder block 与 Encoder block 相似，可是存在一些区别：

包括两个 Multi-Head Attention 层。
第一个 Multi-Head Attention 层选用了 Masked 操作。
第二个 Multi-Head Attention 层的 K, V 矩阵运用 Encoder 的编码信息矩阵 C 进行核算，而 Q 运用上一个 Decoder block 的输出核算。这样做的好处是在 Decoder 的时分，每一位单词都能够运用到 Encoder 所有单词的信息 (这些信息无需 Mask)

留意，解码器块中的第一个留意力层关联到解码器的所有（过去的）输入，可是第二留意力层运用编码器的输出。因而，它能够拜访整个输入语句，以最好地猜测当时单词。这是非常有用的，由于不同的言语能够有语法规则将单词按不同的顺序排列，或者语句后面提供的一些上下文可能有助于确定给定单词的最佳翻译。

别的，Decoder 组件后面还会接一个全衔接层和 Softmax 层核算下一个翻译单词的概率。

Decoder 组件的代码完结如下所示:

class DecoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(DecoderLayer, self).__init__()
        self.mha1 = MultiHeadAttention(d_model, n_head)
        self.ln1 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)
        self.mha2 = MultiHeadAttention(d_model, n_head)
        self.ln2 = LayerNorm(d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)
        self.ffn = PositionwiseFeedForward(d_model, ffn_hidden)
        self.ln3 = LayerNorm(d_model)
        self.dropout3 = nn.Dropout(p=drop_prob)
    def forward(self, dec_out, enc_out, trg_mask, src_mask):
        x_residual1 = dec_out
        # 1, compute multi-head attention
        x = self.mha1(q=dec_out, k=dec_out, v=dec_out, mask=trg_mask)
        # 2, add residual connection and apply layer norm
        x = self.ln1( x_residual1 + self.dropout1(x) )
        if enc_out is not None:
            # 3, compute encoder - decoder attention
            x_residual2 = x
            x = self.mha2(q=x, k=enc_out, v=enc_out, mask=src_mask)
            # 4, add residual connection and apply layer norm
            x = self.ln2( x_residual2 + self.dropout2(x) )
        # 5. positionwise feed forward network
        x_residual3 = x
        x = self.ffn(x)
        # 6, add residual connection and apply layer norm
        x = self.ln3( x_residual3 + self.dropout3(x) )
        return x
class Decoder(nn.Module):
    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        drop_prob=drop_prob,
                                        max_len=max_len,
                                        vocab_size=dec_voc_size,
                                        device=device)
        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])
        self.linear = nn.Linear(d_model, dec_voc_size)
    def forward(self, trg, src, trg_mask, src_mask):
        trg = self.emb(trg)
        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)
        # pass to LM head
        output = self.linear(trg)
        return output

六，Transformer 总结

Transformer 与 RNN 不同，能够比较好地并行练习。
Transformer 自身是不能运用单词的顺序信息的，因而需要在输入中添加方位 Embedding，否则 Transformer 便是一个词袋模型了。
Transformer 的要点是 Self-Attention 结构，其间用到的 Q, K, V矩阵经过输出进行线性变换得到。
Transformer 中 Multi-Head Attention 中有多个 Self-Attention，能够捕获单词之间多种维度上的相联系数 attention score。

6.1，Transformer 完好代码完结

根据前面完结的 Encoder 和 Decoder 组件，咱们能够完结 Transformer 模型的完好代码，如下所示:

import torch
from torch import nn
from models.model.decoder import Decoder
from models.model.encoder import Encoder
class Transformer(nn.Module):
    def __init__(self, src_pad_idx, trg_pad_idx, trg_sos_idx, enc_voc_size, dec_voc_size, d_model, n_head, max_len,
                 ffn_hidden, n_layers, drop_prob, device):
        super().__init__()
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.trg_sos_idx = trg_sos_idx
        self.device = device
        self.encoder = Encoder(d_model=d_model,
                               n_head=n_head,
                               max_len=max_len,
                               ffn_hidden=ffn_hidden,
                               enc_voc_size=enc_voc_size,
                               drop_prob=drop_prob,
                               n_layers=n_layers,
                               device=device)
        self.decoder = Decoder(d_model=d_model,
                               n_head=n_head,
                               max_len=max_len,
                               ffn_hidden=ffn_hidden,
                               dec_voc_size=dec_voc_size,
                               drop_prob=drop_prob,
                               n_layers=n_layers,
                               device=device)
    def forward(self, src, trg):
        src_mask = self.make_pad_mask(src, src, self.src_pad_idx, self.src_pad_idx)
        src_trg_mask = self.make_pad_mask(trg, src, self.trg_pad_idx, self.src_pad_idx)
        trg_mask = self.make_pad_mask(trg, trg, self.trg_pad_idx, self.trg_pad_idx) * \
                   self.make_no_peak_mask(trg, trg)
        enc_src = self.encoder(src, src_mask)
        output = self.decoder(trg, enc_src, trg_mask, src_trg_mask)
        return output
    def make_pad_mask(self, q, k, q_pad_idx, k_pad_idx):
        len_q, len_k = q.size(1), k.size(1)
        # batch_size x 1 x 1 x len_k
        k = k.ne(k_pad_idx).unsqueeze(1).unsqueeze(2)
        # batch_size x 1 x len_q x len_k
        k = k.repeat(1, 1, len_q, 1)
        # batch_size x 1 x len_q x 1
        q = q.ne(q_pad_idx).unsqueeze(1).unsqueeze(3)
        # batch_size x 1 x len_q x len_k
        q = q.repeat(1, 1, 1, len_k)
        mask = k & q
        return mask
    def make_no_peak_mask(self, q, k):
        len_q, len_k = q.size(1), k.size(1)
        # len_q x len_k
        mask = torch.tril(torch.ones(len_q, len_k)).type(torch.BoolTensor).to(self.device)
        return mask

参考资料

github.com/hyunwoongko…
The Illustrated Transformer
Learning Word Embedding
Transformer模型详解（图解最完好版）

“本文正在参与「金石计划」”

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。