This post uses a Transformer to build a simple copy task (a "parrot" that repeats its input). The code comes from harvardnlp's annotated-transformer.

PyTorch version: 1.6.0

Import third-party packages

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

The code is written top-down, from the overall structure to the small components, starting with the full encoder-decoder architecture.

1 Encoder-Decoder architecture

Encoding: the encoder takes the input sequence src and the input mask src_mask and encodes them.

Decoding: the decoder decodes based on the encoder output memory, the source-side mask src_mask, the decoder input sequence tgt, and the target-side mask tgt_mask.

class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask,
                           tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

The encoder and decoder in the code are the Transformer encoder and decoder we are about to build; src_embed is the source-side embedding and tgt_embed is the target-side embedding. In addition, a generator is defined: it projects the decoder output vectors onto the vocabulary dimension and applies a log softmax, and can be regarded as part of the decoder.

class Generator(nn.Module):
    """A single linear layer followed by log softmax."""
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

2 Encoder implementation

The encoder consists of six layers with identical structure (identical structure does not mean shared parameters). First, implement a clone helper.

def clones(module, N):
    "Produce N identical layers (deep copies); N is the number of layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

Define the encoder structure

Each layer takes x and mask and returns an updated x; a final LayerNorm is applied at the end.

class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass x and mask through each layer in turn, then apply a final LayerNorm."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

Layer Normalization

arxiv.org/abs/1607.06…

class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

Connecting the sub-layers

As mentioned above, the encoder consists of six identical blocks, and each block in turn contains two sub-layers (multi-head self-attention and an FFN). They are connected with residual connections, i.e. LayerNorm(x + Sublayer(x)).

Note that every sub-layer and every layer keeps the same output dimension: $d_{\text{model}} = 512$.

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Residual connection; for code simplicity the norm is applied first (pre-norm) rather than last."
        return x + self.dropout(sublayer(self.norm(x)))

Define EncoderLayer

EncoderLayer is each layer inside the Encoder class above. It takes the input x and the mask, passes them through the multi-head self-attention layer wrapped in a SublayerConnection to get an updated x, then passes x through the FFN layer wrapped in a second SublayerConnection, and outputs the result.

class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

The two sub-layers themselves are implemented in detail later; next, build the decoder following the same pattern.

3 Decoder implementation

Like the encoder, the decoder also consists of six identical layers.

Define the decoder structure

The decoder likewise passes the input through six layers in turn, followed by a final LayerNorm.

The difference from the encoder lies in the inputs: the encoder takes x and mask, while each decoder layer takes four arguments instead of two: the encoder memory, the source-side mask, the target-side mask, and the decoder input.

class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

Define DecoderLayer

EncoderLayer has only two sub-layers (multi-head self-attention and the feed-forward network); DecoderLayer has three: besides the decoder's own masked multi-head self-attention and the feed-forward network, there is an extra multi-head attention over the encoder output, similar to the attention mechanism of a traditional seq2seq model.

class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn  # self-attention over the decoder input
        self.src_attn = src_attn    # attention over the encoder output
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In DecoderLayer, q, k and v are all the decoder input and attention is computed under the target-side mask, followed by a SublayerConnection; the resulting x then serves as the query while the encoder memory serves as key and value for the second attention, followed by another SublayerConnection; finally the feed-forward network and a third SublayerConnection. A shape walkthrough is sketched below.
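A minimal shape walkthrough of a single DecoderLayer with made-up small sizes (d_model=16, 2 heads, target length 4, source length 6). The attention and feed-forward modules it uses are implemented later in this post, and the lower-triangular stand-in below plays the role of the proper target mask built in the next subsection:

layer = DecoderLayer(16, MultiHeadedAttention(2, 16), MultiHeadedAttention(2, 16),
                     PositionwiseFeedForward(16, 32), dropout=0.1)
x = torch.randn(1, 4, 16)                    # decoder input: (batch, tgt_len, d_model)
memory = torch.randn(1, 6, 16)               # encoder output: (batch, src_len, d_model)
src_mask = torch.ones(1, 1, 6)               # no source padding in this toy case
tgt_mask = torch.tril(torch.ones(1, 4, 4))   # stand-in for subsequent_mask(4)
print(layer(x, memory, src_mask, tgt_mask).shape)  # torch.Size([1, 4, 16])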

The decoder-side mask

Note that the decoder-side mask differs from the encoder-side mask. Both mask the pad positions within a batch, but the decoder-side mask must additionally prevent "looking ahead": unlike an RNN, where each step explicitly depends only on what was passed forward, the Transformer uses self-attention, so at step t a position must only see the inputs at time t and earlier. Seeing later inputs would mean predicting something already known, which is cheating and meaningless.

Therefore we need to construct a triangular matrix.

Use np.triu to construct the upper triangle.

See: /post/693125…

def subsequent_mask(size):
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0

An example:

print(subsequent_mask(5))
plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])
None

Output

tensor([[[ True, False, False, False, False],
         [ True,  True, False, False, False],
         [ True,  True,  True, False, False],
         [ True,  True,  True,  True, False],
         [ True,  True,  True,  True,  True]]])

The plot looks like this:

[Figure: visualization of subsequent_mask(20)]

4 Multi-head self-attention

Set the multi-head part aside for a moment and look at plain self-attention first.

Self-attention

As with traditional attention, the query and the key produce the weights, which are then used to compute a weighted sum over the value. The formula is:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The special part is the denominator $\sqrt{d_k}$ when computing the weights. The reason: when $d_k$ is small, scaling or not may make little difference, but when $d_k$ is large the dot products between query and key become large, which can push the softmax into regions with extremely small gradients; dividing by $\sqrt{d_k}$ keeps the scores at zero mean and unit variance.

The issue with dot-product attention is that the dot products go through a softmax, whose components interact with one another (unlike tanh-style functions where each component is computed independently). The consequence: the higher the vector dimension, the larger the range of the dot products, and the more likely one value is much larger than the rest, so the softmax output approaches one-hot (compare softmax(np.random.random(10)) with softmax(100 * np.random.random(10)); in the latter the probability mass clearly concentrates on a single dimension). During backpropagation most entries of the softmax Jacobian are then close to zero, so gradients barely flow.
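A minimal sketch of the comparison suggested above (the np_softmax helper is an assumption, defined inline since no softmax is imported in this post; it uses the numpy imported at the top):

def np_softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

v = np.random.random(10)
print(np_softmax(v))        # fairly flat: every entry gets noticeable mass
print(np_softmax(100 * v))  # close to one-hot: mass concentrates on the largest entry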

def attention(query, key, value, mask=None, dropout=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

In the implementation, when a mask is given, the positions where the mask equals 0 are filled with a large negative number in the scores matrix, e.g. -1e9, so that $e^{-10^9}$ is essentially 0 and attention effectively ignores those positions.
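A tiny illustration of this trick with toy numbers (not part of the original code):

toy_scores = torch.tensor([[1.0, 2.0, 3.0]])
toy_mask = torch.tensor([[1, 1, 0]])                  # the last position is padding
masked = toy_scores.masked_fill(toy_mask == 0, -1e9)  # large negative before softmax
print(F.softmax(masked, dim=-1))                      # roughly [0.27, 0.73, 0.00]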

Multi-head

The Transformer uses multi-head self-attention so that the model can compute attention in different representation subspaces. The formula is:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$

where $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.

Here $d_{\text{model}} = 512$ and $h = 8$, i.e. 8 heads, so $d_k = d_v = d_{\text{model}}/h = 64$.

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # the four linears correspond to the four W matrices
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        # keep attn around so the attention can be visualized later
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        # 1) apply the first three linears to q, k, v and reshape d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # 2) attention over q, k, v; self.attn is the post-softmax distribution
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) concat the heads and apply the final linear
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

The multi-head code is easy to follow: the heads are computed in parallel using the concatenated large matrices; see the comments above for the details.
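A quick shape check using the classes above, with made-up sizes (batch 2, sequence length 5):

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 5, 512)     # (batch, seq_len, d_model)
out = mha(x, x, x)             # self-attention: query, key and value are all x
print(out.shape)               # torch.Size([2, 5, 512])
print(mha.attn.shape)          # torch.Size([2, 8, 5, 5]): one 5x5 attention map per head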

5 FFN

The FFN sub-layer consists of two linear layers and a ReLU; the formula is:

$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

The inner-layer dimension is $d_{ff} = 2048$.

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

6 Embedding layers

Token embeddings

Note here that the embedding weights are also multiplied by $\sqrt{d_{\text{model}}}$.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        # embedding matrix of shape: vocab size x d_model
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

Positional Encoding

Because the Transformer has no recurrence, plain self-attention cannot distinguish order. To inject position information we need a positional encoding vector with the same dimension as the input embedding, so the two can simply be added.

The Transformer's positional encoding is built from sin and cos. A series of later papers no longer use this design, and learned or random encodings work about as well. The formulas are:

$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$
$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$

Here pos is the position and i is the dimension index.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        return self.dropout(x)

plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None

[Figure: positional encoding curves for dimensions 4-7]

7 Building the full model

All of the internal modules have been implemented above; all that remains is to plug the sub-modules into the EncoderDecoder class.

def make_model(src_vocab, tgt_vocab, N=6,
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn),
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform(p)
    return model

8 Training details

Constructing batches

class Batch:
    def __init__(self, src, trg=None, pad=0):
        # encoder input, shape (batch size, seq_len)
        self.src = src
        # shape (batch size, 1, seq_len)
        # boolean matrix: False wherever the token equals pad, used to mask out padding later
        self.src_mask = (src != pad).unsqueeze(-2)
        # trg is provided during training
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = self.make_std_mask(self.trg, pad)
            # number of non-pad tokens in the target; the loss is divided by ntokens later
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        # as above, shape (batch size, 1, seq_len-1)
        tgt_mask = (tgt != pad).unsqueeze(-2)
        # combine the pad mask with the "no peeking ahead" mask; this is a broadcast
        # resulting shape (batch size, seq_len-1, seq_len-1)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask

Given the source sequence, the target sequence, and the pad index, the Batch class constructs the encoder-side and decoder-side mask matrices, and, following the usual seq2seq convention, builds the decoder input and output sequences by shifting the target.
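A small sanity check with a toy batch (made-up token ids; 0 is the pad index):

toy_src = torch.LongTensor([[1, 3, 4, 0, 0],
                            [1, 2, 5, 6, 7]])
toy_batch = Batch(toy_src, toy_src, pad=0)
print(toy_batch.src_mask.shape)   # torch.Size([2, 1, 5])
print(toy_batch.trg.shape)        # torch.Size([2, 4]): decoder input, last token dropped
print(toy_batch.trg_y.shape)      # torch.Size([2, 4]): gold output, first token dropped
print(toy_batch.trg_mask.shape)   # torch.Size([2, 4, 4]): pad mask & subsequent mask
print(toy_batch.ntokens)          # tensor(6): non-pad tokens in trg_y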

Training loop

def run_epoch(data_iter, model, loss_compute):
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg,
                            batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens

The only thing to note is that the reported loss is per token (normalized by batch.ntokens), so the pad positions do not contribute.

Optimizer

  1. Adam, with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$
  2. Learning rate with warmup, given by:

$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$

Here $warmup\_steps = 4000$.

This is a piecewise function: while step_num is smaller than warmup_steps, $lr = d_{\text{model}}^{-0.5} \cdot step\_num \cdot warmup\_steps^{-1.5}$, which is linear in step_num; afterwards the rate decays as the negative power $step\_num^{-0.5}$, fast at first and then more slowly.
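A quick numeric sketch of the schedule (d_model=512, warmup_steps=4000, factor 1), before the class-based implementation below; the lrate helper is an assumption written directly from the formula:

def lrate(step, d_model=512, warmup=4000):
    # the formula above; same expression as NoamOpt.rate below (with factor 1)
    return d_model ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))

print(lrate(100))     # warmup phase: grows linearly with step
print(lrate(4000))    # the peak, reached exactly at step == warmup_steps
print(lrate(20000))   # decay phase: shrinks like step ** -0.5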

The implementation:

class NoamOpt:
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update the parameters and the learning rate."
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
             min(step ** (-0.5), step * self.warmup ** (-1.5)))

def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model,
                   2,
                   4000,
                   torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

Plot the learning-rate curves:

opts = [NoamOpt(512, 1, 4000, None),
        NoamOpt(512, 1, 8000, None),
        NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None

[Figure: learning-rate curves for the three schedules]

Regularization-Label Smoothing

Label smoothing penalizes over-confident predictions: the probability 1 at the ground-truth position of the one-hot target is reduced, and the remainder is distributed evenly over the other classes. For example, with three classes, $y = (0, 1, 0)$ becomes $y = (0.1, 0.8, 0.1)$ after label smoothing.

The implementation:

class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing  # probability kept for the true class
        self.smoothing = smoothing  # probability mass spread over the other classes
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size  # vocabulary size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))  # spread the smoothing mass evenly
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)  # put confidence at the true class
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))

Why fill_ divides by size - 2: suppose the vocabulary has three tokens A, B, C. Because the model needs padding, the vocabulary becomes A, B, C, <PAD>. If the target label is A, the one-hot vector is (1, 0, 0, 0). With smoothing = 0.2 and confidence = 0.8, A keeps probability 0.8 and only B and C share the smoothing mass; <PAD> receives none of it and stays at 0, because we never want to predict <PAD>.

scatter_ fills the confidence value into the corresponding positions.

Next, in the smoothed target matrix, the whole <PAD> column is set to 0.

Finally, handle the case where the target itself is <PAD>. Within a batch, sequences are padded to the maximum length max_len; if max_len is 5, the output sequence B B A C becomes B B A C <PAD>. The predicted distribution at a position whose target is <PAD> is meaningless and should not count toward the loss, so the whole true-distribution row for that position is set to 0.

A simple example:

# vocabulary size 5 (including PAD)
# PAD is the first entry of the vocabulary
# total probability mass spread out: 0.4
crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0],
                             [0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
         Variable(torch.LongTensor([2, 1, 0])))
# tensor([[0.0000, 0.1333, 0.6000, 0.1333, 0.1333],
#         [0.0000, 0.6000, 0.1333, 0.1333, 0.1333],
#         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
print(crit.true_dist)
plt.imshow(crit.true_dist)
None

[Figure: heatmap of crit.true_dist]

Next, look at how the loss changes as the predicted probability distribution becomes more and more concentrated (confident):

crit = LabelSmoothing(5, 0, 0.1)
def loss(x):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])
    return crit(Variable(predict.log()),
                Variable(torch.LongTensor([1]))).data.item()
plt.plot(np.arange(1, 100), [loss(x) for x in range(1, 100)])
None

[Figure: loss as a function of prediction confidence under label smoothing]

With a conventional one-hot target the curve would be monotonically decreasing, but with label smoothing an over-confident prediction actually makes the loss rise slightly.

Loss computation

class SimpleLossCompute:
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)  # project to the vocabulary dimension and take log softmax
        # norm is the number of valid (non-pad) tokens in the batch
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                              y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data.item() * norm

9 A small experiment: the copy task

Generate synthetic data

def data_gen(V, batch, nbatches):
    for i in range(nbatches):
        data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
        data[:, 0] = 1
        src = Variable(data, requires_grad=False).long()
        tgt = Variable(data, requires_grad=False).long()
        yield Batch(src, tgt, 0)

The vocabulary size is 11: tokens 1 through 10 are ordinary tokens and 0 is the pad token.

Training

V = 11
criterion = LabelSmoothing(size=V,
                           padding_idx=0,
                           smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model,
                    1,
                    400,
                    torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
for epoch in range(10):
    model.train()
    run_epoch(data_gen(V, 30, 20),
              model,
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    test_loss = run_epoch(data_gen(V, 30, 5),
                          model,
                          SimpleLossCompute(model.generator, criterion, None))
    print("test_loss", test_loss)

Run log

Epoch Step: 1 Loss: 2.949874 Tokens per Sec: 557.973450
Epoch Step: 1 Loss: 1.857541 Tokens per Sec: 823.928162
test_loss tensor(1.8417)
Epoch Step: 1 Loss: 2.048431 Tokens per Sec: 596.984863
Epoch Step: 1 Loss: 1.577389 Tokens per Sec: 861.355225
test_loss tensor(1.6092)
Epoch Step: 1 Loss: 1.865752 Tokens per Sec: 594.148132
Epoch Step: 1 Loss: 1.395658 Tokens per Sec: 942.581787
test_loss tensor(1.3495)
Epoch Step: 1 Loss: 2.041692 Tokens per Sec: 608.372864
Epoch Step: 1 Loss: 1.183396 Tokens per Sec: 944.264526
test_loss tensor(1.1790)
Epoch Step: 1 Loss: 1.291280 Tokens per Sec: 667.504517
Epoch Step: 1 Loss: 0.924788 Tokens per Sec: 906.874023
test_loss tensor(0.9144)
Epoch Step: 1 Loss: 1.222422 Tokens per Sec: 663.749023
Epoch Step: 1 Loss: 0.733476 Tokens per Sec: 1043.809326
test_loss tensor(0.7075)
Epoch Step: 1 Loss: 0.829088 Tokens per Sec: 663.332275
Epoch Step: 1 Loss: 0.296809 Tokens per Sec: 1100.190186
test_loss tensor(0.3417)
Epoch Step: 1 Loss: 1.048580 Tokens per Sec: 638.724670
Epoch Step: 1 Loss: 0.277764 Tokens per Sec: 970.994873
test_loss tensor(0.2576)
Epoch Step: 1 Loss: 0.393721 Tokens per Sec: 494.906158
Epoch Step: 1 Loss: 0.385875 Tokens per Sec: 690.867737
test_loss tensor(0.3720)
Epoch Step: 1 Loss: 0.544152 Tokens per Sec: 441.701752
Epoch Step: 1 Loss: 0.238676 Tokens per Sec: 965.472900
test_loss tensor(0.2562)

Greedy decoding

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(memory, src_mask,
                           Variable(ys),
                           Variable(subsequent_mask(ys.size(1)).type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
    return ys

model.eval()
src = Variable(torch.LongTensor([[1, 3, 2, 2, 4, 6, 7, 9, 10, 8]]))
src_mask = Variable(torch.ones(1, 1, 10))
print(greedy_decode(model, src, src_mask, max_len=10, start_symbol=1))

Generated output:

tensor([[ 1,  3,  2,  2,  4,  6,  7,  9, 10,  8]])

Attention visualization

def draw(data, x, y, ax):
    seaborn.heatmap(data,
                    xticklabels=x, square=True, yticklabels=y, vmin=0.0, vmax=1.0,
                    cbar=False, ax=ax)

sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Encoder Layer", layer + 1)
    for h in range(4):
        draw(model.encoder.layers[layer].self_attn.attn[0, h].data,
             sent, sent if h == 0 else [], ax=axs[h])
    plt.show()

Encoder Layer 1

[Figure: encoder layer 1 self-attention heatmaps, heads 1-4]

Encoder Layer 2

[Figure: encoder layer 2 self-attention heatmaps, heads 1-4]

tgt_sent = [1, 3, 2, 2, 4, 6, 7, 9, 10, 8]
for layer in range(2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Decoder Self Layer", layer + 1)
    for h in range(4):
        draw(model.decoder.layers[layer].self_attn.attn[0, h].data[:len(tgt_sent), :len(tgt_sent)],
             tgt_sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()
    print("Decoder Src Layer", layer + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(model.decoder.layers[layer].src_attn.attn[0, h].data[:len(tgt_sent), :len(sent)],
             sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()

Decoder Self Layer 1

[Figure: decoder layer 1 self-attention heatmaps, heads 1-4]

Decoder Src Layer 1

[Figure: decoder layer 1 source-attention heatmaps, heads 1-4]

Decoder Self Layer 2

[Figure: decoder layer 2 self-attention heatmaps, heads 1-4]

Decoder Src Layer 2

[Figure: decoder layer 2 source-attention heatmaps, heads 1-4]

References

github.com/harvardnlp/…