In http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder, the paper text says

> the output of each sub-layer is LayerNorm(x+Sublayer(x))... We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.
but the code is
```python
return x + self.dropout(sublayer(self.norm(x)))
```
It seems it should be
```python
return self.norm(x + self.dropout(sublayer(x)))
```
instead.
In `Encoder` and `Decoder`, where does the extra `norm` on top of the stack come from?
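For reference, here is how I read the two variants side by side (a rough sketch using `nn.LayerNorm` instead of the post's custom `LayerNorm` class; the class names are mine, not the post's). The pre-norm form leaves the residual path unnormalized, so I'm guessing the extra `norm` at the top of the stack is there to normalize the final output:

```python
import torch.nn as nn

class PostNormResidual(nn.Module):
    """What the quoted paper text describes: LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize after the residual addition (post-norm).
        return self.norm(x + self.dropout(sublayer(x)))

class PreNormResidual(nn.Module):
    """What the posted code does: x + Dropout(Sublayer(LayerNorm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize only the sub-layer input; the residual stream is left
        # unnormalized, which appears to be why the stack ends with one more norm.
        return x + self.dropout(sublayer(self.norm(x)))
```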
> In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by sqrt(d_model).
It's described in http://nlp.seas.harvard.edu/2018/04/03/attention.html#additional-components-bpe-search-averaging, but it may be better to link to that section from the quoted part; I couldn't find it initially.
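For concreteness, this is what I understand that weight sharing to mean (a self-contained sketch; names like `ScaledEmbedding` and `proj` are mine, not taken from the post's code):

```python
import math
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Embedding lookup scaled by sqrt(d_model), as in the quoted text."""
    def __init__(self, vocab, d_model):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

d_model, vocab = 512, 10000
src_embed = ScaledEmbedding(vocab, d_model)
tgt_embed = ScaledEmbedding(vocab, d_model)
proj = nn.Linear(d_model, vocab, bias=False)  # pre-softmax linear transformation

# Share one weight matrix between both embedding layers and the projection.
tgt_embed.lut.weight = src_embed.lut.weight
proj.weight = src_embed.lut.weight
```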
Should http://nlp.seas.harvard.edu/2018/04/01/attention.html link to the updated version http://nlp.seas.harvard.edu/2018/04/03/attention.html?