In http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder, the paper text says

> the output of each sub-layer is LayerNorm(x+Sublayer(x))... We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.
but the code is
```python
return x + self.dropout(sublayer(self.norm(x)))
```
It seems it should be
```python
return self.norm(x + self.dropout(sublayer(x)))
```
instead.
In `Encoder` and `Decoder`, where does the extra `norm` on top of the stack come from?
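For reference, here is how I read the two variants side by side (a rough sketch using `nn.LayerNorm` instead of the post's custom `LayerNorm` class; the class names are mine, not the post's). The pre-norm form leaves the residual path unnormalized, so I'm guessing the extra `norm` at the top of the stack is there to normalize the final output:

```python
import torch.nn as nn

class PostNormResidual(nn.Module):
    """What the quoted paper text describes: LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize after the residual addition (post-norm).
        return self.norm(x + self.dropout(sublayer(x)))

class PreNormResidual(nn.Module):
    """What the posted code does: x + Dropout(Sublayer(LayerNorm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize only the sub-layer input; the residual stream is left
        # unnormalized, which appears to be why the stack ends with one more norm.
        return x + self.dropout(sublayer(self.norm(x)))
```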
> In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by sqrt(d_model).
It's described in http://nlp.seas.harvard.edu/2018/04/03/attention.html#additional-components-bpe-search-averaging, but it may be better to link to that section from the quoted part; I couldn't find it initially.
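For concreteness, this is what I understand that weight sharing to mean (a self-contained sketch; names like `ScaledEmbedding` and `proj` are mine, not taken from the post's code):

```python
import math
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Embedding lookup scaled by sqrt(d_model), as in the quoted text."""
    def __init__(self, vocab, d_model):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

d_model, vocab = 512, 10000
src_embed = ScaledEmbedding(vocab, d_model)
tgt_embed = ScaledEmbedding(vocab, d_model)
proj = nn.Linear(d_model, vocab, bias=False)  # pre-softmax linear transformation

# Share one weight matrix between both embedding layers and the projection.
tgt_embed.lut.weight = src_embed.lut.weight
proj.weight = src_embed.lut.weight
```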
Should http://nlp.seas.harvard.edu/2018/04/01/attention.html link to the updated version http://nlp.seas.harvard.edu/2018/04/03/attention.html?