
Awesome Deep Learning

Enhancements

pdf Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift 2015 @ Ioffe, Szegedy

Sequence Processing (NLP, Time Series)

pdf BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2019 @ Devlin et al

  • Transformer-based architecture.

pdf Attention Is All You Need 2017 @ Vaswani et al

  • Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
  • The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)). To facilitate these residual connections, all sub-layers in the model produce outputs of dimension d_model = 512 (see the sketch below).
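
A minimal sketch of one such encoder layer, assuming PyTorch (`torch.nn.MultiheadAttention`, `nn.LayerNorm`); the class and parameter names are illustrative and not taken from the paper's reference implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + feed-forward,
    each wrapped as LayerNorm(x + Sublayer(x)). Illustrative sketch only."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: multi-head self-attention
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        # Sub-layer 2: position-wise fully connected feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection around each sub-layer, then layer normalization:
        # output = LayerNorm(x + Sublayer(x))
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

# The encoder stacks N = 6 identical layers; every sub-layer outputs d_model = 512.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)      # (batch, sequence length, d_model)
print(encoder(x).shape)          # torch.Size([2, 10, 512])
```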