
RobertoTCo/ViT-implementation


Evaluation of a ViT network implemented with PyTorch, based on lucidrains' vit-pytorch code.

The original ViT network is described in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The ViT implementation is tested on the MNIST dataset for handwritten digit recognition. After 5 epochs of training with batches of 128 images, the network achieves ~92% accuracy on the test set.
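As a rough illustration of the patch processing the paper's title refers to, the sketch below splits an MNIST-sized image into patches, linearly embeds them, prepends a class token, and adds a positional encoding. All names and sizes here are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of ViT patch embedding; names and sizes are assumptions,
# not taken from this repository's ViT.py.
class PatchEmbedding(nn.Module):
    def __init__(self, image_size=28, patch_size=7, channels=1, dim=64):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size * patch_size
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, dim)                  # linear patch projection
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))  # learnable class token
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # split into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                               # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend class token
        return x + self.pos_embed                      # add positional encoding

tokens = PatchEmbedding()(torch.randn(2, 1, 28, 28))
print(tokens.shape)  # torch.Size([2, 17, 64])
```

The class-token slot is what the final MLP head reads for classification; 16x16 patches in the paper become 7x7 here only because MNIST images are 28x28.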

  • The file ViT.py implements the ViT network and explains the different layers and processing stages of the images (splitting into patches, positional encoding and concatenation of the class token, Transformer module, and final MLP head for classification).
  • The file Transformer_nn.py implements the Transformer-Encoder block architecture used in ViT. It consists of stacked blocks of multi-head attention + MLP layers, as described in the paper. The multi-head attention layer implements Scaled Dot-Product Attention as described in the original 2017 Attention paper, without masking. The MLP layer applies a GELU activation.
  • The notebook Training with MNIST.ipynb shows a quick example of training the ViT on the MNIST dataset for 5 epochs, achieving 91.66% accuracy on the test data.
  • The notebook Training with MNIST features extraction.ipynb shows a quick example of using a resnet18 as a feature extractor in front of the ViT on the MNIST dataset, training for 5 epochs and achieving a test accuracy of 99.08%.
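The Transformer-Encoder block described above (pre-norm multi-head attention with unmasked scaled dot-product attention, followed by a GELU MLP, both with residual connections) can be sketched roughly as follows. This is a hedged approximation, not the repository's Transformer_nn.py verbatim; dimensions and layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one Transformer-Encoder block: multi-head scaled dot-product
# attention (no mask) + GELU MLP, each with a residual connection.
class EncoderBlock(nn.Module):
    def __init__(self, dim=64, heads=4, mlp_dim=128):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5   # 1/sqrt(d_k) from the 2017 paper
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):  # x: (B, N, dim)
        B, N, D = x.shape
        h = self.heads
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # reshape to (B, heads, N, head_dim) for multi-head attention
        q, k, v = (t.reshape(B, N, h, D // h).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        x = x + self.out((attn @ v).transpose(1, 2).reshape(B, N, D))  # attention + residual
        return x + self.mlp(self.norm2(x))                             # MLP + residual

y = EncoderBlock()(torch.randn(2, 17, 64))
print(y.shape)  # torch.Size([2, 17, 64])
```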
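The training setup in the notebooks (5 epochs, batches of 128 MNIST images) has roughly the shape of the loop below. To keep the sketch self-contained, synthetic tensors stand in for the MNIST DataLoader and a tiny linear model stands in for the ViT; both are placeholders, not the notebooks' code.

```python
import torch
import torch.nn as nn

# Placeholder model: a real run would build the ViT from ViT.py instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                         # 5 epochs, as in the notebooks
    for _ in range(3):                         # a real run iterates the MNIST DataLoader
        images = torch.randn(128, 1, 28, 28)   # synthetic stand-in for a batch of 128
        labels = torch.randint(0, 10, (128,))
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: loss {loss.item():.3f}")
```

For the feature-extraction notebook, the placeholder model would instead be a resnet18 backbone feeding the ViT.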
