
RobertoTCo/ViT-implementation


Evaluation of a ViT network implemented with PyTorch, based on lucidrains' vit-pytorch code.

The original ViT network is described in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The ViT implementation is tested on the MNIST dataset for handwritten digit recognition. After 5 epochs of training with batches of 128 images, the network achieves ~92% accuracy on the test set.
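As a rough illustration of the patch processing the paper's title refers to, the sketch below splits an MNIST-sized image into patches, linearly embeds them, prepends a class token, and adds a positional encoding. All names and sizes here are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of ViT patch embedding; names and sizes are assumptions,
# not taken from this repository's ViT.py.
class PatchEmbedding(nn.Module):
    def __init__(self, image_size=28, patch_size=7, channels=1, dim=64):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size * patch_size
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, dim)                  # linear patch projection
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))  # learnable class token
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # split into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                               # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend class token
        return x + self.pos_embed                      # add positional encoding

tokens = PatchEmbedding()(torch.randn(2, 1, 28, 28))
print(tokens.shape)  # torch.Size([2, 17, 64])
```

The class-token slot is what the final MLP head reads for classification; 16x16 patches in the paper become 7x7 here only because MNIST images are 28x28.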

  • The file ViT.py implements the ViT network and explains the different layers and processing stages of the images (splitting into patches, positional encoding and concatenation of the class token, Transformer module, and final MLP head for classification).
  • The file Transformer_nn.py implements the Transformer-Encoder block architecture used in ViT. It consists of stacked blocks of multi-head attention + MLP layers, as described in the paper. The multi-head attention layer implements Scaled Dot-Product Attention as described in the original 2017 Attention paper, without masking. The MLP layer applies a GELU activation.
  • The notebook Training with MNIST.ipynb shows a quick example of training the ViT on the MNIST dataset for 5 epochs, achieving 91.66% accuracy on the test data.
  • The notebook Training with MNIST features extraction.ipynb shows a quick example of using a resnet18 as a feature extractor in front of the ViT on the MNIST dataset, training for 5 epochs and achieving a test accuracy of 99.08%.
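The Transformer-Encoder block described above (pre-norm multi-head attention with unmasked scaled dot-product attention, followed by a GELU MLP, both with residual connections) can be sketched roughly as follows. This is a hedged approximation, not the repository's Transformer_nn.py verbatim; dimensions and layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of one Transformer-Encoder block: multi-head scaled dot-product
# attention (no mask) + GELU MLP, each with a residual connection.
class EncoderBlock(nn.Module):
    def __init__(self, dim=64, heads=4, mlp_dim=128):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5   # 1/sqrt(d_k) from the 2017 paper
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):  # x: (B, N, dim)
        B, N, D = x.shape
        h = self.heads
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # reshape to (B, heads, N, head_dim) for multi-head attention
        q, k, v = (t.reshape(B, N, h, D // h).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        x = x + self.out((attn @ v).transpose(1, 2).reshape(B, N, D))  # attention + residual
        return x + self.mlp(self.norm2(x))                             # MLP + residual

y = EncoderBlock()(torch.randn(2, 17, 64))
print(y.shape)  # torch.Size([2, 17, 64])
```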
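The training setup in the notebooks (5 epochs, batches of 128 MNIST images) has roughly the shape of the loop below. To keep the sketch self-contained, synthetic tensors stand in for the MNIST DataLoader and a tiny linear model stands in for the ViT; both are placeholders, not the notebooks' code.

```python
import torch
import torch.nn as nn

# Placeholder model: a real run would build the ViT from ViT.py instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                         # 5 epochs, as in the notebooks
    for _ in range(3):                         # a real run iterates the MNIST DataLoader
        images = torch.randn(128, 1, 28, 28)   # synthetic stand-in for a batch of 128
        labels = torch.randint(0, 10, (128,))
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: loss {loss.item():.3f}")
```

For the feature-extraction notebook, the placeholder model would instead be a resnet18 backbone feeding the ViT.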
