Demo Video (testing the app on an iPhone 13 Pro)
This project implements an image captioning model that uses a Vision Transformer (ViT) as the encoder and a Transformer decoder as the decoder, trained on the COCO2017 dataset. The trained models are then deployed to an iOS app built with SwiftUI. You can test the app by cloning this repository, or try out the model directly with the uploaded weights.
The image is first passed through a ViT_b_32 encoder with its classification head removed, which produces a feature tensor of shape (N, 768). This feature is then reshaped to (N, 32, 768) to be compatible with the CrossMultiHeadAttention block in the decoder. The decoder itself has 44.3 million parameters: it takes token ids of shape (N, max_length=32) and outputs logits of shape (N, max_length=32, vocab_size). A BERT tokenizer is used for text preprocessing.
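The sketch below, assuming torchvision's vit_b_32 and the Hugging Face BertTokenizer, shows how these shapes fit together. The way the (N, 768) feature is expanded into a (N, 32, 768) cross-attention memory (here a simple repeat along a sequence dimension) is an assumption for illustration, not necessarily the repository's exact implementation.

```python
import torch
import torchvision
from transformers import BertTokenizer

max_length = 32

# ViT-B/32 backbone with the classification head removed -> (N, 768) features.
vit = torchvision.models.vit_b_32(weights="IMAGENET1K_V1")
vit.heads = torch.nn.Identity()
vit.eval()

image = torch.randn(1, 3, 224, 224)                        # dummy preprocessed image batch
with torch.no_grad():
    features = vit(image)                                  # (N, 768)

# Assumption: the feature vector is tiled along a length-32 sequence dimension
# so the decoder's cross-attention receives a (N, 32, 768) memory tensor.
memory = features.unsqueeze(1).expand(-1, max_length, -1)  # (N, 32, 768)

# The BERT tokenizer turns a caption into (N, max_length) token ids for the decoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "a dog playing in a park",
    padding="max_length",
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)["input_ids"]                                             # (N, 32)

# The Transformer decoder (44.3M parameters, not shown here) maps
# (tokens, memory) -> logits of shape (N, 32, vocab_size).
```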
For more details about training the model, loading the trained weights, and running inference, please check out this notebook: Google Colab
For details on converting these models into CoreML models and testing them, please refer to the following resources:
Converting two PyTorch models into CoreML models
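As a rough, hedged sketch of what that conversion looks like with coremltools: the CaptionDecoder module, input names, and layer sizes below are hypothetical stand-ins that only mirror the I/O shapes described above, not the actual trained models.

```python
import numpy as np
import torch
import torch.nn as nn
import coremltools as ct

# Hypothetical stand-in with the same I/O shapes as the decoder described above.
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, max_length=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        x = self.decoder(self.embed(tokens), memory)   # (N, 32, 768)
        return self.out(x)                             # (N, 32, vocab_size)

decoder = CaptionDecoder().eval()
tokens = torch.zeros(1, 32, dtype=torch.int64)
memory = torch.randn(1, 32, 768)

# Trace to TorchScript, then convert the traced graph to an ML Program.
traced = torch.jit.trace(decoder, (tokens, memory))
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="tokens", shape=(1, 32), dtype=np.int32),
        ct.TensorType(name="image_features", shape=(1, 32, 768), dtype=np.float32),
    ],
    convert_to="mlprogram",
)
mlmodel.save("CaptionDecoder.mlpackage")
```

The ViT encoder can be converted in the same way (typically with a ct.ImageType input so the app can pass images directly), and the resulting .mlpackage files are what the SwiftUI app bundles.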
- Create a folder on your local machine and navigate to it using your terminal.
- Clone this repository using the following command:
git clone https://github.com/seungjun-green/PicScribe.git
- Navigate to the project directory:
cd PicScribe
- Open the Pic Scribe.xcodeproj file using Xcode.