All models load their pre-trained weights from Hugging Face checkpoints. Please note that data paths have been anonymised.
For the implementation of the LLaVA model, refer to https://github.com/haotian-liu/LLaVA
For the implementation of the Multimodal CoT model, refer to https://github.com/amazon-science/mm-cot