🔥 Omni large models and datasets for understanding and generating multi-modalities.
*The last four columns represent the input modalities supported by the model.

Title | Model | Checkpoint | Text | Image | Audio | Video |
---|---|---|---|---|---|---|
- | - | - | - | - | - | - |

*The last four columns represent the output modalities supported by the model.

Title | Model | Checkpoint | Text | Image | Audio | Video |
---|---|---|---|---|---|---|
- | - | - | - | - | - | - |
*The three modality columns represent the input & output modality combinations covered by the dataset.

Dataset Name | Paper | Link | Audio-Image-Text | Speech-Video-Text | Audio-Video-Text | Detail |
---|---|---|---|---|---|---|
OCTAV | OMCAT: Omni Context Aware Transformer | unreleased | ❌ | ❌ | ✅ | OCTAV-ST has 127,507 unique videos, each with a single QA pair; OCTAV-MT has 25,457 unique videos with a total of 180,916 QA pairs. |
VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | ❌ | ❌ | ✅ | 27M clips; 297M captions. |
VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Link | ❌ | ❌ | ✅ | 1M audible videos with human-annotated audiovisual captions. |