
Awesome-Omni-Large-Models-and-Datasets

πŸ”₯ Omni large models and datasets for understanding and generating multi-modalities.

Table of contents generated with markdown-toc

😎 Models

πŸ—’οΈ Taxonomy

πŸ•ΉοΈ Modality Understanding

*The last four columns represent the input modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| OMCAT: Omni Context Aware Transformer arXiv | OMCAT project_repo | unreleased | βœ“ | βœ“ | βœ“ | βœ“ |
| Baichuan-Omni Technical Report arXiv | Baichuan-Omni project_repo | hf_checkpoint | βœ“ | βœ“ | βœ“ | βœ“ |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM arXiv | VITA project_repo | unreleased | βœ“ | βœ— | βœ“ | βœ— |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs arXiv | VideoLLaMA 2 project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model arXiv | GroundingGPT project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset NeurIPS | VAST project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | VALOR project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |

πŸ§™ Modality Generation

*The last four columns represent the output modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | - | - | - | - | - |

🌈 Unified Model for Understanding and Generating Modalities

*The last four columns represent the input & output modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation arXiv | Janus project_repo | hf_checkpoint | βœ“ | βœ“ | βœ— | βœ— |
| Emu3: Next-Token Prediction is All You Need arXiv | Emu3 project_repo | hf_checkpoint ms_checkpoint<br>hf_checkpoint ms_checkpoint<br>hf_checkpoint ms_checkpoint | βœ“ | βœ“ | βœ“ | βœ— |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation arXiv | unreleased | unreleased | βœ“ | βœ“ | βœ— | βœ“ |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation arXiv | Show-o project_repo | github_model_zoos | βœ“ | βœ“ | βœ— | βœ— |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model arXiv | Transfusion project_repo | unreleased | βœ“ | βœ“ | βœ— | βœ— |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing NeurIPS 2024 | VITRON project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ— |

✨️ Datasets

Pretraining Dataset

Training Dataset

| Dataset Name | Paper | Link | Audio-Image-Text | Speech-Video-Text | Audio-Video-Text | Detail |
| --- | --- | --- | --- | --- | --- | --- |
| OCTAV | OMCAT: Omni Context Aware Transformer arXiv | unreleased | βœ— | βœ— | βœ“ | OCTAV-ST has 127,507 unique videos with single QA pairs; OCTAV-MT has 25,457 unique videos with a total of 180,916 QA pairs. |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset NeurIPS | VAST project_repo | βœ— | βœ— | βœ“ | 27M clips; 297M captions. |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | link | βœ— | βœ— | βœ“ | description |

Benchmark

*SI and SO denote the input and output modalities supported by the benchmark, respectively.

| Name | Paper | Link | SI: Text | SI: Image | SI: Audio | SI: Video | SO: Text | Detail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OmnixR | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities arXiv | unreleased | βœ“ | βœ“ | βœ“ | βœ“ | βœ“ | $\text{OmnixR}_\text{synth}$: 100 videos; $\text{OmnixR}_\text{real}$: 100 videos |
| OmniBench | OmniBench: Towards The Future of Universal Omni-Language Models arXiv | hf_checkpoint<br>hf_checkpoint | βœ“ | βœ“ | βœ“ | βœ— | βœ— | |
| VALOR-32K | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | link | βœ— | βœ— | βœ“ | | | description |

🌟 Star History

Star History Chart

β™₯️ Contributors

Contributors for Awesome Omni Large Models and Datasets