
Awesome-Omni-Large-Models-and-Datasets

πŸ”₯ Omni large models and datasets for understanding and generating multi-modalities.

Table of contents generated with markdown-toc

😎 Models

πŸ—’οΈ Taxonomy

πŸ•ΉοΈ Modality Understanding

*The last four columns represent the input modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| OMCAT: Omni Context Aware Transformer arXiv | OMCAT project_repo | unreleased | βœ“ | βœ“ | βœ“ | βœ“ |
| Baichuan-Omni Technical Report arXiv | Baichuan-Omni project_repo | hf_checkpoint | βœ“ | βœ“ | βœ“ | βœ“ |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM arXiv | VITA project_repo | unreleased | βœ“ | βœ— | βœ“ | βœ— |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs arXiv | VideoLLaMA 2 project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model arXiv | GroundingGPT project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset NeurIPS | VAST project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | VALOR project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ“ |

πŸ§™ Modality Generation

*The last four columns represent the output modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | - | - | - | - | - |

🌈 Unified Model for Understanding and Generating Modalities

*The last four columns represent the input & output modalities supported by the model.

| Title | Model | Checkpoint | Text | Image | Audio | Video |
| --- | --- | --- | --- | --- | --- | --- |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation arXiv | Janus project_repo | hf_checkpoint | βœ“ | βœ“ | βœ— | βœ— |
| Emu3: Next-Token Prediction is All You Need arXiv | Emu3 project_repo | hf_checkpoint ms_checkpoint<br>hf_checkpoint ms_checkpoint<br>hf_checkpoint ms_checkpoint | βœ“ | βœ“ | βœ“ | βœ— |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation arXiv | unreleased | unreleased | βœ“ | βœ“ | βœ— | βœ“ |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation arXiv | Show-o project_repo | github_model_zoos | βœ“ | βœ“ | βœ— | βœ— |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model arXiv | Transfusion project_repo | unreleased | βœ“ | βœ“ | βœ— | βœ— |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing NeurIPS 2024 | VITRON project_repo | github_model_zoos | βœ“ | βœ“ | βœ“ | βœ— |

✨️ Datasets

Pretraining Dataset

Training Dataset

| Dataset Name | Paper | Link | Audio-Image-Text | Speech-Video-Text | Audio-Video-Text | Detail |
| --- | --- | --- | --- | --- | --- | --- |
| OCTAV | OMCAT: Omni Context Aware Transformer arXiv | unreleased | βœ— | βœ— | βœ“ | OCTAV-ST has 127,507 unique videos with single QA pairs; OCTAV-MT has 25,457 unique videos with a total of 180,916 QA pairs. |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset NeurIPS | VAST project_repo | βœ— | βœ— | βœ“ | 27M clips; 297M captions. |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | link | βœ— | βœ— | βœ“ | description |

Benchmark

*SI and SO denote the input and output modalities supported by the benchmark, respectively.

| Name | Paper | Link | SI: Text | SI: Image | SI: Audio | SI: Video | SO: Text | Detail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OmnixR | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities arXiv | unreleased | βœ“ | βœ“ | βœ“ | βœ“ | βœ“ | $\text{OmnixR}_\text{synth}$: 100 videos; $\text{OmnixR}_\text{real}$: 100 videos |
| OmniBench | OmniBench: Towards The Future of Universal Omni-Language Models arXiv | hf_checkpoint<br>hf_checkpoint | βœ“ | βœ“ | βœ“ | βœ— | βœ— | |
| VALOR-32K | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset arXiv | link | βœ— | βœ— | βœ“ | | | description |

🌟 Star History

Star History Chart

β™₯️ Contributors

Contributors for Awesome Omni Large Models and Datasets