💡 I also have other multi-modal projects that may interest you ✨.
Open-Sora-Plan
PKU-Yuan Lab and Tuzhan AI etc.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan
- [2024.06.15] 🤗 The Hugging Face demo will be available soon! Welcome to watch 👀 this repository for the latest updates.
- [2024.06.15] 🤗 We have released part of our interactive generation and editing dataset on Hugging Face.
LLMBind demonstrates promising results in advancing the development of human-like MLLMs and AI agents.
- We design a unified model integration framework that expands the LLM's vocabulary with task-specific tokens for diverse modality tasks, so that different tasks can be easily integrated into a single LLM; within this framework we introduce the MoE technique to better handle the diverse modality tasks (see the sketch after this list).
- We propose a unified MLLM that is compatible with various modality tasks, including image segmentation, image generation, image editing, video generation, and audio generation.
- To facilitate the development of user-friendly interactive tasks, we construct a dataset of 400k interactive generation and editing multi-turn dialogues using ChatGPT. We plan to release this dataset as an open resource to foster collaborative advancements in this field.
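To make the token-expansion idea concrete, here is a minimal sketch (illustrative only, not the repository's actual code; the token strings and model path are assumptions) of how task-specific special tokens can be registered in a tokenizer and the LLM's embedding table resized accordingly:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: register hypothetical task-routing tokens and resize the
# embedding table. The token strings and model path are placeholders, not
# LLMBind's actual vocabulary or checkpoints.
model_path = "models/LLaVA-7B-v1-1"  # example path; see the weight preparation steps below
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Hypothetical tokens for segmentation / generation / editing / video / audio tasks.
task_tokens = ["[SEG]", "[GEN]", "[EDIT]", "[VID]", "[AUD]"]
tokenizer.add_tokens(task_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table to cover the new tokens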
git clone https://github.com/PKU-YuanGroup/LLMBind
cd LLMBind
conda create -n llmbind python=3.8 -y
conda activate llmbind
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
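Before moving on, you can optionally verify the environment; the short check below (not part of the official setup) confirms that PyTorch sees a GPU and that flash-attn installed correctly:

# Optional environment sanity check.
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)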
Download them from LLMBind-GPT-Interactive-Data and put them into the llmbind_dataset folder, organized as follows.
├── llmbind_dataset
│   ├── interactive_dataset
│   │   ├── interactive_audio_t2x_format.json
│   │   ├── interactive_image_t2x_format.json
│   │   ├── interactive_video_t2x_format.json
│   │   └── interactive_generation_and_editing_format.json
Download the following datasets and put them into the llmbind_dataset folder, organized as follows.
├── llmbind_dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   ├── llava_v1_5_mix665k.json
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
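Before launching training, a quick check such as the one below (a sketch; adjust the expected list to the datasets you actually downloaded) confirms the folders are in place:

# Sanity-check the dataset layout under llmbind_dataset.
from pathlib import Path

root = Path("llmbind_dataset")
expected = ["interactive_dataset", "ade20k", "coco", "cocostuff", "llava_dataset",
            "mapillary", "reason_seg", "refer_seg", "vlpart"]
missing = [name for name in expected if not (root / name).is_dir()]
print("missing folders:", missing or "none")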
To train LLMBind-7B, you need to follow the instructions to merge the LLaVA delta weights. Typically, we use the final weights LLaVA-Lightning-7B-v1-1, merged from liuhaotian/LLaVA-Lightning-7B-delta-v1-1.
from huggingface_hub import snapshot_download

# Download the LLaVA-Lightning-7B delta weights.
repo_name = 'liuhaotian/LLaVA-Lightning-7B-delta-v1-1'
snapshot_download(repo_id=repo_name, local_dir="models/liuhaotian/LLaVA-Lightning-7B-delta-v1-1", local_dir_use_symlinks=False, max_workers=1)

# Download the LLaMA-7B base weights.
repo_name = 'yahma/llama-7b-hf'
snapshot_download(repo_id=repo_name, local_dir="models/yahma/llama-7b-hf", local_dir_use_symlinks=False, max_workers=1)
cd model
PATH_TO_LLAMA_7B=models/yahma/llama-7b-hf
PATH_TO_LLAVA_DELTA=models/liuhaotian/LLaVA-Lightning-7B-delta-v1-1
TARGET_PATH=models/LLaVA-7B-v1-1
python3 -m model.apply_delta \
--base $PATH_TO_LLAMA_7B \
--target $TARGET_PATH \
--delta $PATH_TO_LLAVA_DELTA
Download the SAM ViT-H pre-trained weights from sam_vit_h_4b8939.pth.
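If you prefer to script the download, the checkpoint can be fetched from the official Segment Anything release URL, for example:

# Download the SAM ViT-H checkpoint from the official release URL.
import urllib.request

url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(url, "sam_vit_h_4b8939.pth")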
PATH_TO_LLaVA="PATH_TO_LLaVA"
PATH_TO_SAM="PATH_TO_SAM"
deepspeed --include localhost:0,1,2,3,4,5,6,7 train_ds.py \
--version=$PATH_TO_LLaVA \
--dataset_dir='./llmbind_dataset' \
--vision_pretrained=$PATH_TO_SAM \
--dataset="sem_seg||refer_seg||vqa||reason_seg" \
--sample_rates="9,3,3,1" \
--exp_name="llmbind-7b" \
--steps_per_epoch 500 \
--epochs 10 \
--batch_size 16 \
--model_max_length 768 \
--add_generation_token \
--add_edit_token \
--add_video_generation_token \
--add_audio_generation_token \
--vqa_sample_rates='2,70,70,70' \
--vqa_data "interactive_generation_and_editing_format.json||interactive_video_t2x_format||interactive_image_t2x_format||interactive_audio_t2x_format"
When training is finished, to get the full model weights:
cd ./runs/llmbind-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
Merge the LoRA weights of pytorch_model.bin, and save the resulting model into your desired path in the Hugging Face format:
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
--version="PATH_TO_LLaVA" \
--weight="PATH_TO_pytorch_model.bin" \
--save_path="PATH_TO_SAVED_MODEL"
- To chat with LLMBind:
CUDA_VISIBLE_DEVICES=0 python chat.py --version="PATH_TO_SAVED_HF_MODEL"
For example:
HF_DATASETS_OFFLINE=1 CUDA_VISIBLE_DEVICES=7 python chat.py --version="runs/llmbind-7b/hf_weights"
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
@article{zhu2024llmbind,
title={LLMBind: A Unified Modality-Task Integration Framework},
author={Zhu, Bin and Jin, Peng and Ning, Munan and Lin, Bin and Huang, Jinfa and Song, Qi and Pan, Mingjun and Yuan, Li},
journal={arXiv preprint arXiv:2402.14891},
year={2024}
}