LLMBind: A Unified Modality-Task Integration Framework

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


πŸ’‘ I also have other multi-modal projects that may interest you ✨.

Open-Sora-Plan
PKU-Yuan Lab and Tuzhan AI, et al.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

News

  • [2024.06.15] πŸ€— The Hugging Face demo will be available soon! Welcome to watch πŸ‘€ this repository for the latest updates.
  • [2024.06.15] πŸ€— We have released part of our interactive generation and editing dataset on Hugging Face.

Highlights

LLMBind demonstrates promising results in advancing the development of human-like MLLMs and AI agents.

πŸ”₯ A unified model integration framework

  • We design a unified model integration framework that expands task-specific tokens for diverse modality tasks, making it easy to integrate different tasks into a single unified LLM. On top of this framework, we introduce the MoE technique to better handle these diverse modality tasks.
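
As a rough sketch of the token-expansion idea (the token names, checkpoint path, and model class below are illustrative assumptions, not LLMBind's exact choices), new task-specific tokens can be registered in the tokenizer and the LLM's embedding table resized to match:

# Hedged sketch: add task-specific tokens to an LLM vocabulary.
# Token names and the checkpoint path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "models/LLaVA-7B-v1-1"  # merged LLaVA weights from Section 3.1
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

task_tokens = ["[SEG]", "<image_gen>", "<image_edit>", "<video_gen>", "<audio_gen>"]
tokenizer.add_special_tokens({"additional_special_tokens": task_tokens})

# Give the new token ids embedding rows (and LM-head columns).
model.resize_token_embeddings(len(tokenizer))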

πŸ”₯ A unified MLLM with various modality tasks

  • We propose a unified MLLM that is compatible with various modality tasks, including image segmentation, image generation, image editing, video generation, and audio generation.

πŸ”₯ Interactive generation and editing datasets

  • To facilitate the development of user-friendly interactive tasks, we construct a dataset of 400k interactive generation and editing multi-turn dialogues using ChatGPT. We plan to release this dataset as an open resource to foster collaborative advancements in this field.

1. Installation

git clone https://github.com/PKU-YuanGroup/LLMBind
cd LLMBind
conda create -n llmbind python=3.8 -y
conda activate llmbind
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
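
After installing, a quick sanity check (a minimal sketch; it only assumes PyTorch and flash-attn from the steps above) confirms that CUDA and flash-attn are usable:

# Minimal environment check after installation.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn not importable:", err)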

2. Dataset preparation

2.1 Interactive generation and editing dataset:

Download the files from LLMBind-GPT-Interactive-Data and place them in the llmbind_dataset folder.

β”œβ”€β”€ llmbind_dataset
β”‚   β”œβ”€β”€ interactive_dataset
β”‚   β”‚   β”œβ”€β”€ interactive_audio_t2x_format.json
β”‚   β”‚   β”œβ”€β”€ interactive_image_t2x_format.json
β”‚   β”‚   β”œβ”€β”€ interactive_video_t2x_format.json
β”‚   β”‚   └── interactive_generation_and_editing_format.json
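
A minimal download sketch with huggingface_hub is shown below; the repo_id is an assumption based on the dataset name above, so adjust it to the actual Hub id:

# Hedged sketch: fetch the interactive generation/editing data from the Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="PKU-YuanGroup/LLMBind-GPT-Interactive-Data",  # assumed repo id
    repo_type="dataset",
    local_dir="llmbind_dataset/interactive_dataset",
    local_dir_use_symlinks=False,
)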

2.2 Reasoning segmentation, referring segmentation & VQA datasets:

Download them and put them into the llmbind_dataset folder.

β”œβ”€β”€ llmbind_dataset
β”‚   β”œβ”€β”€ ade20k
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   └── images
β”‚   β”œβ”€β”€ coco
β”‚   β”‚   └── train2017
β”‚   β”‚       β”œβ”€β”€ 000000000009.jpg
β”‚   β”‚       └── ...
β”‚   β”œβ”€β”€ cocostuff
β”‚   β”‚   └── train2017
β”‚   β”‚       β”œβ”€β”€ 000000000009.png
β”‚   β”‚       └── ...
β”‚   β”œβ”€β”€ llava_dataset
β”‚   β”‚   β”œβ”€β”€ llava_instruct_150k.json
β”‚   β”‚   └── llava_v1_5_mix665k.json
β”‚   β”œβ”€β”€ mapillary
β”‚   β”‚   β”œβ”€β”€ config_v2.0.json
β”‚   β”‚   β”œβ”€β”€ testing
β”‚   β”‚   β”œβ”€β”€ training
β”‚   β”‚   └── validation
β”‚   β”œβ”€β”€ reason_seg
β”‚   β”‚   └── ReasonSeg
β”‚   β”‚       β”œβ”€β”€ train
β”‚   β”‚       β”œβ”€β”€ val
β”‚   β”‚       └── explanatory
β”‚   β”œβ”€β”€ refer_seg
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ saiapr_tc-12
β”‚   β”‚   β”‚   └── mscoco
β”‚   β”‚   β”‚       └── images
β”‚   β”‚   β”‚           └── train2014
β”‚   β”‚   β”œβ”€β”€ refclef
β”‚   β”‚   β”œβ”€β”€ refcoco
β”‚   β”‚   β”œβ”€β”€ refcoco+
β”‚   β”‚   └── refcocog
β”‚   └── vlpart
β”‚       β”œβ”€β”€ paco
β”‚       β”‚   └── annotations
β”‚       └── pascal_part
β”‚           β”œβ”€β”€ train.json
β”‚           └── VOCdevkit
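
Since training expects this exact layout, a small check like the sketch below (paths taken from the tree above; extend it as needed) can catch missing downloads before launching a run:

# Verify the expected dataset layout before training.
from pathlib import Path

root = Path("llmbind_dataset")
expected = [
    "ade20k/annotations", "ade20k/images",
    "coco/train2017", "cocostuff/train2017",
    "llava_dataset/llava_v1_5_mix665k.json",
    "mapillary/training", "reason_seg/ReasonSeg/train",
    "refer_seg/refclef", "refer_seg/refcoco", "refer_seg/refcoco+", "refer_seg/refcocog",
    "vlpart/paco/annotations", "vlpart/pascal_part/VOCdevkit",
]
missing = [p for p in expected if not (root / p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")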

3. Pre-trained weights

3.1 LLaVA weights

To train LLMBind-7B, you need to follow the instructions to merge the LLaVA delta weights. Typically, we use the final LLaVA-Lightning-7B-v1-1 weights, merged from liuhaotian/LLaVA-Lightning-7B-delta-v1-1.

# Download the LLaVA delta weights and the base LLaMA-7B weights.
from huggingface_hub import snapshot_download

repo_name = 'liuhaotian/LLaVA-Lightning-7B-delta-v1-1'
snapshot_download(repo_id=repo_name, local_dir="models/liuhaotian/LLaVA-Lightning-7B-delta-v1-1", local_dir_use_symlinks=False, max_workers=1)

repo_name = 'yahma/llama-7b-hf'
snapshot_download(repo_id=repo_name, local_dir="models/yahma/llama-7b-hf", local_dir_use_symlinks=False, max_workers=1)

# Apply the delta to the base LLaMA-7B weights (run from the repository root).
PATH_TO_LLAMA_7B=models/yahma/llama-7b-hf
PATH_TO_LLAVA_DELTA=models/liuhaotian/LLaVA-Lightning-7B-delta-v1-1
TARGET_PATH=models/LLaVA-7B-v1-1
python3 -m model.apply_delta \
    --base $PATH_TO_LLAMA_7B \
    --target $TARGET_PATH \
    --delta $PATH_TO_LLAVA_DELTA

3.2 SAM weights

Download the SAM ViT-H pre-trained weights (sam_vit_h_4b8939.pth).
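
A minimal Python sketch for fetching the checkpoint is below; the URL is the publicly documented segment-anything release link, and the target folder is an assumption:

# Download the SAM ViT-H checkpoint used as the vision backbone.
from pathlib import Path
import urllib.request

Path("models").mkdir(exist_ok=True)
url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(url, "models/sam_vit_h_4b8939.pth")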

4. Training

PATH_TO_LLaVA="PATH_TO_LLaVA"
PATH_TO_SAM="PATH_TO_SAM"
deepspeed --include localhost:0,1,2,3,4,5,6,7 train_ds.py \
  --version=$PATH_TO_LLaVA \
  --dataset_dir='./llmbind_dataset' \
  --vision_pretrained=$PATH_TO_SAM \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="llmbind-7b" \
  --steps_per_epoch 500 \
  --epochs 10 \
  --batch_size  16 \
  --model_max_length  768 \
  --add_generation_token \
  --add_edit_token \
  --add_video_generation_token \
  --add_audio_generation_token \
  --vqa_sample_rates='2,70,70,70' \
  --vqa_data "interactive_generation_and_editing_format.json||interactive_video_t2x_format||interactive_image_t2x_format||interactive_audio_t2x_format"

5. Merge LoRA Weights

When training is finished, convert the DeepSpeed checkpoint to obtain the full model weights:

cd ./runs/llmbind-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Then merge the LoRA weights contained in pytorch_model.bin and save the resulting model, in the Hugging Face format, to your desired path:

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"
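
As a quick, hedged sanity check on the exported folder (the path is the placeholder from above), the tokenizer should load from it and report the extra task tokens added during training:

# Sanity-check the merged Hugging Face export.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("PATH_TO_SAVED_MODEL")
print("vocab size:", len(tok))
print("extra special tokens:", tok.additional_special_tokens)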

6. Inference

  • To chat with LLMBind:
CUDA_VISIBLE_DEVICES=0 python chat.py --version="PATH_TO_SAVED_HF_MODEL"

For example:

HF_DATASETS_OFFLINE=1 CUDA_VISIBLE_DEVICES=7 python chat.py --version="runs/llmbind-7b/hf_weights"

Acknowledgement

License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation πŸ“.

@article{zhu2024llmbind,
  title={LLMBind: A Unified Modality-Task Integration Framework},
  author={Zhu, Bin and Jin, Peng and Ning, Munan and Lin, Bin and Huang, Jinfa and Song, Qi and Pan, Mingjun and Yuan, Li},
  journal={arXiv preprint arXiv:2402.14891},
  year={2024}
}

🀝 Contributors
