We present ShapeLLM, the first 3D Multimodal Large Language Model designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi and Kaisheng Ma
1. ShapeLLM is the first 3D Multimodal Large Language Model designed for embodied interaction
.
2. ShapeLLM supports single-view colored point cloud input
, which can be effortlessly obtained from RGBD cameras.
3. We introduce a robust 3D QA benchmark, 3D MM-Vet
, encompassing various variants including single-view, noise jitter, etc.
4. We extend the powerful point encoder architecture, ReCon++
, achieving state-of-the-art performance across a range of representation learning tasks.
- Clone this repository and navigate to ShapeLLM folder
git clone https://github.com/qizekun/ShapeLLM.git
cd ShapeLLM
- Install Package
conda create -n shapellm python=3.10 -y
conda activate shapellm
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install PointNet++
pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
Please check out our Model Zoo for all public ShapeLLM checkpoints.
Chat about point clouds using CLI interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference.
If you encounter issues accessing Huggingface, please use export HF_ENDPOINT=https://hf-mirror.com
.
python -m llava.serve.cli \
--model-path qizekun/ShapeLLM_13B_general_v1.0 \
--pts-file assets/instrument.npy
Consistent with LLaVA, we adopt a two-stage training approach. In the first stage, we solely fine-tune the projector for semantic alignment. In the second stage, we conduct full fine-tuning using Instruction Following data.
Download data following DATA, organize the data as follows in ./playground/data/shapellm/
,
│playground/data/shapellm/
├── cap3d_objaverse_785k.json
├── cap3d_objaverse_sft_45k.json
├── gapartnet_sft_27k_openai.json
├── gapartnet_pcs
│ ├── Box_100129_0_0.npy
│ └── ...
└── cap3d_pcs
├── 00000054c36d44a2a483bdbff31d8edf.pt
└── ...
Furthermore, ShapeLLM utilizes the Large version of ReCon++ as the point encoder.
You need to download the ReCon++ weight and save it to ./checkpoints/recon/large.pth
.
│checkpoints/recon/
└── large.pth
1. Feature Alignment Stage
sh scripts/pretrain.sh
2. Visual Instruction Tuning Stage
sh scripts/finetune.sh
The training takes around 14 hours for ShapeLLM-13B on 8x A100 (80G). It takes around 7 hours for ShapeLLM-7B.
Evaluate 3D MLLMs for integrated capabilities and embodied interaction capabilities, run the script:
sh scripts/eval/mmvet.sh
Using GPT-4 to calulate the 3D MM-Vet score:
sh scripts/eval/eval_mmvet.sh
Evaluate the performance of ShapeLLM on the GApartNet dataset, run the script:
sh scripts/eval/gapartnet_ref.sh
Calucate the generative 3D visual grounding accuracy:
sh scripts/eval/eval_gapartnet.sh
Please check out our Model Zoo for all public ReCon++ checkpoints.
Download and organize data following DATA.
If you encounter issues accessing Huggingface, please use export HF_ENDPOINT=https://hf-mirror.com
.
ReCon++ adopts a two-stage pre-training approach, initially conducting generative pre-training in either random or causal form, followed by cross-modal contrastive learning. It is worth noting that we employ a gradient stopping strategy for transfer learning tasks, while we do not use gradient stopping for zero-shot tasks.
sh ReConV2/scripts/pretrain_zeroshot/pretrain_reconstruct.sh <exp_name>
sh ReConV2/scripts/pretrain_transfer/pretrain_reconstruct.sh <exp_name>
sh ReConV2/scripts/pretrain_zeroshot/pretrain_contrast.sh <exp_name> <path/to/stage1-pre-trained/model>
sh ReConV2/scripts/pretrain_transfer/pretrain_contrast.sh <exp_name> <path/to/stage1-pre-trained/model>
Model | Version | OBJ_BG | OBJ_ONLY | PB_T50_RS | MN-40 1k | MN-40 8k |
---|---|---|---|---|---|---|
ACT | Small | 93.29% | 91.91% | 88.21% | 93.7% | 94.0% |
ReCon | Small | 95.35% | 93.80% | 91.26% | 94.5% | 94.7% |
PointGPT | Base | 95.8% | 95.2% | 91.9% | 94.4% | 94.6% |
ReCon++ | Base | 98.62% | 96.21% | 93.34% | 94.6% | 94.8% |
ReCon++ | Large | 98.80% | 97.59% | 95.25% | 94.8% | 95.0% |
Fine-tuning with the default configuration, run the script:
bash ReConV2/scripts/downstream/cls.sh <GPU> <exp_name> <path/to/pre-trained/model>
Test&Voting with the default configuration, run the script:
bash ReConV2/scripts/downstream/test.sh <GPU> <exp_name> <path/to/best/fine-tuned/model>
Model | Version | 5w10s (%) | 5w20s (%) | 10w10s (%) | 10w20s (%) |
---|---|---|---|---|---|
ACT | Small | 96.8 ± 2.3 | 98.0 ± 1.4 | 93.3 ± 4.0 | 95.6 ± 2.8 |
ReCon | Small | 97.3 ± 1.9 | 98.9 ± 1.2 | 93.3 ± 3.9 | 95.8 ± 3.0 |
PointGPT | Large | 98.0 ± 1.9 | 99.0 ± 1.0 | 94.1 ± 3.3 | 96.1 ± 2.8 |
ReCon++ | Large | 98.0 ± 2.3 | 99.5 ± 0.8 | 94.5 ± 4.1 | 96.5 ± 3.0 |
Few-shot with the default configuration, run the script:
sh ReConV2/scripts/downstream/fewshot.sh <GPU> <exp_name> <path/to/pre-trained/model> <way> <shot> <fold>
Model | Version | Objaverse-LVIS | ModelNet40 | ScanObjectNN |
---|---|---|---|---|
OpenShape | Base | 46.8% | 84.4% | 52.2% |
Uni3D | Base | 51.7% | 86.3% | 63.8% |
Uni3D | Large | 53.1% | 86.3% | 58.2% |
ReCon++ | Base | 53.2% | 86.5% | 63.6% |
ReCon++ | Large | 53.7% | 87.3% | 65.4% |
In the pre-training process, Zero-shot evaluation is enabled by default. Zero-shot with the default configuration, run the script:
bash ReConV2/scripts/downstream/zeroshot.sh <GPU> <exp_name> <path/to/pre-trained/model>
3D MM-Vet is a carefully crafted multi-level 3D QA benchmark that consists of 59 unique 3D models and 232 human-written questions and answers with rich content.
The test data and scripts have been uploaded to Hugging Face. You can also locate the evaluation scripts from the codebase of ShapeLLM.
Furthermore, we propose 3D MM-Vet-C, which contains three variants: single-view, jitter, and rotation. They represent extracting partial point clouds of the front view field of view, adding Gaussian noise to the point cloud xyz, and random rotation on the x, y, and z axes, respectively.
Here is a more detailed explanation of each variant:
- Single-view: This variant focuses on the model's ability to understand the 3D object from a single viewpoint. To create the single-view variant, we extract the front-view point cloud of each model.
- Jitter: This variant tests the model's robustness to noise. To create the jitter variant, we add Gaussian noise with zero mean and variance of 0.01 to the point cloud xyz.
- Rotation: This variant examines the model's ability to understand the 3D scene from different viewpoints. To create the rotation variant, we randomly apply 30 degrees of random rotation on the x, y, and z axes.
We believe that 3D MM-Vet and 3D MM-Vet-C are valuable resources for the 3D QA community. They can be used to evaluate the performance of existing models and to develop new models that are better at understanding and reasoning about 3D objects.
We use PointVisualizaiton repo to render beautiful point cloud images, including specified color rendering and attention distribution rendering.
This codebase is built upon LLaVA, OpenShape, ReCon and PointGPT.
If you find ShapeLLM or ReCon++ useful for your research and applications, please cite using this BibTeX:
@article{qi2024shapellm,
author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},
title = {ShapeLLM: Universal 3D Object Understanding for Embodied Interaction},
journal = {arXiv preprint arXiv:2402.17766},
year = {2024}
}
and closely related work ReCon and ACT:
@inproceedings{qi2023recon,
title={Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining},
author={Qi, Zekun and Dong, Runpei and Fan, Guofan and Ge, Zheng and Zhang, Xiangyu and Ma, Kaisheng and Yi, Li},
booktitle={International Conference on Machine Learning (ICML) },
url={https://openreview.net/forum?id=80IfYewOh1},
year={2023}
}
@inproceedings{dong2023act,
title={Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?},
author={Runpei Dong and Zekun Qi and Linfeng Zhang and Junbo Zhang and Jianjian Sun and Zheng Ge and Li Yi and Kaisheng Ma},
booktitle={The Eleventh International Conference on Learning Representations (ICLR) },
url={https://openreview.net/forum?id=8Oun8ZUVe8N},
year={2023}
}