Xiangxiang Chu2†, Richong Zhang1‡
Clone this repository
git clone git@github.com:lerogo/MMGenBench.git
cd MMGenBench
Download dataset
huggingface-cli download --repo-type dataset lerogo/MMGenBench --local-dir MMGenBench-data
Install the required dependencies, including torch, transformers, diffusers, and unicom (used to extract image representations).
We use InternVL2-2B as an example. The structure of the code and data is as follows.
.
├── MMGenBench-data # The MMGenBench-Test/Domain dataset we downloaded from huggingface
│ ├── MMGenBench-Domain.json
│ ├── MMGenBench-Domain.tsv
│ ├── MMGenBench-Test-label-count.json
│ ├── MMGenBench-Test-label-index.json
│ ├── MMGenBench-Test.json
│ ├── MMGenBench-Test.tsv
│ ├── README.md
│ └── check.py
├── README.md # This file
├── evalimg # For extracting features and calculating metrics using the image representation model
│ ├── metric_fid.py
│ ├── output
│ │ ├── InternVL2-2B_MMGenBench-Domain.json
│ │ └── InternVL2-2B_MMGenBench-Test.json
│ ├── requirements.txt
│ ├── run.py
│ └── run.sh
├── generate # For processing LMMs' output with the text-to-image models
│ ├── flux.py
│ ├── input
│ │ ├── InternVL2-2B_MMGenBench-Domain.xlsx
│ │ └── InternVL2-2B_MMGenBench-Test.xlsx
│ ├── kolors.py
│ ├── lumina.py
│ ├── output
│ │ ├── InternVL2-2B_MMGenBench-Domain.tsv
│ │ └── InternVL2-2B_MMGenBench-Test.tsv
│ ├── requirements.txt
│ ├── run.py
│ ├── run.sh
│ ├── sd.py
│ └── tools.py
└── visual # For visualization
  ├── outputs
  │ ├── InternVL2-2B_MMGenBench-Domain.json
  │ ├── InternVL2-2B_MMGenBench-Domain.xlsx
  │ ├── InternVL2-2B_MMGenBench-Test.json
  │ └── InternVL2-2B_MMGenBench-Test.xlsx
  ├── run.py
  └── run.sh
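Before running the later steps, it can help to confirm that the dataset download is complete. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, and the file names are simply copied from the listing above.

```python
from pathlib import Path

# File names copied from the MMGenBench-data listing above.
DATASET_FILES = [
    "MMGenBench-Domain.json",
    "MMGenBench-Domain.tsv",
    "MMGenBench-Test-label-count.json",
    "MMGenBench-Test-label-index.json",
    "MMGenBench-Test.json",
    "MMGenBench-Test.tsv",
]


def missing_dataset_files(data_dir):
    """Return the listed dataset files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in DATASET_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # "MMGenBench-data" is the --local-dir used in the download command.
    missing = missing_dataset_files("MMGenBench-data")
    print("missing files:", missing or "none")
```

An empty result means every listed file is present. The check only looks at file names and does not validate their contents.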
Adapt your model in VLMEvalKit and use MMGenBench for inference.
Run command:
torchrun --nproc-per-node=4 run.py --model <YOUR LMM> --data MMGenBench-Test MMGenBench-Domain --mode infer --verbose
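If you launch inference from a script, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_infer_command` is a hypothetical helper, the flags are copied verbatim from the command above, and the model name must be one that your VLMEvalKit setup recognizes.

```python
import shlex


def build_infer_command(model, nproc=4,
                        datasets=("MMGenBench-Test", "MMGenBench-Domain")):
    """Build the torchrun argument list shown in this README."""
    return [
        "torchrun", f"--nproc-per-node={nproc}", "run.py",
        "--model", model,
        "--data", *datasets,
        "--mode", "infer",
        "--verbose",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_infer_command("InternVL2-2B")))
```

The list can be passed to `subprocess.run(cmd, check=True)` from inside the VLMEvalKit directory, where `run.py` lives.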
Using InternVL2-2B as an example, inference produces two files: InternVL2-2B_MMGenBench-Test.xlsx and InternVL2-2B_MMGenBench-Domain.xlsx. Put them in the ./generate/input folder.
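Copying the prediction files can be scripted. The sketch below is an illustration only: `stage_predictions` is a hypothetical helper, and the source directory is an assumption, since where VLMEvalKit writes its results depends on your setup. It searches the source directory recursively for files named `<model>_MMGenBench-*.xlsx`.

```python
import shutil
from pathlib import Path


def stage_predictions(src_dir, model, dst_dir="generate/input"):
    """Copy <model>_MMGenBench-*.xlsx from src_dir (searched recursively) into dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() collects all matches before any copy is made.
    for src in sorted(src_dir.rglob(f"{model}_MMGenBench-*.xlsx")):
        shutil.copy2(src, dst_dir / src.name)
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # "path/to/vlmevalkit/outputs" is a placeholder; replace it with your own path.
    print(stage_predictions("path/to/vlmevalkit/outputs", "InternVL2-2B"))
```

If the same file name appears in several subdirectories, the last match overwrites earlier ones, so point `src_dir` at a single run when in doubt.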
Modify ./generate/run.sh to select the text-to-image model and the number of GPUs to use, then run:
cd generate
bash run.sh
This produces two files: ./generate/output/InternVL2-2B_MMGenBench-Test.tsv and ./generate/output/InternVL2-2B_MMGenBench-Domain.tsv.
We use the unicom model to extract features from the original and generated images, so you need to install unicom (https://github.com/deepglint/unicom).
Modify ./evalimg/run.sh to evaluate performance on MMGenBench-Test and MMGenBench-Domain respectively, then run:
cd evalimg
bash run.sh
This produces two files: ./evalimg/output/InternVL2-2B_MMGenBench-Test.json and ./evalimg/output/InternVL2-2B_MMGenBench-Domain.json.
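Each stage writes files that follow the same `<model>_<dataset>` naming pattern, so a short script can report how far the pipeline has progressed for a given model. This is a sketch only: `pipeline_status` and the stage labels are hypothetical names, while the folders and extensions are taken from the directory listing in this README.

```python
from pathlib import Path

DATASETS = ("MMGenBench-Test", "MMGenBench-Domain")

# Stage label -> (folder, file extension), following the listing in this README.
STAGES = {
    "inference": ("generate/input", ".xlsx"),
    "generation": ("generate/output", ".tsv"),
    "evaluation": ("evalimg/output", ".json"),
}


def pipeline_status(model, repo_root="."):
    """Map each stage to True if its files exist for both datasets."""
    root = Path(repo_root)
    status = {}
    for stage, (folder, ext) in STAGES.items():
        status[stage] = all(
            (root / folder / f"{model}_{ds}{ext}").is_file() for ds in DATASETS
        )
    return status


if __name__ == "__main__":
    for stage, done in pipeline_status("InternVL2-2B").items():
        print(f"{stage}: {'done' if done else 'pending'}")
```

A stage counts as done only when the files for both datasets are present. The script does not inspect file contents.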
Run command:
cd visual
bash run.sh
You can see the relevant results, including metrics and visualizations, in the ./visual/outputs folder.
If you have any questions, please submit an issue or contact lerogohl<AT>gmail.com.
If you find MMGenBench or its code useful, please cite:
@misc{huang2024MMGenBench,
title={MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective},
author={Hailang Huang and Yong Wang and Zixuan Huang and Huaqiu Li and Tongwen Huang and Xiangxiang Chu and Richong Zhang},
year={2024},
eprint={2411.14062},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.14062},
}