You can use `script/merge_lora_weights.py` to merge the LoRA weights with the base LLM, and then evaluate the merged model as described in evaluation_full.md.
```shell
python script/merge_lora_weights.py \
    --model-path /path/to/bunny_lora_weights \
    --model-base /path/to/base_llm_model \
    --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \
    --save-model-path /path/to/merged_model
```
Or you can evaluate the LoRA weights without merging, as shown below. Note that you should change `conv-mode` to `minicpm`/`phi3`/`llama` when `MODEL_TYPE` is `minicpm`/`phi-3`/`llama3-8b`.
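If you do merge the weights first, a quick way to confirm the merged checkpoint is usable is to load it with the standard Hugging Face `transformers` Auto classes. This is a minimal sketch, assuming the merged model is saved in the usual Hugging Face layout and needs `trust_remote_code=True`; the path is the hypothetical `--save-model-path` from the command above.

```python
# Minimal load check for a merged checkpoint (assumes the standard Hugging Face
# layout; device_map="auto" requires accelerate).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/merged_model"  # hypothetical --save-model-path from above
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(f"loaded {model.config.model_type} with {sum(p.numel() for p in model.parameters()):,} parameters")
```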
- Refer to MME GitHub to download the benchmark dataset and put `MME_Benchmark_release_version` under `eval/mme`.
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/mme.sh
```
The responses and scores can be found in `eval/mme/answers_upload`.
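For reference, MME scores each subtask as accuracy plus accuracy+, where accuracy+ only counts images whose two yes/no questions are both answered correctly. The sketch below is a hedged illustration of that metric; it assumes you have already parsed an answer file into (image, ground truth, prediction) tuples, which may not match the exact file layout produced by the script.

```python
# Hedged sketch of the per-subtask MME metric: accuracy (%) + accuracy+ (%),
# where accuracy+ requires both questions of an image to be answered correctly.
from collections import defaultdict

def mme_subtask_score(records):
    """records: list of (image_name, ground_truth, prediction) tuples (assumed format)."""
    per_image = defaultdict(list)
    for image, gt, pred in records:
        correct = pred.strip().lower().startswith(gt.strip().lower())  # yes/no match
        per_image[image].append(correct)
    flags = [flag for image_flags in per_image.values() for flag in image_flags]
    accuracy = 100.0 * sum(flags) / len(flags)
    accuracy_plus = 100.0 * sum(all(fs) for fs in per_image.values()) / len(per_image)
    return accuracy + accuracy_plus  # at most 200 per subtask
```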
- Refer to MMBench GitHub to download the benchmark dataset. We support `MMBench-Dev`, `MMBench-Test`, `MMBench-Dev (cn)` and `MMBench-Test (cn)`. Please note that only the files downloaded via the legacy links are supported. Put `MMBench_DEV_EN_legacy.tsv`, `MMBench_TEST_EN_legacy.tsv`, `MMBench_DEV_CN_legacy.tsv` or `MMBench_TEST_CN_legacy.tsv` under `eval/mmbench`.
- Update `SPLIT`, `LANG` (`en`/`cn`), `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/mmbench.sh
```
The response file can be found in `eval/mmbench/answers_upload`. You can submit the Excel file to the submission link to obtain the evaluation scores.
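Before uploading, it can help to glance at the generated spreadsheet. A small pandas sketch follows; the file name is illustrative (it follows your `SPLIT`, `LANG` and `TARGET_DIR` settings) and the column names assume the usual MMBench submission layout.

```python
# Peek at the MMBench submission sheet before uploading (illustrative path;
# reading .xlsx with pandas requires openpyxl).
import pandas as pd

df = pd.read_excel("eval/mmbench/answers_upload/mmbench_dev_en/bunny-lora-phi-2.xlsx")
print(df.shape)
print(df.columns.tolist())              # columns of the submission sheet
print(df["prediction"].value_counts())  # assumes a 'prediction' column with the chosen options
```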
- Refer to SEED-Bench Instruction to download the images and videos, and put the images under `eval/seed-bench/SEED-Bench-image` and the videos under `eval/seed-bench/SEED-Bench-video`. Then, extract the middle frame of each downloaded video by running the commands below (a sketch of what the extraction does appears at the end of this section):

  ```shell
  pip install av decord
  python eval/seed-bench/extract_video_frames.py
  ```
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/lora/seedbench.sh
```
The response file can be found in `eval/seed-bench/answers_upload` and the scores can be found in `eval/seed-bench/scores`.
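As referenced above, the frame-extraction step keeps one frame per video. Here is a hedged sketch of the idea using decord; the file names are illustrative, and `eval/seed-bench/extract_video_frames.py` handles the real naming and looping.

```python
# Grab the frame at the temporal midpoint of a video with decord
# (illustrative paths; the actual script loops over all SEED-Bench videos).
from decord import VideoReader, cpu
from PIL import Image

vr = VideoReader("eval/seed-bench/SEED-Bench-video/example.mp4", ctx=cpu(0))
middle_frame = vr[len(vr) // 2].asnumpy()  # H x W x 3 uint8 array
Image.fromarray(middle_frame).save("eval/seed-bench/SEED-Bench-image/example.png")
```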
- Refer to MMMU HuggingFace to download the benchmark dataset and put `MMMU` under `eval/mmmu`.
- Update `SPLIT`, `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly. You may add `--small-gpu-usage` to avoid `CUDA out of memory` errors.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/mmmu.sh
```
The response file can be found in `eval/mmmu/answers_upload`.

For the validation set, you can use `eval_mmmu.py` to obtain the scores:

```shell
python eval/mmmu/eval_mmmu.py \
    --output-path ./eval/mmmu/answers_upload/$SPLIT/$TARGET_DIR.json
```

For the test set, you can submit the `json` response file to the submission link to obtain the evaluation scores.
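It can be worth inspecting the response file before scoring or submitting. A small sketch (the `SPLIT` and `TARGET_DIR` values in the path are illustrative):

```python
# Inspect the MMMU response file; handles either a dict keyed by question id
# or a list of records, since the exact layout is not assumed here.
import json

with open("eval/mmmu/answers_upload/validation/bunny-lora-phi-2.json") as f:
    answers = json.load(f)
print(len(answers), "answers")
sample = list(answers.items())[:3] if isinstance(answers, dict) else answers[:3]
print(sample)
```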
- Refer to CMMMU HuggingFace to download the benchmark dataset and put `CMMMU` under `eval/cmmmu`.
- Update `SPLIT`, `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly. You may add `--small-gpu-usage` to avoid `CUDA out of memory` errors.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/cmmmu.sh
```
The response file can be found in `eval/cmmmu/answers_upload`.

For the validation set, you can use `eval_script.py` to obtain the scores:

```shell
python eval/cmmmu/eval_script.py \
    --output_path ./eval/cmmmu/answers_upload/$SPLIT/$TARGET_DIR.jsonl
```

For the test set, you can submit the `jsonl` response file to the submission link to obtain the evaluation scores.
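Unlike MMMU's single JSON file, the CMMMU output is JSON Lines, one record per line. A quick sketch for inspecting it (illustrative path; the field names printed may differ from what your run produces):

```python
# Count the JSONL records and print the first one to check its fields.
import json

with open("eval/cmmmu/answers_upload/val/bunny-lora-phi-2.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(len(records), "records")
print(records[0])
```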
- Download COCO 2015 Test images and put `test2015` under `eval/vqav2`. Then:

  ```shell
  tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
  ```
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/lora/vqav2.sh
```
The response file can be found in `eval/vqav2/answers_upload`. You can submit the `json` response file to the submission link (Test-Dev Phase) to obtain the evaluation scores.
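A quick format check before uploading can save a rejected submission. The sketch below assumes the usual VQAv2 server format, a JSON list of `question_id`/`answer` records, and an illustrative file path:

```python
# Sanity-check the VQAv2 result file before submitting (illustrative path).
import json

with open("eval/vqav2/answers_upload/bunny-lora-phi-2.json") as f:
    results = json.load(f)
assert isinstance(results, list), "expected a JSON list of records"
assert {"question_id", "answer"} <= results[0].keys(), "unexpected record fields"
print(len(results), "answers ready for submission")
```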
- Download the images of GQA, unzip the archive and put `images` under `eval/gqa`. Then:

  ```shell
  tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz
  ```
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/lora/gqa.sh
```
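The GQA script evaluates on the testdev-balanced split. If you want to recompute the headline accuracy yourself, here is a hedged sketch of the exact-match rule; `predictions.json` is a hypothetical file mapping question id to predicted answer, and the official GQA evaluation additionally reports metrics beyond accuracy.

```python
# Hedged sketch: exact-match accuracy on GQA testdev-balanced.
import json

questions = json.load(open("eval/gqa/testdev_balanced_questions.json"))
predictions = json.load(open("predictions.json"))  # hypothetical {question_id: answer} file

correct = sum(
    predictions.get(qid, "").strip().lower() == q["answer"].strip().lower()
    for qid, q in questions.items()
)
print(f"accuracy: {correct / len(questions):.4f}")
```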
- Refer to ScienceQA Google Drive to download `test.zip`, `problems.json` and `pid_splits.json`, unzip `test.zip` and put them under `eval/scienceqa`.
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/scienceqa.sh
```
The responses and the scores can be found in `eval/scienceqa/results`.
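For orientation, `pid_splits.json` lists the problem ids per split and `problems.json` stores each problem with the index of its correct choice, so test accuracy reduces to a simple comparison. A hedged sketch, with `predictions.json` as a hypothetical map from problem id to predicted choice index:

```python
# Hedged sketch: accuracy on the ScienceQA test split.
import json

problems = json.load(open("eval/scienceqa/problems.json"))
test_pids = json.load(open("eval/scienceqa/pid_splits.json"))["test"]
predictions = json.load(open("predictions.json"))  # hypothetical {pid: choice index} file

correct = sum(predictions.get(pid) == problems[pid]["answer"] for pid in test_pids)
print(f"accuracy: {correct / len(test_pids):.4f}")
```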
- Download COCO 2014 Val images and put `val2014` under `eval/pope`. Then, refer to POPE GitHub to download the benchmark dataset and put the three `json` files under `eval/pope/coco`.
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/pope.sh
```
We report the averaged F1-score of three categories (random, popular and adversarial).
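The metric itself is straightforward: F1 with "yes" as the positive class, computed per category and then averaged. A minimal sketch, with tiny inline lists standing in for the ground truths and predictions read from the answer files:

```python
# Per-category F1 ("yes" is the positive class), then the mean over the three
# POPE categories. The short inline lists are placeholders for the real data.
from sklearn.metrics import f1_score

def pope_f1(gts, preds):
    to_int = lambda xs: [1 if str(x).strip().lower().startswith("yes") else 0 for x in xs]
    return f1_score(to_int(gts), to_int(preds))

per_category = {
    "random": pope_f1(["yes", "no", "yes"], ["yes", "no", "yes"]),
    "popular": pope_f1(["no", "yes", "no"], ["no", "yes", "yes"]),
    "adversarial": pope_f1(["yes", "no", "yes"], ["no", "no", "yes"]),
}
print(per_category, "average F1:", sum(per_category.values()) / len(per_category))
```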
- Refer to MM-Vet GitHub to download the benchmark dataset and put `images` under `eval/mm-vet`.
- Update `MODEL_TYPE`, `MODEL_BASE` and `TARGET_DIR` accordingly.
```shell
CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/mmvet.sh
```
The response file can be found in `eval/mm-vet/answers_upload`. You can submit the `json` response file to the submission link to obtain the evaluation scores.
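As a last check before uploading, the MM-Vet evaluator expects a single JSON object mapping each question id to the model's free-form answer (see the MM-Vet repository for the authoritative format). An illustrative sketch:

```python
# Check that the MM-Vet response file is a dict of question id -> answer string
# (illustrative path; confirm the expected format against the MM-Vet repo).
import json

with open("eval/mm-vet/answers_upload/bunny-lora-phi-2.json") as f:
    answers = json.load(f)
assert isinstance(answers, dict), "expected a JSON object keyed by question id"
first_key = next(iter(answers))
print(len(answers), "answers; e.g.", first_key, "->", str(answers[first_key])[:80])
```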
SpatialBench is proposed in SpatialBot. It tests models' performance on spatial understanding and reasoning.
- Download the dataset from HuggingFace.
- Please refer to SpatialBot GitHub for the evaluation code.