The official implementation of "A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs" (2024).
Wangbo Zhao1, Yizeng Han2, Jiasheng Tang2,3, Zhikai Li1, Yibing Song2,3, Kai Wang1, Zhangyang Wang4, Yang You1
1National University of Singapore, 2DAMO Academy, Alibaba Group, 3Hupan Lab, 4The University of Texas at Austin
(a) Small VLM-guided visual token pruning in a large VLM (SGP). We update a global attention map aggregated from all layers of a small VLM. This global attention map is used to rank visual tokens and guide visual token pruning in a large VLM.
(b) Aggregation of attention maps in SGP. We aggregate the attention scores that visual tokens receive from prompt tokens and generated tokens across all heads and layers in the small LM. Higher scores indicate greater significance.
(c) Inference with Small VLM Early Exiting (SEE). When the early-exiting decision score from the small VLM is sufficient, the large VLM is not invoked.
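
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function names (`aggregate_visual_scores`, `select_tokens_to_keep`, `exit_score`, `answer`), the `keep_ratio` and `threshold` parameters, and the attention layout `attn[layer][head][query][key]` are all hypothetical. Summing attention without normalization, and using the geometric mean of token probabilities as the early-exiting score with a "higher means exit" rule, are also assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

# Assumed layout: attn[layer][head][query_pos][key_pos] -> attention weight.
Attention = Sequence[Sequence[Sequence[Sequence[float]]]]


def aggregate_visual_scores(
    attn: Attention, visual_idx: Sequence[int], query_idx: Sequence[int]
) -> List[float]:
    """Sum the attention each visual token receives from the query positions
    (prompt and generated tokens) over every head and every layer."""
    scores = [0.0] * len(visual_idx)
    for layer in attn:
        for head in layer:
            for q in query_idx:
                for j, v in enumerate(visual_idx):
                    scores[j] += head[q][v]
    return scores


def select_tokens_to_keep(
    scores: Sequence[float], visual_idx: Sequence[int], keep_ratio: float
) -> List[int]:
    """Return the positions of the top-scoring visual tokens, in original order."""
    n_keep = max(1, int(len(visual_idx) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(visual_idx[j] for j in ranked[:n_keep])


def exit_score(token_probs: Sequence[float]) -> float:
    """Hypothetical confidence score: geometric mean of token probabilities."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def answer(
    small_answer: str,
    token_probs: Sequence[float],
    threshold: float,
    run_large: Callable[[], str],
) -> str:
    """Keep the small model's answer if it is confident enough;
    otherwise fall back to the (pruned) large model."""
    if exit_score(token_probs) >= threshold:
        return small_answer
    return run_large()
```

In this sketch, the small model runs first and yields both the attention maps and the token probabilities of its answer. `answer` decides whether to stop there; if not, `select_tokens_to_keep` gives the visual token positions that the large model would retain.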
Please refer to the documentation of InternVL to set up the environment and prepare the data for evaluation.
As an example, run 'bash textvqa2B-26B.sh', which uses InternVL2-2B as the small model to accelerate the large model InternVL2-26B.
If you find our work useful, please consider citing us.
@article{zhao2024stitch,
title={A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs},
author={Zhao, Wangbo and Han, Yizeng and Tang, Jiasheng and Li, Zhikai and Song, Yibing and Wang, Kai and Wang, Zhangyang and You, Yang},
journal={arXiv preprint arXiv:2412.03324},
year={2024}
}
SGL is built with reference to the code of the following projects: InternVL, FastV, Qwen2-VL, and LLaVA-OneVision.
🔥🔥🔥 If you are interested in this work and hope to cooperate with us, please drop an email to [email protected] 🔥🔥🔥