VeLAR: Vision-oriEnted Language-Attentive token Reduction for multimodal large language models

Yizheng Sun; Yanze Xin; Hao Li; Chenghua Lin; Riza Batista-Navarro

26 Sept 2024 (modified: 11 Oct 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Multi-modal Large Language Models, Token Reduction, Model Acceleration, Foundation Models, Vision-Language Learning, Instruction Tuning

TL;DR: We propose a token reduction framework for MLLMs that reduces vision token redundancy in vision-language learning, cutting computational costs by up to 42% while maintaining, and on some benchmarks surpassing, the original model's performance.

Abstract: Multi-modal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. However, architectures that pass all vision tokens to the large language model (LLM), such as LLaVA, incur high computational costs because of the large number of vision tokens. Approaches that use Q-formers as vision-language connectors reduce this overhead by generating fewer vision tokens, but they often suffer performance degradation. In this paper, we propose VeLAR, a progressive token reduction method that retains the performance of LLaVA-based MLLMs while substantially reducing computational load. We introduce a lightweight cross-attention decision module in which vision tokens attend to language tokens. This module is inserted at several layers of the LLM to compute a relevance score for each vision token and dynamically decide whether to prune it. During training, we apply a target pruning ratio and use Gumbel-Softmax activation with attention masking to keep the pruning process differentiable; at inference, the pruning ratio can be adjusted flexibly to suit different computational budgets without re-training. By progressively pruning redundant vision tokens throughout the LLM backbone, our method removes 87.5% of vision tokens by the final layer and reduces FLOPs by up to 42%. Across 12 multi-modal benchmarks, the average performance loss is less than 1%, and VeLAR outperforms the original model on 7 of them.
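The decision step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single attention head with no learned query/key/value projections, a hypothetical scoring vector `w_score` that maps the attended language context to a scalar, and hard top-k selection at inference as one plausible way to apply a pruning ratio. The function names `relevance_scores` and `prune_by_ratio` are invented for illustration, and the Gumbel-Softmax training path is not covered.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relevance_scores(vision_tokens, language_tokens, w_score):
    """Score each vision token by letting it attend to the language tokens.

    Vision tokens act as queries, language tokens as keys and values.
    `w_score` is a hypothetical projection to a scalar (an assumption).
    """
    dim = len(vision_tokens[0])
    scores = []
    for v in vision_tokens:
        weights = softmax([dot(v, t) / math.sqrt(dim) for t in language_tokens])
        context = [
            sum(w * t[d] for w, t in zip(weights, language_tokens))
            for d in range(dim)
        ]
        scores.append(dot(context, w_score))
    return scores


def prune_by_ratio(vision_tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of vision tokens, preserving order."""
    k = max(1, math.ceil(len(vision_tokens) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:k])
    return [vision_tokens[i] for i in kept_idx], kept_idx


# Toy data: 8 vision tokens and 3 language tokens of dimension 4.
random.seed(0)
dim = 4
vision = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
language = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
w_score = [random.gauss(0, 1) for _ in range(dim)]

scores = relevance_scores(vision, language, w_score)
kept, kept_idx = prune_by_ratio(vision, scores, keep_ratio=0.5)
print(len(kept), kept_idx)
```

Because `keep_ratio` is an ordinary argument here, it can be changed at inference without touching any weights, which mirrors the flexibility the abstract describes. As a purely arithmetic illustration, applying such a step three times with `keep_ratio=0.5` would leave 12.5% of the vision tokens, which is consistent with the 87.5% figure, although the abstract does not state the actual schedule.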

Primary Area: foundation or frontier models, including LLMs

Submission Number: 7058
