ACT-IN-LLM: Adaptively Compression Vision Tokens in LLM for High-Resolution Multimodal Large Language Models

Xinpeng Ding; Lewei Yao; Jianhua Han; Lanqing HONG; Hang Xu; Wei Zhang; Xiaomeng Li

20 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Multimodal Large Language Models; High-resolution; Efficiency

Abstract: High-resolution inputs enable Multimodal Large Language Models (MLLMs) to capture intricate visual details, thereby enhancing comprehension. However, the quadratic complexity of the self-attention mechanism poses significant computational and memory challenges as image resolution increases, particularly with long sequences of vision tokens. Existing approaches generally alleviate these issues by reducing the number of vision tokens before feeding them into the LLM. Although efficient, this Pre-LLM compression strategy fails to match the performance of models that use all tokens, particularly on high-resolution benchmarks. Our experiments reveal that the performance gap arises from this strategy's limited ability to select important visual tokens in early LLM layers, which leads to the irretrievable loss of critical information. To overcome these challenges, we propose a new strategy that Adaptively Compresses vision Tokens within different LLM layers, named ACT-IN-LLM. Our approach retains all tokens throughout the layers so that no vital information is lost, while compressing only the key and value tokens in the self-attention mechanism to reduce computational costs. The layer-wise compression of ACT-IN-LLM is guided by the interaction information between vision and text tokens, leading to more accurate selections. Our theoretical analysis and extensive experiments demonstrate the effectiveness of ACT-IN-LLM, showing a 6.3% improvement over existing token compression techniques. It also achieves competitive performance with non-compression methods, while reducing training/inference time by approximately 20% and vision tokens by approximately 60%.
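
The abstract does not spell out the selection rule or how it varies across layers, so the following is only a minimal sketch of the general idea for a single attention layer, under stated assumptions: vision tokens are ranked by the mean attention they receive from text tokens, the top fraction is kept as keys and values, and every token still issues a query. The names `select_vision_kv`, `compressed_kv_attention`, and `keep_ratio` are hypothetical, hidden states stand in for the query/key/value projections, and no causal mask is applied.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_vision_kv(text, vision, keep_ratio):
    """Assumed rule: keep the vision tokens that text tokens attend to most."""
    scale = 1.0 / math.sqrt(len(vision[0]))
    scores = [0.0] * len(vision)
    for q in text:
        weights = softmax([dot(q, k) * scale for k in vision])
        for i, w in enumerate(weights):
            scores[i] += w / len(text)  # mean attention over text queries
    n_keep = max(1, round(keep_ratio * len(vision)))
    ranked = sorted(range(len(vision)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])  # restore original token order

def compressed_kv_attention(vision, text, keep_ratio=0.4):
    """All tokens act as queries; only selected vision tokens act as keys/values."""
    keep = select_vision_kv(text, vision, keep_ratio)
    queries = vision + text                 # no token is dropped from the sequence
    kv = [vision[i] for i in keep] + text   # compressed key/value set
    dim = len(vision[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in kv])
        outputs.append([sum(w * v[j] for w, v in zip(weights, kv)) for j in range(dim)])
    return outputs, keep

# Toy usage: 5 vision tokens and 2 text tokens of dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
text = [[1.0, 0.0], [1.0, 0.0]]
outputs, keep = compressed_kv_attention(vision, text, keep_ratio=0.4)
full_scores = len(vision + text) ** 2                       # uncompressed score matrix size
compressed_scores = len(outputs) * (len(keep) + len(text))  # queries x compressed keys
```

In this toy case the output still has one row per input token (7 rows), while the attention score matrix shrinks from 7 × 7 = 49 entries to 7 × 4 = 28, which illustrates how compressing only keys and values can cut cost without discarding any token from the sequence.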

Primary Area: applications to computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2000
