Keywords: Multimodal Large Language Model; Projector; Token Compression
Abstract: The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, serves as a crucial component in Multimodal Large Language Models (MLLMs).
However, measuring the effectiveness of projectors in vision-language alignment remains under-explored, with current evaluations relying primarily on the performance of MLLMs on downstream tasks.
Motivated by this gap, this study conducts an in-depth examination of the projector module by analyzing the vision-language semantic flow within MLLMs.
Our findings reveal that compressive projectors (e.g., the Q-Former) reduce the number of visual tokens by abstracting visual patches into a limited set of semantic concepts, such as objects or attributes, leading to a deficiency we term ``double abstraction'' in MLLMs. This phenomenon involves i) an initial visual semantic abstraction by the projector in the vision modality, guided by pre-defined query tokens, and ii) a secondary semantic extraction by the LLM in the language modality, based on text instructions.
This double abstraction is inefficient during training and causes cumulative deficiencies in visual semantics. To address this issue, we propose the key insight of ``\textbf{De}coupling Token \textbf{Co}mpression from Semantic Abstraction \textbf{(DeCo)}'', where projectors compress visual tokens at the patch level non-semantically, while the LLM fully manages semantic understanding and abstraction.
Consequently, we employ a simple compressor, i.e., 2D Adaptive Pooling, to downsample visual patches in a parameter-free manner.
Empirical evaluations demonstrate that 2D Adaptive Pooling outperforms traditional compressive projectors in both performance and efficiency, achieving gains of 0.9\%, 7.1\%, and 2.9\% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA tasks, respectively, while utilizing fewer trainable parameters and achieving faster convergence.
Furthermore, it preserves the spatial locality of visual patches and exhibits robustness across various MLLM configurations, including different vision backbones, image resolutions, and LLMs.
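For illustration, the following is a minimal sketch of a parameter-free patch-level compressor based on 2D adaptive average pooling, as described in the abstract. This is not the authors' released code: the function name, grid sizes, and use of PyTorch's F.adaptive_avg_pool2d are assumptions for the example.

```python
# Minimal, illustrative sketch (not the authors' released implementation):
# a parameter-free projector that compresses ViT patch tokens via
# 2D adaptive average pooling, operating on the patch grid rather than
# on abstracted semantic queries.
import torch
import torch.nn.functional as F


def adaptive_pool_compress(patch_tokens: torch.Tensor, out_grid: int = 12) -> torch.Tensor:
    """Downsample visual patch tokens at the patch level, without semantic abstraction.

    patch_tokens: (batch, num_patches, dim), e.g. 576 = 24x24 patches from a ViT.
    out_grid:     target grid side length, e.g. 12 -> 144 output tokens.
    """
    b, n, d = patch_tokens.shape
    grid = int(n ** 0.5)
    assert grid * grid == n, "expects a square patch grid"

    # (B, N, D) -> (B, D, H, W) so pooling acts over the spatial layout,
    # preserving the 2D locality of the visual patches.
    x = patch_tokens.transpose(1, 2).reshape(b, d, grid, grid)
    x = F.adaptive_avg_pool2d(x, output_size=(out_grid, out_grid))

    # Back to token form for the LLM: (B, out_grid*out_grid, D).
    return x.flatten(2).transpose(1, 2)


if __name__ == "__main__":
    vit_tokens = torch.randn(2, 576, 1024)           # hypothetical CLIP-ViT patch features
    compressed = adaptive_pool_compress(vit_tokens)  # -> (2, 144, 1024)
    print(compressed.shape)
```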
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5860