Rethinking Modality Alignment in Multi-Modal Large Language Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multi-Modal Large Language Model
Abstract: Multi-modal Large Language Models (MLLMs) demonstrate remarkable proficiency across a wide range of Vision-Language (VL) tasks. However, most advances have focused on handling longer sequences containing detailed visual information and on scaling up high-quality VL corpora. Prevalent VL alignment modules (e.g., the adapter layer in LLaVA and the Q-Former in BLIP-2) struggle to adequately align visual inputs with the LLM. They rely on the powerful LLM to decode sub-optimally aligned visual features into word sequences of the desired format, which can cause hallucinations and reduce the reliability of visual reasoning. Additionally, the LLM's causal attention does not effectively capture the relationships among visual embeddings. To tackle these issues, we rethink modality alignment in MLLMs and present VL Superior Alignment (VLSA), a framework designed to decouple the alignment of the LLM with visual inputs. VLSA has two main stages. The perception alignment stage, which combines compressive high-resolution image encoding with reconstructive training based on Latent Diffusion Models (LDMs), reduces information loss in visual encoding and better models the spatial connections between image sub-regions. The cognition alignment stage strengthens the LLM's understanding of high-level visual semantics and low-level image appearance simultaneously; this is achieved through instructions that ask the model to predict the codebook indices produced by a Vector Quantized (VQ) encoder and the pixel values within designated areas. Extensive experiments across 20 MLLM benchmarks show consistent improvements from VLSA, demonstrating the effectiveness of our method. In service to the MLLM research community, our code and model checkpoints will be made publicly available.
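
As a rough illustration of the cognition-alignment objective described in the abstract (predicting VQ codebook indices and pixel values within designated areas), the sketch below shows one way such a combined loss could be wired up in PyTorch. This is not the authors' released implementation; every module name, loss weighting, and tensor shape here is an assumption for illustration only.

```python
# Minimal sketch (not the authors' code) of a cognition-alignment objective:
# a cross-entropy loss over VQ codebook indices plus a regression loss on
# pixel values at positions where a pixel-prediction instruction applies.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CognitionAlignmentHead(nn.Module):
    """Two lightweight heads on top of LLM hidden states:
    one classifies VQ codebook indices, the other regresses pixel values."""

    def __init__(self, hidden_dim: int, codebook_size: int, patch_pixels: int):
        super().__init__()
        self.vq_head = nn.Linear(hidden_dim, codebook_size)    # codebook-index logits
        self.pixel_head = nn.Linear(hidden_dim, patch_pixels)  # raw pixel regression

    def forward(self, hidden_states, vq_targets, pixel_targets, pixel_mask):
        # hidden_states: (B, T, H) LLM outputs at visually grounded positions
        # vq_targets:    (B, T)    codebook indices from a frozen VQ encoder
        # pixel_targets: (B, T, P) normalized pixel values of designated areas
        # pixel_mask:    (B, T)    1 where pixel prediction is instructed
        vq_logits = self.vq_head(hidden_states)                # (B, T, codebook_size)
        vq_loss = F.cross_entropy(vq_logits.flatten(0, 1), vq_targets.flatten())

        pixel_pred = self.pixel_head(hidden_states)            # (B, T, P)
        pixel_err = F.l1_loss(pixel_pred, pixel_targets, reduction="none").mean(-1)
        pixel_loss = (pixel_err * pixel_mask).sum() / pixel_mask.sum().clamp(min=1)

        return vq_loss + pixel_loss                            # equal weighting assumed


# Toy usage with random tensors, only to show the expected shapes.
B, T, H, K, P = 2, 16, 512, 8192, 3 * 16 * 16
head = CognitionAlignmentHead(H, K, P)
loss = head(
    torch.randn(B, T, H),
    torch.randint(0, K, (B, T)),
    torch.rand(B, T, P),
    torch.ones(B, T),
)
loss.backward()
```

Cross-entropy handles the discrete codebook indices, while a masked L1 term covers pixel-value regression; the paper may weight, schedule, or formulate these objectives differently.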
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9975