Keywords: Multi-modal Large Language Model, Feature Alignment, Vision-Language Understanding
Abstract: Prevalent Vision-Language (VL) alignment techniques within Multi-modal Large Language Models (MLLMs) struggle to fully align the language model with visual inputs, resulting in hallucinations and undermining reliability. We rethink modality alignment in MLLMs from the perspective of reducing information loss and present an efficient plug-in, VL Superior Alignment (VLSA), which decouples alignment into two stages. The first stage, Perception Alignment, minimizes information loss in visual encoding through compressive encoding of high-resolution images and a novel reconstructive training scheme that leverages latent diffusion models. The second stage, Cognition Alignment, reduces information loss in response generation by enhancing the language model's ability to grasp both high-level visual semantics and low-level image appearances, achieved through novel auxiliary self-supervised fine-tuning (SSFT) objectives. Extensive experiments across over 25 MLLM benchmarks and 7 MLLM architectures, thorough ablations, and analyses of computational overhead demonstrate that VLSA improves both performance and efficiency. In service to the MLLM research community, our code and model checkpoints will be made publicly available.
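The abstract gives no implementation details, so the following is a minimal PyTorch sketch of how a two-stage alignment plug-in of this kind could be wired up. Everything here is an assumption for illustration: the class name `VLSAPlugin`, the feature dimensions, and the loss choices (plain MSE standing in for the latent-diffusion reconstruction objective of Perception Alignment, and a cosine term standing in for the auxiliary SSFT objectives of Cognition Alignment) are not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLSAPlugin(nn.Module):
    """Illustrative two-stage alignment sketch; names, shapes, and losses
    are assumptions, not the paper's implementation."""

    def __init__(self, vis_dim=1024, txt_dim=4096, code_dim=256):
        super().__init__()
        # Stage 1: compressive encoding of (high-resolution) visual features.
        self.compress = nn.Linear(vis_dim, code_dim)
        # Reconstruction head standing in for a latent-diffusion decoder.
        self.reconstruct = nn.Linear(code_dim, vis_dim)
        # Projection into the LLM embedding space for Stage 2.
        self.to_llm = nn.Linear(code_dim, txt_dim)

    def perception_alignment_loss(self, vis_feats):
        # Limit information loss: compressed codes must still explain the
        # original visual features (MSE replaces diffusion denoising here).
        codes = self.compress(vis_feats)
        return F.mse_loss(self.reconstruct(codes), vis_feats), codes

    def cognition_alignment_loss(self, codes, llm_hidden):
        # Auxiliary self-supervised objective: pull the LLM's hidden states
        # toward the visual codes so generated responses stay image-grounded.
        proj = self.to_llm(codes)
        return 1.0 - F.cosine_similarity(proj.mean(1), llm_hidden.mean(1)).mean()


# Toy usage with random tensors (batch=2, 16 visual tokens, 8 text tokens).
plugin = VLSAPlugin()
vis = torch.randn(2, 16, 1024)   # stand-in for vision-encoder outputs
hid = torch.randn(2, 8, 4096)    # stand-in for LLM hidden states
l_per, codes = plugin.perception_alignment_loss(vis)
l_cog = plugin.cognition_alignment_loss(codes, hid)
loss = l_per + l_cog
print(f"perception={l_per.item():.4f} cognition={l_cog.item():.4f}")
```

Decoupling the two losses mirrors the abstract's two-stage design: the reconstruction term can be trained before the LLM is involved, and the grounding term fine-tunes generation afterward.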
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17075