Visual Representation Alignment for Multimodal Large Language Model

ACL ARR 2026 January Submission206 Authors

22 Dec 2025 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Multimodal Large Language Models, Large Vision Language Models
Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details from the vision encoder during training. In this paper, we present Visual Representation Alignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from its own vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments consistently demonstrate performance improvements across all tasks on widely adopted multimodal benchmarks, with gains reaching up to 17.3% and an average improvement of 9.4% over the baseline. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 206
Loading