Track: long paper (up to 8 pages)
Keywords: multimodal large language model
Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting and spatial reasoning. We find that this limitation arises not merely from the choice of vision encoder, but from the lack of explicit supervision on visual representations during training, which causes detailed visual information to be gradually degraded even when strong vision foundation models (VFMs) are used as vision encoders. To address this, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained VFMs. By explicitly enforcing this alignment, VIRAL preserves rich visual information within the MLLM while enabling it to leverage complementary visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs.
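To make the alignment idea concrete, below is a minimal sketch of a representation-alignment regularizer in PyTorch. It is an assumption-laden illustration, not the paper's implementation: the tensor shapes, the linear projection head, the cosine-similarity objective, and the `lambda_align` weight are all hypothetical choices consistent with the abstract's description of aligning MLLM visual-token states to frozen VFM features.

```python
# Sketch only: shapes, projection head, and loss weight are illustrative
# assumptions, not the submission's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAlignmentLoss(nn.Module):
    """Aligns MLLM hidden states at visual-token positions with frozen
    VFM patch features via negative cosine similarity."""

    def __init__(self, mllm_dim: int, vfm_dim: int):
        super().__init__()
        # Lightweight projection from the MLLM hidden size to the VFM feature size.
        self.proj = nn.Linear(mllm_dim, vfm_dim)

    def forward(self, mllm_hidden: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (B, N_visual_tokens, mllm_dim) -- intermediate MLLM states
        #              gathered at the visual-token positions.
        # vfm_feats:   (B, N_visual_tokens, vfm_dim)  -- frozen VFM patch features,
        #              assumed resampled to the same number of tokens.
        pred = F.normalize(self.proj(mllm_hidden), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)  # no gradient into the VFM
        return 1.0 - (pred * target).sum(dim=-1).mean()


# Illustrative usage: the alignment term is added to the standard
# next-token prediction loss with a hypothetical weight lambda_align.
# total_loss = lm_loss + lambda_align * align_loss(mllm_hidden, vfm_feats)
```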
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59