VISA: Preserving Fine-Grained Perception in MLLMs via Visual Semantic Anchoring

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Fine-Grained Perception, Semantic Drift, Plug-and-Play Module
TL;DR: We propose VISA, a training framework that uses a powerful VFM as a semantic anchor, providing direct visual supervision to an MLLM's intermediate layers and counteracting the loss of fine-grained detail caused by indirect, text-only training.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in general-purpose visual understanding. However, their training paradigm faces a fundamental bottleneck: high-fidelity visual representations must be learned from indirect, text-based objectives alone. This inefficient process leads to a phenomenon we term semantic attenuation, in which internal visual representations lose critical fine-grained detail, hindering performance on tasks that require precise perception. To address this representation learning challenge, we propose **VIsual Semantic Anchoring (VISA)**, a general training framework that introduces a direct, vision-native supervisory signal into the MLLM's intermediate layers. By anchoring the MLLM's representations to the rich feature space of a pretrained Vision Foundation Model (VFM) through representation alignment, VISA ensures that the visual pathway learns and maintains a detailed, structured understanding of the visual world. A composite loss, enforcing both point-wise semantic alignment and structural consistency, makes this process effective. Extensive experiments across diverse benchmarks and model backbones demonstrate that, by fostering more robust internal representations, VISA significantly enhances fine-grained reasoning, improves factual grounding against hallucinations, and accelerates training convergence, establishing an effective paradigm for developing more perceptually robust MLLMs. Our code is available at \url{https://anonymous.4open.science/r/anonymous_VISA-D482/}
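The abstract's composite loss can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `visa_loss`, the linear projection `proj`, the weighting `lam`, and the choice of cosine similarity for the point-wise term and Gram-matrix matching for the structural term are all assumptions made for exposition.

```python
import numpy as np

def _normalize(x, eps=1e-8):
    # L2-normalize feature vectors along the last axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def visa_loss(mllm_feats, vfm_feats, proj, lam=1.0):
    """Illustrative composite anchoring loss (assumed form):
    point-wise cosine alignment to VFM features plus structural
    consistency between pairwise-similarity (Gram) matrices."""
    # Project MLLM intermediate hidden states into the VFM feature space.
    h = _normalize(mllm_feats @ proj)   # (N, d_v)
    v = _normalize(vfm_feats)           # (N, d_v)
    # Point-wise semantic alignment: 1 - cosine similarity per token.
    l_point = np.mean(1.0 - np.sum(h * v, axis=-1))
    # Structural consistency: match pairwise similarity structure.
    l_struct = np.mean((h @ h.T - v @ v.T) ** 2)
    return l_point + lam * l_struct

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 32))   # hypothetical MLLM hidden states
anchor = rng.normal(size=(16, 24))   # hypothetical VFM patch features
proj = rng.normal(size=(32, 24))     # learned projection (random here)
print(visa_loss(hidden, anchor, proj))
```

When the projected features match the anchor exactly, both terms vanish, so the loss only penalizes deviation from the VFM's semantics and its pairwise structure.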
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11360