Dissecting and Demystifying Region-Based Representations in MLLMs

ICLR 2026 Conference Submission19156 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision Language Models, Multimodal Models
TL;DR: Dissecting and Demystifying Region-Based Representations in MLLMs
Abstract: Multimodal Large Language Models (MLLMs) typically process visual information as a flat sequence of image patch tokens, which is computationally expensive and lacks explicit semantic structure. This paper provides a systematic, vision-centric analysis of region-based representations, which group patches into semantically meaningful regions, as a more efficient and interpretable alternative. Our investigation is grounded in a key finding: MLLM performance is surprisingly robust to the input order of patch tokens, as the visual encoder already encodes spatial information within the patches. This insight provides a foundational justification for reorganizing patches into semantically coherent regions. We further identify that the success of region-based methods depends on the quality of the visual features, particularly their smoothness and locality. We systematically evaluate how to enhance these properties through vision backbone selection, feature normalization, and hybrid partitioning strategies. Through comprehensive evaluations, we demonstrate that optimized region-based representations are a competitive alternative to patch-based ones, offering a compelling path towards more efficient, interpretable, and performant MLLMs.
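To make the abstract's core idea concrete, the sketch below illustrates one plausible way to "group patches into semantically meaningful regions": average-pool patch features from the vision encoder within a precomputed region map and optionally apply feature normalization. The function name `region_pool`, the use of mean pooling, and the example dimensions are illustrative assumptions, not the authors' actual method.

```python
import torch
import torch.nn.functional as F


def region_pool(patch_feats: torch.Tensor, region_ids: torch.Tensor,
                normalize: bool = True) -> torch.Tensor:
    """Pool patch features into region tokens (illustrative sketch).

    patch_feats: [N, D] patch features from the vision encoder.
    region_ids:  [N] integer region label per patch (e.g., from a segmenter
                 or superpixel partitioner); labels assumed contiguous 0..R-1.
    """
    num_regions = int(region_ids.max().item()) + 1
    d = patch_feats.shape[1]

    # Sum features per region, then divide by the number of patches in each region.
    sums = torch.zeros(num_regions, d, dtype=patch_feats.dtype)
    sums.index_add_(0, region_ids, patch_feats)
    counts = torch.zeros(num_regions, dtype=patch_feats.dtype)
    counts.index_add_(0, region_ids, torch.ones_like(region_ids, dtype=patch_feats.dtype))
    region_tokens = sums / counts.clamp(min=1).unsqueeze(1)

    # Optional feature normalization, one of the properties the abstract
    # identifies as important for region-based representations.
    if normalize:
        region_tokens = F.normalize(region_tokens, dim=-1)
    return region_tokens


# Example: 576 patches (a 24x24 grid) grouped into 32 regions,
# yielding 32 region tokens instead of 576 patch tokens for the LLM.
feats = torch.randn(576, 1024)
labels = torch.randint(0, 32, (576,))
tokens = region_pool(feats, labels)  # shape: [32, 1024]
```

The token savings in the example (576 patch tokens down to 32 region tokens) illustrate the efficiency argument; whether performance is preserved depends, per the abstract, on feature smoothness and locality and on how the regions are obtained.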
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19156