Bridging the Granularity Gap: Object-Centric Masking for Contextual Visual Learning

Published: 09 May 2026, Last Modified: 09 May 2026 · MUSI · CC BY 4.0
Keywords: Object-Centric Learning, Multimodal Spatial Reasoning
Abstract: In large language models (LLMs), emergent capabilities such as in-context learning and chain-of-thought reasoning have been closely associated with learning over discrete prediction units that are often semantically meaningful. In contrast, vision transformers, and the multimodal LLMs (MLLMs) built on top of them, have yet to exhibit similarly robust capabilities. We hypothesize that this discrepancy stems in part from vision encoders typically being pre-trained over spatial patch tokens that are only weakly aligned with semantic entities, leading to representations that are less object-aware and less sensitive to global context, and that therefore transfer less effectively to downstream tasks such as VQA and spatial reasoning. To bridge this gap, we propose to model objects as a stronger semantic unit for visual prediction, encouraging the encoder to learn the global context and semantics among visual elements. Specifically, we conduct a pilot study in the masked image modeling setting, where this hypothesis can be tested cleanly by masking visual objects rather than random patches during pre-training. Across qualitative analyses and quantitative benchmarks, we show that an object-centric objective reduces pixel-averaging shortcuts and yields more globally coherent and context-consistent representations. When used as the vision encoder in the MLLM frameworks LLaVA and BLIP, the resulting representations improve performance on multimodal QA and vision-centric understanding benchmarks, including VQA, GQA, ScienceQA, and CVBench, by up to 8.57 points, indicating stronger context utilization. Overall, our results highlight object-centric prediction as a simple yet effective design choice for learning more semantically rich and context-aware vision encoders, offering a promising direction for improving visual and multimodal intelligence.
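To make the core idea concrete, here is a minimal sketch of what masking whole objects instead of random patches could look like, assuming a precomputed instance-segmentation map is available. The helper `object_centric_mask`, its signature, and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def object_centric_mask(seg_map, patch_size, mask_ratio=0.5, rng=None):
    """Build a patch-level mask that hides whole objects rather than random patches.

    seg_map: (H, W) integer array of object ids (0 = background), e.g. from
             any off-the-shelf segmenter (an assumption of this sketch).
    Returns a boolean (H//patch_size, W//patch_size) array; True = masked patch.
    """
    rng = rng or np.random.default_rng(0)
    p = patch_size
    h, w = seg_map.shape[0] // p, seg_map.shape[1] // p
    mask = np.zeros((h, w), dtype=bool)

    # Visit objects in random order, masking one object at a time until
    # the target mask ratio is reached.
    obj_ids = [i for i in np.unique(seg_map) if i != 0]
    rng.shuffle(obj_ids)
    for oid in obj_ids:
        if mask.mean() >= mask_ratio:
            break
        # Mask every patch the object overlaps, so the entire semantic
        # unit must be predicted from surrounding context.
        ys, xs = np.nonzero(seg_map == oid)
        mask[ys // p, xs // p] = True
    return mask
```

In contrast to uniform random patch masking, partially visible objects can rarely be completed by local pixel averaging here, since each selected object is hidden in full.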
Supplementary Material: pdf
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27