Look&Learn: Where to Look? Bridging Perception and Grounding Gap in Vision-Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Attention Guidance, VLM, LLM, Understanding, Perception, Image, 2D
TL;DR: Look&Learn, a segmentation-free grounding approach that guides VLMs to attend to the correct image regions during text generation, without an explicit mask-generation module.
Abstract: Vision-Language Models (VLMs) excel at perception and reasoning, yet their ability to precisely ground visual concepts remains limited. A common remedy is to attach a segmentation decoder (e.g., SAM) to VLMs, which does equip them with segmentation capabilities but simultaneously erodes their understanding performance. We call this the understanding–grounding gap, where enhancing grounding comes at the expense of perception. In this work, we propose Look&Learn, a segmentation-free grounding framework that closes this gap by enabling the LLM itself to act as a segmentor. At its core is Where to Look?, a lightweight, plug-and-play loss applied directly to the attention heatmaps of arbitrary VLMs. Rather than relying on an additional decoder, our loss guides the model to focus on relevant image regions during text generation, effectively leveraging the LLM's own attention as high-quality segmentation masks. Your native LLM is your segmentor. Look&Learn is efficient and versatile: it can be integrated seamlessly into any VLM architecture and applied during pre-training, fine-tuning, or even post-training. To enable joint training of understanding and grounding, we further construct a scalable pseudo-grounding dataset that aligns textual and visual entities using state-of-the-art grounding models combined with LLM-based adjudication. Extensive experiments demonstrate that Look&Learn consistently improves grounding performance without sacrificing perception. On LLaVA, it delivers +3% gains on grounding benchmarks while also boosting understanding by 1–2%. On GLaMM, our loss further provides +1% improvements across three grounding datasets. These results confirm our central hypothesis: grounding and understanding are not competing objectives but can be unified within a single model, without explicit segmentation supervision.
Our work paves the way for more interpretable, efficient, and multimodal-aware VLMs, and more broadly, opens the door to using LLMs themselves as segmentors.
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11200