Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs

17 Sept 2025 (modified: 31 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: point-based grounding, few-shot learning, multimodal large language model
Abstract: Part-level pointing is important for fine-grained interaction and reasoning, yet existing Multimodal Large Language Models (MLLMs) remain limited to instance-level pointing. Part-level pointing presents unique challenges: annotation is costly, parts are long-tail distributed, and many are difficult to specify precisely in language. We introduce POinting at Parts (POP), a training-free, plug-and-play approach that addresses these challenges under a few-shot setup. POP fuses textual and visual attention maps with self-supervised visual correspondences between the query image and the few-shot examples. On average across the three evaluated datasets, POP achieves accuracy gains of up to 8.9 points in the one-shot setting and 16.4 points in the three-shot setting for the pointing-capable MLLMs—Qwen2.5-VL, Ovis2.5, and Molmo. Notably, even MLLMs without pointing capability benefit significantly from the proposed approach. These results establish a simple yet effective path toward fine-grained spatial grounding in MLLMs.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9000
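
A minimal sketch of the fusion idea described in the abstract, assuming attention maps and self-supervised features have already been extracted; the function name, the `alpha` weight, and the input layout are illustrative assumptions, not POP's actual implementation.

```python
# Illustrative-only sketch: fuse MLLM attention with few-shot visual
# correspondence to predict a part point. All names are hypothetical.
import numpy as np

def fuse_and_point(attn_text, attn_vis, feats_query, feats_shot, alpha=0.5):
    """attn_text, attn_vis: (H, W) MLLM attention maps for the textual and
    visual cues; feats_query: (H, W, D) self-supervised features of the query
    image; feats_shot: list of D-dim features sampled at the annotated part
    points of the few-shot examples. Returns a predicted (x, y) point."""
    # 1) Correspondence map: cosine similarity between every query location
    #    and the exemplar part features, averaged over the shots.
    q = feats_query / (np.linalg.norm(feats_query, axis=-1, keepdims=True) + 1e-8)
    s = np.stack([f / (np.linalg.norm(f) + 1e-8) for f in feats_shot])   # (S, D)
    corr = np.einsum("hwd,sd->hws", q, s).mean(axis=-1)                   # (H, W)

    # 2) Normalize each cue to [0, 1] before fusing.
    def norm01(m):
        return (m - m.min()) / (m.max() - m.min() + 1e-8)
    attn = norm01(attn_text) * norm01(attn_vis)
    fused = alpha * norm01(attn) + (1 - alpha) * norm01(corr)

    # 3) Point prediction: location of the fused maximum.
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return int(x), int(y)

# Toy usage with random data, just to show the shapes involved.
H, W, D, S = 24, 24, 64, 3
point = fuse_and_point(
    np.random.rand(H, W), np.random.rand(H, W),
    np.random.rand(H, W, D), [np.random.rand(D) for _ in range(S)],
)
print("predicted part point (x, y):", point)
```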