Few Contrastive Attention Heads Enable Visual Grounding in Large Vision-Language Models

20 Mar 2026 (modified: 21 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Visual grounding aims to localize image regions corresponding to natural language expressions. While recent Large Vision-Language Models (LVLMs) have shown impressive multi-modal understanding capabilities, their application to visual grounding typically requires fine-tuning and architectural modifications. This requirement, however, may be unnecessary: text and image features in LVLMs tend to have similar representations that appear to be approximately linearly disentangled, which enables clean extraction of spatial information without any task-specific training. Building on this observation, we propose an attention-head discovery framework that requires zero labeled grounding samples and no architectural modifications, and that identifies discriminative localization heads without manual inspection. Through dual prompting with target and contrastive descriptions, we compute differential residual representations and project them through attention head output matrices to measure per-head spatial contributions via four complementary scores. By aggregating signals using importance-weighted query difference scores from only the top-10 attention heads, we outperform the training-free non-LVLM baseline by up to 27.95% on RefCOCO, 21.93% on RefCOCO+, and 8.40% on RefCOCOg. Our method also outperforms the LVLM baseline by up to 8.04% on RefCOCO without requiring ground-truth category labels.
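To make the contrastive head-scoring idea described in the abstract concrete, the sketch below illustrates one way it could look in code. It is not the authors' implementation: the tensor shapes, the random stand-ins for residual-stream activations and per-head output matrices, and the single norm-based "query-difference" score are all illustrative assumptions (the paper uses four complementary scores computed from real LVLM activations).

```python
# Minimal sketch (assumptions, not the paper's code) of scoring attention heads
# by how strongly they carry the target-vs-contrastive residual difference.
import torch

n_layers, n_heads, d_model, d_head = 32, 32, 4096, 128

# Residual-stream representations at the final token for the target prompt
# (e.g. "the red mug") and a contrastive prompt (e.g. "the blue plate").
# In practice these would be captured from the LVLM with forward hooks;
# here random tensors stand in so the sketch runs on its own.
resid_target = torch.randn(n_layers, d_model)
resid_contrast = torch.randn(n_layers, d_model)
resid_diff = resid_target - resid_contrast  # differential residual representation

# Per-head output projection matrices W_O, shape (d_head, d_model) per head
# (random stand-ins for the model's weights split head-wise).
W_O = torch.randn(n_layers, n_heads, d_head, d_model)

# Project the residual difference through each head's output matrix and use
# the norm as a proxy for that head's contribution to the contrastive signal.
head_scores = torch.einsum("lhkd,ld->lhk", W_O, resid_diff).norm(dim=-1)

# Keep only the top-10 heads; their attention maps over image tokens would
# then be importance-weighted and aggregated into a grounding heatmap.
top_vals, top_idx = head_scores.flatten().topk(10)
top_heads = [(int(i) // n_heads, int(i) % n_heads) for i in top_idx]
print("Candidate localization heads (layer, head):", top_heads)
```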
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 8013