Keywords: Visual Grounding, Multi-modal Fusion, Token Merging, Efficiency
Abstract: Visual grounding (VG) aims to precisely localize an object in an input image based on its natural language description. Most recently proposed methods address this task with transformer-based architectures that inject textual information into the visual features. However, due to the image tokenization procedure, a large number of image tokens fall in text-irrelevant background areas. These tokens introduce noise into the attention computation, reducing the significance of foreground object tokens and ultimately hurting the effectiveness of these methods. To this end, we propose a novel Token Blurring (ToB) module, which dynamically merges image tokens based on their pairwise visual similarity and their textual relevance to the input expression. By reducing the number of text-irrelevant background tokens while preserving the density of text-referred ones, ToB improves both the effectiveness and efficiency of VG models. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that a transformer-based model equipped with our ToB module yields better results with lower computational overhead than various VG methods.
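Below is a minimal PyTorch sketch of the token-merging idea described in the abstract: merge image tokens that are visually similar to each other and weakly relevant to the text. The abstract does not specify ToB's exact formulation, so the function name `token_blur`, the merge count `r`, the cosine-similarity relevance score, and the simple averaging rule are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: text-aware token merging, under the assumptions stated above.
import torch
import torch.nn.functional as F

def token_blur(img_tokens: torch.Tensor, txt_tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r least text-relevant image tokens into visually similar kept tokens.

    img_tokens: (N, D) image token features
    txt_tokens: (M, D) text token features
    r: number of image tokens to merge away (output has N - r tokens)
    """
    n = img_tokens.size(0)
    img_n = F.normalize(img_tokens, dim=-1)
    txt_n = F.normalize(txt_tokens, dim=-1)

    # Textual relevance: each image token's best cosine similarity to any text token.
    relevance = (img_n @ txt_n.T).max(dim=-1).values        # (N,)

    # The r least text-relevant tokens are merged away; the rest are kept.
    src = relevance.argsort()[:r]                           # (r,)

    # Pairwise visual similarity; forbid matching a token to itself or to
    # another token that is also being merged away.
    sim = img_n @ img_n.T                                   # (N, N)
    sim.fill_diagonal_(float("-inf"))
    sim[:, src] = float("-inf")
    dst = sim[src].argmax(dim=-1)                           # (r,) kept-token targets

    # Average each merged token into its target (index_add_ handles repeated targets).
    merged = img_tokens.clone()
    counts = torch.ones(n, device=img_tokens.device)
    merged.index_add_(0, dst, img_tokens[src])
    counts.index_add_(0, dst, torch.ones(r, device=img_tokens.device))
    merged = merged / counts.unsqueeze(-1)

    keep = torch.ones(n, dtype=torch.bool, device=img_tokens.device)
    keep[src] = False
    return merged[keep]                                     # (N - r, D)

# Tiny usage example with random features.
img = torch.randn(196, 256)   # e.g., 14x14 ViT patch tokens
txt = torch.randn(12, 256)    # e.g., 12 word tokens
out = token_blur(img, txt, r=98)
print(out.shape)              # torch.Size([98, 256])
```

This sketch merges background tokens in a single step; dropping low-relevance tokens into their nearest visual neighbour (rather than discarding them) preserves information while shrinking the attention computation roughly quadratically in the number of removed tokens.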
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11034