Keywords: Visual Grounding, Multi-modal Fusion, Token Merging, Efficiency
Abstract: Visual grounding (VG) aims to precisely localize an object in an input image based on its natural language description. Most recently proposed methods address this task with transformer-based architectures that inject textual information into the visual features. However, due to the image tokenization procedure, a large number of image tokens fall in text-irrelevant background areas. These tokens introduce noise into the attention computation, reducing the significance of foreground object tokens and ultimately hurting the effectiveness of these methods. To this end, we propose a novel Token Blurring (ToB) module, which dynamically merges image tokens based on their pairwise visual similarity and their textual relevance to the input expression. By reducing the number of text-irrelevant background tokens while preserving the density of text-referred ones, ToB improves both the effectiveness and efficiency of VG models. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that a transformer-based model equipped with our ToB module yields better results with lower computational overhead than various VG methods.
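Below is a minimal PyTorch sketch of the token-merging idea described in the abstract: merge image tokens that are visually similar to each other and weakly relevant to the text. The abstract does not specify ToB's exact formulation, so the function name `token_blur`, the merge count `r`, the cosine-similarity relevance score, and the simple averaging rule are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: text-aware token merging, under the assumptions stated above.
import torch
import torch.nn.functional as F

def token_blur(img_tokens: torch.Tensor, txt_tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r least text-relevant image tokens into visually similar kept tokens.

    img_tokens: (N, D) image token features
    txt_tokens: (M, D) text token features
    r: number of image tokens to merge away (output has N - r tokens)
    """
    n = img_tokens.size(0)
    img_n = F.normalize(img_tokens, dim=-1)
    txt_n = F.normalize(txt_tokens, dim=-1)

    # Textual relevance: each image token's best cosine similarity to any text token.
    relevance = (img_n @ txt_n.T).max(dim=-1).values        # (N,)

    # The r least text-relevant tokens are merged away; the rest are kept.
    src = relevance.argsort()[:r]                           # (r,)

    # Pairwise visual similarity; forbid matching a token to itself or to
    # another token that is also being merged away.
    sim = img_n @ img_n.T                                   # (N, N)
    sim.fill_diagonal_(float("-inf"))
    sim[:, src] = float("-inf")
    dst = sim[src].argmax(dim=-1)                           # (r,) kept-token targets

    # Average each merged token into its target (index_add_ handles repeated targets).
    merged = img_tokens.clone()
    counts = torch.ones(n, device=img_tokens.device)
    merged.index_add_(0, dst, img_tokens[src])
    counts.index_add_(0, dst, torch.ones(r, device=img_tokens.device))
    merged = merged / counts.unsqueeze(-1)

    keep = torch.ones(n, dtype=torch.bool, device=img_tokens.device)
    keep[src] = False
    return merged[keep]                                     # (N - r, D)

# Tiny usage example with random features.
img = torch.randn(196, 256)   # e.g., 14x14 ViT patch tokens
txt = torch.randn(12, 256)    # e.g., 12 word tokens
out = token_blur(img, txt, r=98)
print(out.shape)              # torch.Size([98, 256])
```

This sketch merges background tokens in a single step; dropping low-relevance tokens into their nearest visual neighbour (rather than discarding them) preserves information while shrinking the attention computation roughly quadratically in the number of removed tokens.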
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11034