Keywords: Visual grounding, Vision and Language, Multi-modal fusion
Abstract: Visual grounding aims to localize the target object in an input image according to a language expression. To this end, existing methods either extract visual and linguistic features independently or use language information to guide visual feature extraction. However, the former strategy produces identical, general-purpose visual representations for every text query, which are redundant and sub-optimal for visual grounding. Methods that adopt the latter scheme typically construct sophisticated modules on top of linguistic features, guiding visual feature extraction only implicitly through end-to-end training; they often overlook fine-grained visual-linguistic alignment, yielding less discriminative visual features and limiting overall model performance. In this paper, we propose a simple yet effective module named Highlighter, which explicitly computes pixel-word correlations between visual and linguistic features and then uses this correlation information to calibrate and enhance the visual representations. Building on this module, we further introduce a cross-layer regularization loss that maintains the consistency of fine-grained alignment information across layers and facilitates the transmission of supervision signals to the shallow layers of the visual encoder. Extensive experiments show that our method achieves state-of-the-art performance on five widely used visual grounding datasets, and ablation studies verify its effectiveness and efficiency.
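To make the abstract's core mechanism concrete, here is a minimal sketch of what a pixel-word calibration module like the described Highlighter could look like. This is not the authors' implementation: the function name, the choice of cosine similarity, max-over-words aggregation, and the sigmoid residual gate are all illustrative assumptions, since the paper's exact formulation is not given here.

```python
import torch
import torch.nn.functional as F

def highlighter(visual, text, text_mask=None):
    """Hypothetical pixel-word calibration sketch (not the paper's code).

    visual:    (B, N, D) flattened visual tokens (N = H * W positions)
    text:      (B, L, D) word-level linguistic features
    text_mask: (B, L) boolean, True for valid (non-padding) words
    """
    # Pixel-word correlation: cosine similarity between every pixel and word.
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    corr = torch.einsum("bnd,bld->bnl", v, t)  # (B, N, L)
    if text_mask is not None:
        corr = corr.masked_fill(~text_mask[:, None, :], float("-inf"))
    # Aggregate over words to get a per-pixel relevance ("highlight") score.
    score = corr.max(dim=-1).values            # (B, N)
    gate = torch.sigmoid(score)[..., None]     # gate in (0, 1)
    # Calibrate: emphasize pixels correlated with the expression,
    # with a residual term that preserves the original visual signal.
    return visual * (1.0 + gate)
```

Under this reading, the cross-layer regularization loss mentioned in the abstract would penalize divergence between the correlation maps (`corr` above) produced at different encoder layers, so that alignment supervision also reaches shallow layers; the precise loss form is described in the paper itself.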
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4362