Bottom-Up and Top-Down Thoughts for Visual Intention Grounding

Published: 29 Jun 2025, Last Modified: 28 Apr 2026 | ICMR | Everyone | CC BY-NC-SA 4.0
Abstract: RIO (Reasoning Intention-oriented Object) is a visual grounding task that aims to locate the object in an image that best matches a given intention. Because intention descriptions refer to objects only implicitly and the scenes depicted in the images are complex, existing methods struggle with this task. In this paper, we propose a novel training-free approach that decouples visual processing from reasoning. Our approach comprises two complementary pipelines: a top-down pipeline that performs textual reasoning first and then visual processing, and a bottom-up pipeline that performs visual processing first and then textual reasoning. We then merge the outputs of these parallel pipelines with a dedicated strategy so that the two complement each other. Experimental results show that our approach attains performance comparable to that of fine-tuned visual grounding models on the RIO dataset. Additional experiments on the SKVG dataset further attest to the generalizability of our method.
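The two-pipeline structure described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the callables `reason_text`, `ground`, `detect`, and `choose` are hypothetical stand-ins for whatever language and vision models are used, and the `merge` rule (average the boxes when they overlap, otherwise fall back to the top-down box) is only a placeholder, since the abstract does not specify the actual merging strategy.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_down(image, intention: str,
             reason_text: Callable[[str], str],
             ground: Callable[[object, str], Box]) -> Box:
    """Reason first (intention -> object name), then locate it visually."""
    target_name = reason_text(intention)  # hypothetical text-only reasoner
    return ground(image, target_name)     # hypothetical grounding model


def bottom_up(image, intention: str,
              detect: Callable[[object], List[Tuple[str, Box]]],
              choose: Callable[[str, List[str]], int]) -> Box:
    """Look first (list candidate objects), then reason over their labels."""
    candidates = detect(image)            # hypothetical detector: (label, box)
    labels = [label for label, _ in candidates]
    return candidates[choose(intention, labels)][1]


def merge(td_box: Box, bu_box: Box, agree_iou: float = 0.5) -> Box:
    """Placeholder merge rule (an assumption, not the paper's strategy)."""
    if iou(td_box, bu_box) >= agree_iou:
        # The pipelines agree: average the coordinates.
        return tuple((t + b) / 2 for t, b in zip(td_box, bu_box))
    # The pipelines disagree: arbitrarily prefer the top-down result.
    return td_box
```

Passing the models in as callables keeps visual processing and reasoning decoupled in the sketch, so either component could be swapped without touching the pipeline logic.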