Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding

Published: 05 Mar 2025, Last Modified: 20 Mar 2025 · Reasoning and Planning for LLMs @ ICLR 2025 · CC BY 4.0
Keywords: Visual Grounding, Thinking and Reasoning, LLM, RL
TL;DR: Improving spatial reasoning in open-source LLMs for better visual grounding, making them comparable to closed-source LLMs like GPT-4o by using thinking and reasoning at inference.
Abstract: Visual grounding tasks involve identifying objects and references in an image based on a text input. A model must locate the objects and their relationships, and understand the image, to accurately ground the target. Specialized models such as OWL-ViT and Grounding DINO often fail to predict correct results for queries involving complex spatial information. In this paper, we propose a Spatial Thinking and Reasoning Dataset for visual grounding, together with a framework that uses existing detection models to identify candidate objects; the detectors provide coordinates and other attributes to a large language model (LLM), which performs spatial reasoning to determine the correct target. Recent closed-source models such as GPT-4o achieve approximately 86% accuracy, while open-source models perform significantly worse, reaching only about 60% accuracy in our experiments. To close this gap, we use reinforcement learning to fine-tune a 3B open-source model on our dataset, achieving 77% accuracy, comparable to closed-source models.
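
For illustration, here is a minimal sketch of the pipeline the abstract describes: detector outputs (labels, bounding boxes, confidence scores) are serialized into a text prompt that an LLM can reason over to pick the grounded target. The `Candidate` structure, the `build_spatial_prompt` helper, and the toy detections are assumptions made for this sketch, not the authors' actual implementation or dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """A candidate object proposed by an off-the-shelf detector."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float


def build_spatial_prompt(query: str, candidates: List[Candidate]) -> str:
    """Serialize detector outputs into text so the LLM can reason over
    coordinates and attributes and select the target candidate."""
    lines = [
        "You are given candidate objects detected in an image.",
        "Each candidate has an id, a label, a bounding box (x_min, y_min, x_max, y_max), and a confidence score.",
        "Using spatial reasoning, return the id of the candidate that matches the query.",
        "",
        f"Query: {query}",
        "Candidates:",
    ]
    for i, c in enumerate(candidates):
        lines.append(f"  id={i} label={c.label} box={c.box} score={c.score:.2f}")
    lines.append("Answer with the single best candidate id.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy candidates standing in for OWL-ViT / Grounding DINO detections.
    candidates = [
        Candidate("mug", (40, 220, 120, 300), 0.91),
        Candidate("mug", (480, 210, 560, 295), 0.88),
        Candidate("laptop", (200, 150, 430, 330), 0.95),
    ]
    prompt = build_spatial_prompt("the mug to the right of the laptop", candidates)
    print(prompt)
    # The prompt would then be sent to an LLM (e.g. GPT-4o, or the fine-tuned
    # 3B open-source model), which replies with the chosen candidate id.
```

In this setup, the LLM's spatial reasoning operates purely over the serialized coordinates rather than raw pixels, which is what lets the framework reuse existing detectors and swap in different LLMs at inference.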
Submission Number: 77