Keywords: Visual Grounding, Thinking and Reasoning, LLM, RL
TL;DR: Improving spatial reasoning in open-source LLMs for better visual grounding, making them comparable to closed-source models like GPT-4o through thinking and reasoning at inference time.
Abstract: Visual grounding tasks involve identifying objects and references in an image based on text input. A model must locate the objects, understand their relationships, and interpret the image to accurately ground the target. Specialized models such as OWL-ViT and Grounding DINO often fail to predict correct results for queries involving complex spatial information. In this paper, we propose a Spatial Thinking and Reasoning Dataset for visual grounding, together with a framework that uses existing detection models to identify candidate objects. These detectors provide coordinates and other attributes to a large language model (LLM), which performs spatial reasoning to determine the correct target. Recent closed-source models such as GPT-4o achieve approximately 86% accuracy, while open-source models perform significantly worse, reaching only about 60% accuracy in our experiments. To close this gap, we use reinforcement learning to fine-tune a 3B open-source model on our dataset, achieving 77% accuracy, comparable to closed-source models.
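The abstract describes a detector-to-LLM pipeline: a detection model proposes candidate objects with coordinates, and an LLM reasons over those coordinates to pick the queried target. The sketch below is a minimal, hypothetical illustration of such a pipeline; the names `Detection`, `build_prompt`, `select_target`, and the stub `llm` callable are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple   # (x1, y1, x2, y2) in pixel coordinates
    score: float

def build_prompt(query: str, detections: list) -> str:
    """Serialize candidate boxes into a text prompt for spatial reasoning."""
    lines = [f"Query: {query}", "Candidate objects:"]
    for i, d in enumerate(detections):
        lines.append(f"  [{i}] {d.label} at {d.box} (confidence {d.score:.2f})")
    lines.append("Reason step by step about the spatial relations, then answer "
                 "with only the index of the object that matches the query.")
    return "\n".join(lines)

def select_target(query: str, detections: list, llm) -> Detection:
    """Ask the LLM to reason over candidates and return the chosen detection."""
    answer = llm(build_prompt(query, detections))  # expected to return e.g. "1"
    return detections[int(answer.strip())]

if __name__ == "__main__":
    # Stub detections and a placeholder LLM that always answers "1".
    dets = [Detection("cup", (40, 120, 90, 180), 0.91),
            Detection("cup", (300, 110, 355, 175), 0.88)]
    print(select_target("the cup on the right", dets, lambda prompt: "1").box)
```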
Submission Number: 77