Keywords: thinking with images, o3, visual search, multi-agent framework, reinforcement learning
Abstract: The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning abilities that are crucial for real-world tasks like analyzing documents with dense charts/diagrams or navigating maps. To address this gap, we first introduce o3-bench, a new benchmark designed to evaluate multimodal reasoning while attending to visual details. O3-bench features challenging questions that require agents to gather subtle visual information from multiple distinct areas of an image while performing complex, interleaved reasoning over the gathered information. These tasks are highly challenging even for frontier systems like OpenAI o3, which obtains only 42.8% accuracy on o3-bench. To tackle these tasks, we propose InSight-o3, a multi-agent framework that divides labor between a visual reasoning agent (vReasoner) and a visual search agent (vSearcher). As a concrete first step towards o3-like systems, we focus on the latter (i.e., vSearcher) in this paper, for which we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent that can be directly called by other agents, our vSearcher significantly improves the performance of existing frontier multimodal models on a wide range of benchmarks by empowering them with generalized visual search.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 572