Keywords: thinking with images, o3, visual search, multi-agent framework, reinforcement learning
Abstract: The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception.
However, current open multimodal agents still fall well short on the reasoning side, which is crucial for real-world tasks such as analyzing documents with dense charts and diagrams or navigating maps.
To address this gap, we introduce O3-bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details.
O3-bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning.
The problems are highly challenging even for frontier systems like OpenAI o3, which obtains only 40.8% accuracy on O3-bench.
To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher). For the latter, we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, beyond simple objects or figures in natural images.
We then present a multimodal LLM purpose-trained for this task via reinforcement learning.
As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks.
This marks a concrete step towards powerful o3-like open systems.
Our code and dataset can be found at https://github.com/m-Just/InSight-o3.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 572