AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Published: 26 Jan 2026, Last Modified: 11 Feb 2026
Venue: ICLR 2026 Poster
License: CC BY 4.0
Keywords: MLLMs, Visual Tools, Reinforcement Learning
TL;DR: AdaReasoner enables multimodal models to dynamically orchestrate tools for complex reasoning, achieving state-of-the-art performance and emergent self-adaptive tool use.
Abstract: While augmenting Multimodal Large Language Models (MLLMs) with tools is a promising direction, current approaches face critical limitations. They often rely on single, atomic tools, which fails to address the challenges of multi-turn planning, and they do not equip models to select effective tool combinations for complex tasks. To overcome these limitations, we introduce AdaReasoner, a framework that teaches models to perform dynamic tool orchestration for iterative visual reasoning. Our paradigm is designed to support a broad spectrum of tools, including computationally intensive, expert-model-based services. Its design includes a new data curation methodology and a tailored Tool GRPO algorithm that optimizes multi-turn tool-calling trajectories. The resulting state-of-the-art models achieve substantial gains over their baselines (+38.7% average on 7B) and reach near-perfect accuracy on complex benchmarks such as Visual Spatial Planning (97.6%). This performance surpasses leading proprietary systems such as GPT-5 and Claude Sonnet 4, demonstrating that our approach can effectively overcome scale-based limitations by augmenting smaller models with powerful tool-use capabilities. Critically, we find that AdaReasoner develops emergent, self-adaptive behaviors: it learns to autonomously adopt beneficial tools, discard irrelevant ones, and modulate its usage frequency. This ability to curate its own optimal problem-solving strategies represents a significant step toward building more robust, scalable, and reliable reasoning agents.
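The abstract does not spell out how Tool GRPO scores trajectories, so the following is only a minimal sketch of the group-relative advantage step used by standard GRPO, not the authors' implementation. It assumes each sampled multi-turn tool-calling trajectory has already been reduced to a single scalar reward; the function name, the `eps` parameter, and the reward values are hypothetical and chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group.

    Assumption: `rewards` holds one scalar per trajectory, all sampled
    for the same prompt. This is the generic GRPO normalization; any
    Tool GRPO specific reward shaping is not modeled here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories answering the same question
# (made-up values, not results from the paper).
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update favors the better tool-calling sequences within each group without needing a separate value model.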
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11599