From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce AR-Bench and reveal that large language models struggle with active reasoning, highlighting a gap in their ability to gather and use information for real-world problem-solving.
Abstract: While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning—where an LLM must interact with external systems to acquire missing evidence or data—has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families—detective cases, situation puzzles, and guessing numbers—that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
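To make the notion of active reasoning concrete, the sketch below illustrates a minimal interaction loop in the spirit of the guessing-numbers task: the model must choose informative queries and update on partial feedback rather than receive all evidence up front. This is a hypothetical illustration only; the function names (`feedback`, `ask_model`, `run_episode`) and the random placeholder policy are ours and do not reflect AR-Bench's actual API or evaluation protocol (see the linked repository for that).

```python
# Minimal sketch of an active-reasoning episode (Bulls-and-Cows-style
# number guessing). Illustrative only; NOT the AR-Bench implementation.

import random


def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, partial): digits right in place vs. present elsewhere."""
    exact = sum(s == g for s, g in zip(secret, guess))
    total = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, total - exact


def ask_model(history: list[tuple[str, tuple[int, int]]]) -> str:
    """Placeholder policy. A real evaluation would prompt an LLM with the
    interaction history and parse its next query; here we guess 4 distinct
    digits at random just to keep the loop runnable."""
    return "".join(random.sample("0123456789", 4))


def run_episode(secret: str, max_turns: int = 10) -> bool:
    history: list[tuple[str, tuple[int, int]]] = []
    for _ in range(max_turns):
        guess = ask_model(history)    # the model actively chooses a query
        fb = feedback(secret, guess)  # the environment returns partial info
        history.append((guess, fb))
        if fb[0] == len(secret):      # all digits exactly right
            return True
    return False


if __name__ == "__main__":
    solved = sum(run_episode("0123") for _ in range(100))
    print(f"solved {solved}/100 episodes with a random policy")
```

The point of the sketch is the loop structure: performance depends on how well the querying policy exploits the accumulated history, which is exactly the capability the benchmark probes.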
Lay Summary: Large language models (LLMs) are often tested on their ability to reason using complete information, but real-world situations frequently require them to actively seek out missing data to solve problems. How well LLMs handle such "active reasoning" scenarios, where they must interact with external systems to gather necessary information, has received little systematic attention. We set out to investigate their performance in these more dynamic, real-world-like challenges. We created AR-Bench, a new benchmark specifically designed to test LLMs’ active reasoning skills. It includes three types of tasks—detective cases, situation puzzles, and guessing numbers—that mimic real-world scenarios requiring commonsense, logical, and symbolic reasoning. We tested various LLMs on these tasks and explored advanced strategies, like tree-based searching and post-training methods, to see whether they could improve the models’ ability to gather and use missing information effectively. Our findings show that current LLMs struggle significantly with active reasoning, revealing a clear gap between their ability to process given information and their capacity to seek out what’s missing. This matters because real-world applications, like decision-making or problem-solving in dynamic environments, often require such skills. Our work highlights the need for new training approaches to make LLMs more capable and reliable in practical, proactive scenarios.
Link To Code: https://github.com/tmlr-group/AR-Bench
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Reasoning, Active Reasoning
Submission Number: 4384