A Reward-Guided Dual-Phase Framework for Adaptive Inference-Time Reasoning

ACL ARR 2026 January Submission 5902 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: LLM Reasoning, Test Time Scaling, Chain-of-Thought
Abstract: Large Language Models (LLMs) have made strong progress in reasoning. A common inference-time approach to further enhance reasoning performance is tree-based search, which decomposes the reasoning process into multiple steps, expands multiple reasoning paths, and uses reward models to prune and select candidates. However, our exploration shows that this simple decomposition can lead to suboptimal search efficiency: while planning is generally harder, it is execution errors that are more likely to propagate to later steps. This indicates that planning and execution play different roles in reasoning and should be treated differently during tree-based search. Motivated by this, we propose a dual-phase test-time scaling framework that separates reasoning into planning and execution and performs search over each phase independently. To further refine the algorithm, we also introduce a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on confident steps and reallocation of computation to more challenging steps. Experiments on both math reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
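The dynamic budget allocation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names (`step_sampler`, `reward_model`), the per-step base budget, and the confidence threshold are all hypothetical placeholders. The sketch shows only the core idea: stop sampling a step early once a candidate's reward clears a confidence threshold, and carry the unspent samples forward to later, harder steps.

```python
def allocate_and_search(step_sampler, reward_model, num_steps,
                        base_budget=8, confidence=0.9):
    """Hypothetical sketch of reward-guided dynamic budget allocation.

    For each reasoning step, draw candidates one at a time; stop early
    once a candidate's reward clears `confidence`, and carry the unused
    budget forward to the next step.
    """
    carried = 0        # unspent samples carried over from confident steps
    trace = []         # best candidate selected at each step
    for step in range(num_steps):
        budget = base_budget + carried
        best, best_reward, used = None, float("-inf"), 0
        for used in range(1, budget + 1):
            cand = step_sampler(step, trace)          # sample one candidate
            r = reward_model(step, trace, cand)       # score it
            if r > best_reward:
                best, best_reward = cand, r
            if best_reward >= confidence:             # early stop: confident step
                break
        carried = budget - used                       # reallocate leftover compute
        trace.append(best)
    return trace
```

Under this scheme, an easy step that passes the threshold on its first sample donates the rest of its budget to the next step, which is one simple way to realize "reallocation of computation to more challenging steps"; the actual mechanism in the paper may differ.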
Paper Type: Long
Research Area: Language Models
Research Area Keywords: chain-of-thought
Languages Studied: English
Submission Number: 5902