Abstract: Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during inference. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: \textbf{(1)} What is the optimal approach to scaling test-time computation across different policy models, PRMs, and problem difficulty levels? \textbf{(2)} To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and AIME24 tasks, we make the following observations: \textbf{(1)} The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. \textbf{(2)} With our compute-optimal TTS strategy, extremely small policy models can outperform larger ones. For example, a \textbf{1B} LLM can exceed a \textbf{405B} LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a \textbf{0.5B} LLM outperforms \textbf{GPT-4o}, a \textbf{3B} LLM surpasses a \textbf{405B} LLM, and a \textbf{7B} LLM beats \textbf{o1} and \textbf{DeepSeek-R1}. These findings demonstrate the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
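To make the notion of PRM-guided test-time scaling concrete, below is a minimal sketch of one common TTS strategy, Best-of-N sampling scored by a Process Reward Model. The function names `generate_candidates` and `prm_score` are hypothetical placeholders for any policy-model sampler and PRM scorer, not the paper's implementation; the sketch only illustrates that "scaling test-time compute" amounts to raising the sampling budget N and letting the PRM pick the best candidate.

```python
# Minimal sketch of Best-of-N test-time scaling with a PRM (illustrative only).
# `generate_candidates` and `prm_score` are assumed, hypothetical callables:
# any policy LLM sampler / Process Reward Model with these signatures fits.
from typing import Callable, List

def best_of_n(
    problem: str,
    n: int,
    generate_candidates: Callable[[str, int], List[List[str]]],  # n solutions, each a list of reasoning steps
    prm_score: Callable[[str, List[str]], float],                 # scores a (problem, steps) pair
) -> List[str]:
    """Sample n candidate solutions and keep the one the PRM rates highest."""
    candidates = generate_candidates(problem, n)
    return max(candidates, key=lambda steps: prm_score(problem, steps))

# Increasing n spends more test-time compute; the compute-optimal choice of n
# (and of search strategy) depends on the policy model, the PRM, and the
# difficulty of the problem, which is the central question the paper studies.
```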
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Model, Test-Time Scaling, Process Reward Model
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2012