Abstract: Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during inference. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: \textbf{(1)} What is the optimal approach to scaling test-time computation across different policy models, PRMs, and problem difficulty levels? \textbf{(2)} To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and AIME24 tasks, we make the following observations: \textbf{(1)} The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. \textbf{(2)} With our compute-optimal TTS strategy, extremely small policy models can outperform larger ones. For example, a \textbf{1B} LLM can exceed a \textbf{405B} LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a \textbf{0.5B} LLM outperforms \textbf{GPT-4o}, a \textbf{3B} LLM surpasses a \textbf{405B} LLM, and a \textbf{7B} LLM beats \textbf{o1} and \textbf{DeepSeek-R1}. These findings demonstrate the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
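To make the notion of PRM-guided test-time scaling concrete, below is a minimal sketch of one common TTS strategy, Best-of-N sampling scored by a Process Reward Model. The function names `generate_candidates` and `prm_score` are hypothetical placeholders for any policy-model sampler and PRM scorer, not the paper's implementation; the sketch only illustrates that "scaling test-time compute" amounts to raising the sampling budget N and letting the PRM pick the best candidate.

```python
# Minimal sketch of Best-of-N test-time scaling with a PRM (illustrative only).
# `generate_candidates` and `prm_score` are assumed, hypothetical callables:
# any policy LLM sampler / Process Reward Model with these signatures fits.
from typing import Callable, List

def best_of_n(
    problem: str,
    n: int,
    generate_candidates: Callable[[str, int], List[List[str]]],  # n solutions, each a list of reasoning steps
    prm_score: Callable[[str, List[str]], float],                 # scores a (problem, steps) pair
) -> List[str]:
    """Sample n candidate solutions and keep the one the PRM rates highest."""
    candidates = generate_candidates(problem, n)
    return max(candidates, key=lambda steps: prm_score(problem, steps))

# Increasing n spends more test-time compute; the compute-optimal choice of n
# (and of search strategy) depends on the policy model, the PRM, and the
# difficulty of the problem, which is the central question the paper studies.
```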
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Model, Test-Time Scaling, Process Reward Model
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2012