Think Deep, Think Fast: Investigating Efficiency of Trained-verifier-free Inference-time-scaling Methods

17 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Inference-time scaling methods, LLM Reasoning, Compute Optimal
Abstract: There is intense interest in investigating how inference-time compute (ITC) (e.g., repeated sampling, refinements) can improve large language model (LLM) capabilities. At the same time, recent breakthroughs in reasoning models, such as Deepseek-R1, unlock the opportunity for reinforcement learning to improve LLM reasoning skills. An in-depth understanding of how ITC interacts with reasoning across different models could provide important guidance on how to further advance the LLM frontier. This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus our research on verifier-free inference-time scaling methods due to their generalizability without needing a reward model. We construct the Pareto frontier of quality and efficiency. We find that non-reasoning models, even with an extremely high inference budget, still fall behind reasoning models. For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming more sophisticated ITC methods such as best-of-N and sequential revisions, while the additional inference compute offers minimal improvements. We further perform an in-depth analysis of the effect of key response features (length and linguistic markers) on response quality, with which we can improve existing ITC methods. We find that correct responses from reasoning models are typically shorter and contain fewer linguistic markers.
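As a concrete illustration of the verifier-free strategy the abstract highlights, the sketch below shows majority voting over N sampled answers. It is a minimal assumption-laden example, not the paper's implementation: the function name and the answer-extraction step (which is model- and task-specific) are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N sampled responses.

    `answers` is a list of final answers already extracted from N
    independent samples of the same prompt; the extraction step is
    task-specific and omitted here.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) yields the (answer, count) pair with the highest count;
    # ties resolve by first occurrence among the samples.
    answer, _ = counts.most_common(1)[0]
    return answer

# Example: five sampled answers to the same math problem
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Because it needs no trained verifier or reward model, this strategy generalizes across tasks, which is the property the paper exploits when comparing it against best-of-N and sequential revisions.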
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9813