Think Deep, Think Fast: Investigating Inference-Time Scaling And The Reasoning Floor

Junlin Wang; Shang Zhu; Jon Saad-Falcon; Ben Athiwaratkun; Qingyang Wu; Jue WANG; Shuaiwen Leon Song; Ce Zhang; Bhuwan Dhingra; James Zou

Published: 01 Jun 2026, Last Modified: 09 Jun 2026. Venue: AdaptFM Poster. License: CC BY 4.0

Keywords: Test time scaling, LLM reasoning, Reasoning strategies

Abstract: There is intense interest in how inference-time compute (ITC), e.g., repeated sampling and refinement, can improve large language model (LLM) capabilities. While breakthroughs like DeepSeek-R1 highlight the power of reinforcement learning for reasoning, the interaction between ITC and reasoning-optimized weights remains poorly understood. This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. While prior work suggests that scaling test-time compute can optimally substitute for model parameter scaling, we identify a fundamental limit to this compute equivalence: the Reasoning Floor. We demonstrate that general-purpose models fail to match the accuracy of reasoning-optimized models even with an order of magnitude more inference compute, suggesting that internalizing reasoning protocols is a prerequisite for effective test-time scaling. Within reasoning models, we find that more complex scaling methods often yield diminishing returns: simple majority voting consistently outperforms sophisticated sequential revision and mixture-of-agents frameworks. Crucially, we identify a "Linguistic Signal of Correctness": correct responses are significantly more concise and exhibit a lower density of "hedging" and "thinking" markers. We demonstrate that these intrinsic linguistic features can serve as zero-compute proxies for response quality, providing a pathway to more efficient, self-diagnostic reasoning agents.
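As a rough illustration of two ideas named in the abstract, the sketch below shows majority voting over sampled final answers and a simple marker-density score for a single response. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the marker list, and the tokenization are all hypothetical choices made for this example.

```python
from collections import Counter
import re

# Hypothetical marker list, chosen for illustration only; not taken from the paper.
HEDGING_MARKERS = ("wait", "hmm", "perhaps", "maybe", "alternatively")


def majority_vote(answers):
    """Return the most frequent final answer among sampled responses."""
    if not answers:
        raise ValueError("need at least one answer")
    # On ties, Counter.most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


def marker_density(response, markers=HEDGING_MARKERS):
    """Fraction of alphabetic word tokens in the response that are listed markers."""
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return hits / len(tokens)
```

Under these assumptions, `majority_vote` aggregates several samples into one answer, while `marker_density` needs no additional model calls and could be combined with response length as a cheap quality signal.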

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 192
