Keywords: Test time scaling, LLM reasoning, Reasoning strategies
Abstract: There is intense interest in how inference-time compute (ITC), e.g., repeated sampling and refinement, can improve large language model (LLM) capabilities. While breakthroughs like DeepSeek-R1 highlight the power of reinforcement learning for reasoning, the interaction between ITC and reasoning-optimized weights remains poorly understood. This work presents a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. While prior work suggests that scaling test-time compute can optimally substitute for scaling model parameters, we identify a fundamental limit to this compute equivalence: the Reasoning Floor. We demonstrate that general-purpose models fail to match the accuracy of reasoning-optimized models even with an order of magnitude more inference compute, suggesting that internalizing reasoning protocols is a prerequisite for effective test-time scaling. Within reasoning models, we find that more complex scaling methods often yield diminishing returns: simple majority voting consistently outperforms sophisticated sequential revision and mixture-of-agents frameworks. Crucially, we identify a "Linguistic Signal of Correctness": correct responses are significantly more concise and exhibit a lower density of "hedging" and "thinking" markers. We demonstrate that these intrinsic linguistic features can serve as zero-compute proxies for response quality, providing a pathway to more efficient, self-diagnostic reasoning agents.
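Illustrative sketch: the two simple mechanisms named in the abstract, majority voting over sampled answers and a linguistic proxy for response quality, might look roughly like the following. This is not the authors' implementation. The marker list `HEDGE_MARKERS`, the function names, and the tie-breaking rule (prefer the shorter response) are all assumptions made for illustration; the abstract does not specify which markers are counted or how they are scored.

```python
from collections import Counter
import re

# Hypothetical marker list: the abstract does not enumerate the
# "hedging" and "thinking" markers it measures.
HEDGE_MARKERS = ("wait", "hmm", "maybe", "perhaps", "alternatively")


def majority_vote(answers):
    """Return the most frequent final answer among sampled responses."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


def marker_density(text, markers=HEDGE_MARKERS):
    """Fraction of alphabetic tokens in `text` that are listed markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return hits / len(tokens)


def pick_by_linguistic_signal(responses):
    """Pick the response with the lowest marker density.

    Ties are broken by shorter length (an assumed rule, motivated by the
    claim that correct responses tend to be more concise).
    """
    return min(responses, key=lambda r: (marker_density(r), len(r)))
```

For example, `majority_vote` takes the final answers extracted from several samples, while `pick_by_linguistic_signal` takes the full response texts and needs no additional model calls, which is the sense in which such features would be a "zero-compute" proxy.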
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 192