Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

TMLR Paper 7260 Authors

30 Jan 2026 (modified: 15 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's latest reasoning models show promising performance through the use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, *Sys2Bench*, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning. *Sys2Bench* provides a unified framework for revealing the strengths and limitations of current inference-time methods, setting the stage for more principled and scalable approaches to LLM reasoning.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ellen_Vitercik1
Submission Number: 7260