Abstract: We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's latest reasoning models show promising performance through the use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, *Sys2Bench*, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks spanning five categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning.
*Sys2Bench* provides a unified framework for revealing the strengths and limitations of current inference-time methods, setting the stage for more principled and scalable approaches to LLM reasoning.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ellen_Vitercik1
Submission Number: 7260