Abstract: OpenAI claims that its recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive Large Language Models (LLMs), making it a new kind of model, a Large Reasoning Model (LRM), that is generally capable of tackling procedural reasoning tasks. We present the first comprehensive evaluation of these models on the fundamental tasks of planning and scheduling. Previous research attempted to use LLMs’ expressive generation capabilities to solve these problems, but met with only limited success. We fill the gaps in this literature by testing a larger suite of state-of-the-art LLMs on a set of large benchmarks, and then use the results as a baseline to evaluate o1-preview and o1-mini. We find that while these LRMs can offer significant accuracy improvements over LLMs, accuracy alone is a misleading and incomplete metric, because LRM queries incur large and unpredictable costs and take significant amounts of time to complete. We provide a case study demonstrating that, at those same price points, other methods of inference-time scaling can do just as well. We also show that, contrary to OpenAI’s injunctions, o1’s performance can be improved further by embedding it in compound systems that separately, but complementarily, scale inference time further. Finally, while the paper is focused on o1, we provide similar evaluations of a more recent (and open-weight) LRM: DeepSeek R1.
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/karthikv792/LLMs-Planning
Assigned Action Editor: ~Sarath_Chandar1
Submission Number: 3839