Exploring and Benchmarking Planning Capabilities of Large Language Models

Bernd Bohnet; Azade Nova; Aaron T Parisi; Katayoon Goshvadi; Kevin Swersky; Hanjun Dai; Dale Schuurmans; Noah Fiedel; Hanie Sedghi

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: planning capability, LLMs, many-shot, in-context learning

TL;DR: We benchmark, investigate, and improve the planning capabilities of LLMs.

Abstract: Classical and natural language planning tasks remain a difficult domain for modern large language models (LLMs). In this work, we lay the foundations for improving the planning capabilities of LLMs. First, we construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. This suite includes algorithms to methodically generate task instances of varying difficulty, allowing for rigorous and systematic evaluation of LLM performance. Next, we investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance. In addition, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths. We also probe the efficacy of chain-of-thought reasoning methods for improving LLM planning performance. Moreover, we evaluate the proposed methods in out-of-distribution scenarios, assessing their ability to generalize to novel and unseen planning challenges. Finally, we investigate the models' failure modes and reveal insights that hold across different benchmarks.

Primary Area: foundation or frontier models, including LLMs

Submission Number: 11213
