Are General-Purpose LLMs Ready for Planning? A Large-Scale Evaluation in PDDL

Kaustubh Vyas; Damien Graux; Sebastien Montella; Pavlos Vougiouklis; Jeff Z. Pan

Published: 24 Jul 2025, Last Modified: 04 Oct 2025, Venue: XLLM-Reason-Plan, Readers: Everyone, License: CC BY 4.0

Keywords: PDDL, LLM, Capability, Benchmark, LLM-family

Abstract: Large language models (LLMs) have recently shown proficiency in code generation and chain-of-thought reasoning, laying the groundwork for tackling automatic formal planning tasks. This study evaluates how well LLMs understand and generate the Planning Domain Definition Language (PDDL), an essential representation in artificial intelligence planning. We conduct an extensive analysis of 20 distinct models spanning 7 major LLM families, both commercial and open-source. We focus exclusively on general-purpose, off-the-shelf models and exclude recent reasoning-centric models, both to avoid confounding from task-specific architectural scaffolding and to evaluate the native planning fluency of widely deployed LLMs. Our evaluation sheds light on zero-shot LLM capabilities in parsing, generating, and reasoning with PDDL. We find that while some models handle PDDL effectively, others show limitations in more complex scenarios that require nuanced planning knowledge. These results highlight both the promise and the current limitations of LLMs in formal planning tasks, offering insights into their application and guiding future work on AI-driven planning.

Paper Published: No

Paper Category: Long Paper

Supplementary Material: zip

Demography: Prefer not to say

Academic: Masters Student

Submission Number: 25
