Benchmarking and Improving PDDL Formalization Ability of Large Language Models with Planner-in-the-Loop Feedback
Keywords: symbolic planning, PDDL formalization, large language model, planner-in-the-loop
Abstract: Planning depends on symbolic specifications that are both executable and verifiable. However, large language models often generate Planning Domain Definition Language (PDDL) problem descriptions that appear syntactically well formed yet fail under strict precondition, effect-consistency, and reachability constraints. Even minor specification errors can render a task unsolvable, motivating benchmarks and learning signals grounded in planner-based verification rather than surface plausibility.
We present NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL specification construction, with planner-verified executability and object-count difficulty scaling. We also propose a planner-in-the-loop framework that uses validator and planner diagnostics to revise non-executable specifications via localized edits. Building on this toolchain, we present a planner-grounded optimization recipe combining parameter-efficient Low-Rank Adaptation supervised fine-tuning, offline planner-derived preference pairs for Direct Preference Optimization, and inference-time planner-in-the-loop repair, without online planner calls during training. We further provide a unified evaluation suite for parseability, solvability, specification similarity, and outcome-aware plan-level consistency against planner references. Experiments on eight representative model families show higher planner success and plan-level agreement, as well as improved robustness under difficulty scaling and cross-domain transfer. Code and data are available at: https://anonymous.4open.science/r/NL-PDDL-Bench-BF76
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: fine-tuning; neurosymbolic approaches; representation learning; LLM/AI agents
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 4822