Benchmarking and Improving PDDL Formalization Ability of Large Language Models with Planner-in-the-Loop Feedback
Keywords: symbolic planning, PDDL formalization, large language model, planner-in-the-loop
Abstract: Planning depends on symbolic specifications that are both executable and verifiable. However, large language models often generate Planning Domain Definition Language (PDDL) problem descriptions that appear syntactically well formed yet fail under strict precondition, effect-consistency, and reachability constraints. Even minor specification errors can render a task unsolvable, motivating benchmarks and learning signals grounded in planner-based verification rather than surface plausibility.
We present NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL specification construction, with planner-verified executability and object-count difficulty scaling. We also propose a planner-in-the-loop framework that uses validator and planner diagnostics to revise non-executable specifications via localized edits. Building on this toolchain, we present a planner-grounded optimization recipe combining parameter-efficient Low-Rank Adaptation supervised fine-tuning, offline planner-derived preference pairs for Direct Preference Optimization, and inference-time planner-in-the-loop repair, without online planner calls during training. We further provide a unified evaluation suite for parseability, solvability, specification similarity, and outcome-aware plan-level consistency against planner references. Experiments on eight representative model families show higher planner success and plan-level agreement, as well as improved robustness under difficulty scaling and cross-domain transfer. Code and data are available at: https://anonymous.4open.science/r/NL-PDDL-Bench-BF76
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: fine-tuning; neurosymbolic approaches; representation learning; LLM/AI agents
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 4822