Keywords: large reasoning language models, task planning, safety, benchmark
Abstract: Large Reasoning Language Models (LRLMs) show strong potential for robotic task planning, but their reasoning processes remain unreliable: they may violate safety constraints, which reduces the likelihood of producing correct plans, or exhibit inconsistencies between the reasoning process and the final result, which undermines interpretability and user trust. Existing evaluations of LRLMs rely mainly on outcome-based metrics, such as task success rate and token efficiency, which fail to capture these critical reasoning properties. This gap is especially concerning in safety-critical planning domains, where verifying the correctness of reasoning is essential. To address this issue, we propose a fine-grained safety evaluation framework that systematically analyzes the reasoning processes of LRLMs on task planning problems. Our method segments the reasoning into chunks, summarizes each chunk into explicit planning steps, and verifies these steps against safety constraints using an external verifier, while applying rollback techniques to prevent the verification from biasing subsequent reasoning. Using a dataset of Planning Domain Definition Language (PDDL)-based problems, we conduct extensive experiments on a range of LLMs and LRLMs. The results reveal inconsistencies between the reasoning process and the final output of LRLMs, as well as their limited ability to detect and correct safety violations in their own reasoning. These findings point to directions for future improvement.
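The sketch below illustrates the segment-summarize-verify pipeline described in the abstract, under stated assumptions: the chunking rule, the placeholder summarizer, the toy keyword-based verifier, and the treatment of rollback as keeping verification "on the side" of the trace are all illustrative choices, not the authors' implementation.

```python
# Minimal, illustrative sketch of a chunk-summarize-verify loop.
# All function names, the chunking rule, and the toy safety constraint
# below are assumptions for illustration, not the paper's code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VerifiedStep:
    chunk: str   # raw reasoning chunk
    step: str    # summarized planning step
    safe: bool   # verdict from the external verifier


def segment_reasoning(reasoning: str) -> List[str]:
    """Assumed chunking rule: split the reasoning trace on blank lines."""
    return [c.strip() for c in reasoning.split("\n\n") if c.strip()]


def summarize_chunk(chunk: str) -> str:
    """Placeholder summarizer; in the paper an LLM would produce the step."""
    return chunk.splitlines()[0]


def evaluate_reasoning(reasoning: str,
                       verifier: Callable[[str], bool]) -> List[VerifiedStep]:
    """Segment, summarize, and verify each step. Verification runs on the
    side and writes nothing back into the trace, mirroring (in this offline,
    simplified setting) the rollback idea of keeping the check from biasing
    subsequent reasoning."""
    results: List[VerifiedStep] = []
    for chunk in segment_reasoning(reasoning):
        step = summarize_chunk(chunk)
        results.append(VerifiedStep(chunk, step, verifier(step)))
    return results


if __name__ == "__main__":
    # Toy "external verifier": flags steps that mention a forbidden action.
    forbidden = ("enter restricted zone",)
    verifier = lambda s: not any(f in s.lower() for f in forbidden)

    trace = ("Move the robot to shelf A.\n\n"
             "Enter restricted zone to take a shortcut.\n\n"
             "Pick up the box and deliver it.")
    for r in evaluate_reasoning(trace, verifier):
        print("SAFE  " if r.safe else "UNSAFE", "-", r.step)
```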
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15228