Keywords: LLM, Evaluations, Planning, RAC
TL;DR: New informative metrics for evaluating LLM responses on planning and reasoning tasks
Abstract: Planning, reasoning, and sequential decision-making have played a pivotal role in the development of AI systems. While Large Language Models (LLMs) have demonstrated impressive capabilities, their evaluation on planning and Reasoning about Action and Change (RAC) problems typically relies on strict binary success criteria, which limits the information available for further analysis and development. Given the probabilistic and autoregressive nature of LLMs, this work proposes simple, non-binary, task-specific metrics for evaluating LLM responses on planning and reasoning tasks. These metrics go beyond exact matching with the ground truth by using set-comparison methods, while still maintaining rigid, non-malleable evaluation criteria. We demonstrate the utility of such metrics in capturing richer information about the quality, precision, and nature of LLM responses, and their closeness to the ground truth, through evaluations on six tasks across two domains. With two case studies, we further demonstrate the feasibility of comparative analysis across the task-specific score distributions this metric yields.
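As an illustration of the kind of set-comparison scoring the abstract describes (a minimal sketch, not the authors' exact formulation), the snippet below assigns graded scores by comparing a predicted set of plan elements against the ground-truth set; the function name and example action strings are hypothetical.

```python
from typing import Iterable


def set_overlap_scores(predicted: Iterable[str], ground_truth: Iterable[str]) -> dict:
    """Compare an LLM's predicted set (e.g., plan actions or fluents) to the ground truth.

    Returns graded scores in [0, 1] instead of a single binary pass/fail.
    """
    pred, gold = set(predicted), set(ground_truth)
    overlap = pred & gold
    precision = len(overlap) / len(pred) if pred else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    jaccard = len(overlap) / len(pred | gold) if (pred | gold) else 1.0
    exact_match = float(pred == gold)  # the strict binary criterion, kept for reference
    return {"precision": precision, "recall": recall,
            "jaccard": jaccard, "exact_match": exact_match}


# A response that is close to, but not exactly, the ground truth receives
# informative partial credit rather than a flat failure.
print(set_overlap_scores(
    predicted={"pickup(A)", "stack(A, B)", "pickup(C)"},
    ground_truth={"pickup(A)", "stack(A, B)", "stack(C, A)"},
))
```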
Submission Number: 251