From Turns to Dialogues: Rethinking Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

ACL ARR 2025 February Submission 5223 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. In particular, while traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that arise during user-agent interactions. In this paper, we introduce **TD-Eval** (**T**urn and **D**ialogue-level **Eval**uation), a two-step framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At the turn level, we assess each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. At the dialogue level, we design a Conversational Task Completion Agent Arena that uses pairwise comparisons to measure overall dialogue quality. Through experiments on MultiWOZ and Tau-Bench, we demonstrate that TD-Eval effectively identifies conversational errors that conventional metrics miss. Furthermore, TD-Eval achieves strong alignment with human judgments while remaining simple to integrate into LLM-based agent pipelines. These findings demonstrate that TD-Eval offers a new paradigm for the evaluation of dialogue systems, providing a comprehensive assessment across both the turn and dialogue levels to capture the full spectrum of TOD system behavior.
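The sketch below is a minimal, hypothetical illustration of the two-step setup the abstract describes, not the authors' implementation: an LLM judge scores each turn on the three named dimensions, and a separate pairwise prompt compares two full dialogues. The function names (`score_turn`, `compare_dialogues`), the 1-5 scale, and the `llm_judge` callable are all assumptions introduced for illustration.

```python
# Illustrative sketch only: turn-level scoring on three TOD dimensions plus a
# pairwise dialogue-level comparison, loosely following the TD-Eval abstract.
# `llm_judge` stands in for any function that sends a prompt to an LLM judge
# and returns its text reply.
from dataclasses import dataclass
from typing import Callable, List

DIMENSIONS = [
    "conversation cohesion",
    "backend knowledge consistency",
    "policy compliance",
]

@dataclass
class Turn:
    user: str
    system: str

def score_turn(turn: Turn, context: List[Turn],
               llm_judge: Callable[[str], str]) -> dict:
    """Ask the judge to rate one system response (1-5) on each dimension."""
    history = "\n".join(f"USER: {t.user}\nSYSTEM: {t.system}" for t in context)
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Dialogue so far:\n{history}\n"
            f"USER: {turn.user}\nSYSTEM: {turn.system}\n"
            f"Rate the SYSTEM response for {dim} on a 1-5 scale. "
            "Answer with a single number."
        )
        reply = llm_judge(prompt)
        digits = [c for c in reply if c.isdigit()]
        scores[dim] = int(digits[0]) if digits else 0  # 0 if reply is unparsable
    return scores

def compare_dialogues(dialogue_a: str, dialogue_b: str,
                      llm_judge: Callable[[str], str]) -> str:
    """Pairwise dialogue-level comparison: the judge picks the better dialogue."""
    prompt = (
        "Two agents attempted the same task.\n"
        f"Dialogue A:\n{dialogue_a}\n\nDialogue B:\n{dialogue_b}\n\n"
        "Which dialogue completes the task better? Answer 'A' or 'B'."
    )
    return "A" if "A" in llm_judge(prompt).upper() else "B"

if __name__ == "__main__":
    dummy_judge = lambda prompt: "4"  # stand-in for a real LLM call
    turn = Turn("Book a table for two at 7pm.", "Done: Table 5 is booked for 19:00.")
    print(score_turn(turn, [], dummy_judge))
```

In practice, the pairwise outcomes from `compare_dialogues` could feed a ranking scheme over agents (e.g., win rates or Elo-style ratings); the exact aggregation used by TD-Eval is described in the paper itself, not in this sketch.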
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Task Oriented Dialogue, Evaluation, Conversational Agents
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 5223
