ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Published: 03 Mar 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop MemAgents · CC BY 4.0
Keywords: Agentic Memory, Long-Horizon Evaluation, Task-Oriented Dialogue, Multi-Goal Reasoning, Dependency Management, Proactive Agents
TL;DR: ATOD is a benchmark and evaluation framework that probes how agentic dialogue systems use explicit memory to track, update, and coordinate interdependent goals over long horizons, enabling fine-grained evaluation of memory-driven agent behavior.
Abstract: Recent advances in task-oriented dialogue (TOD) systems, driven by LLMs with extensive API and tool integration, have expanded the scope of conversational agents beyond traditional turn-by-turn task execution. Modern systems are increasingly expected to coordinate interleaved goals, preserve long-horizon context, and provide proactive assistance under asynchronous execution. However, existing benchmarks do not systematically evaluate these agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-horizon reasoning. ATOD captures key characteristics of Advanced TOD, including multi-goal coordination, dependency management, long-horizon memory, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that operationalizes these dimensions through fine-grained metrics and supports reproducible evaluation in both offline and online settings. We further present an agentic memory-based evaluator for benchmarking models on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment of task completion, agentic capability, and response quality, and that the proposed evaluator provides a favorable accuracy–efficiency trade-off relative to strong memory-based and LLM-based baselines under this evaluation setting.
Submission Number: 18