AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

ICLR 2026 Conference Submission 25416 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: benchmark, multiturn, goal-shift, robustness, agents, evaluation, llm
TL;DR: We present a benchmark that stress-tests agents on explicit goal shifts in dual-control, multi-turn dialogs. It adds sequence-annotated scenarios spanning multiple service domains and personas, along with goal-shift-based evaluation metrics.
Abstract: Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce $\textbf{AgentChangeBench}$, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 590 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate a mix of proprietary and open-source models and uncover sharp contrasts obscured by traditional pass@k scores. Our findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
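To make two of the metrics concrete, the sketch below shows one plausible way to compute TCRR and GSRT over a toy dialogue trace. The data layout (`Turn`), field names, and exact formulas are illustrative assumptions, not the paper's released implementation; the authors' definitions may differ.

```python
# Illustrative sketch only: plausible computations for TCRR and GSRT on a toy
# dialogue trace. All names, the data layout, and the formulas are assumptions,
# not taken from the AgentChangeBench paper or its code release.
from dataclasses import dataclass

@dataclass
class Turn:
    index: int                  # position of this turn in the dialogue
    tool_calls: list[str]       # serialized tool calls issued at this turn
    goal_id: str                # user goal active at this turn
    addresses_goal: str | None  # goal the agent's response actually serves, if any

def tool_call_redundancy_rate(turns: list[Turn]) -> float:
    """Fraction of tool calls that exactly repeat an earlier call (assumed definition)."""
    seen: set[str] = set()
    redundant, total = 0, 0
    for turn in turns:
        for call in turn.tool_calls:
            total += 1
            if call in seen:
                redundant += 1
            seen.add(call)
    return redundant / total if total else 0.0

def goal_shift_recovery_times(turns: list[Turn]) -> list[int]:
    """Turns elapsed between each goal shift and the first agent turn serving the new goal."""
    recovery_times: list[int] = []
    shift_at, new_goal = None, None
    for prev, cur in zip(turns, turns[1:]):
        if cur.goal_id != prev.goal_id:   # the user shifted goals at this turn
            shift_at, new_goal = cur.index, cur.goal_id
        if shift_at is not None and cur.addresses_goal == new_goal:
            recovery_times.append(cur.index - shift_at)
            shift_at = None               # shift handled; wait for the next one
    return recovery_times
```

Under these assumptions, a lower TCRR indicates fewer wasted tool calls, and a smaller GSRT indicates faster adaptation after the user changes goals; TSR and TUE would be computed separately from task outcomes and tool-call validity.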
Primary Area: datasets and benchmarks
Supplementary Material: zip
Submission Number: 25416