$\tau^2$-bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres; Honghua Dong; Soham Ray; Xujie Si; Karthik R Narasimhan

18 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Desk Rejected Submission, CC BY 4.0

Keywords: Benchmark, Evaluation, Dual Control, Conversational AI Agents, User Simulation

TL;DR: $\tau^2$-bench introduces a new way to test AI agents by letting both the agent and a simulated user interact in a shared "telecom" world. This allows for creating diverse, verifiable tasks and better user simulation.

Abstract: Existing benchmarks for conversational AI agents simulate **single-control environments**, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios such as technical support, where users must actively participate in modifying the state of the shared world. To address this gap, we introduce $\tau^2$-bench, with four key contributions: (1) a novel **Telecom dual-control domain** modeled as a Dec-POMDP, where both agent and user use tools to act in a shared, dynamic environment that tests both agent coordination and communication; (2) a **compositional task generator** that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; (3) a **reliable user simulator** tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; and (4) **fine-grained analysis of agent performance** through multiple ablations, including separating errors arising from reasoning from those arising from communication and coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control settings, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 12919
