YRC-Bench: A Benchmark for Learning to Coordinate with Experts

Published: 25 Dec 2025, Last Modified: 06 Jan 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: When deployed in the real world, AI agents will inevitably face challenges that exceed their individual capabilities. A critical component of AI safety is an agent's ability to recognize when it is likely to fail in a novel situation and to yield control to a more capable expert system. Leveraging such expert assistance can significantly improve safety and performance in these situations. Since expert assistance is costly, a central challenge is determining when to consult an expert. In this paper, we explore a novel variant of this problem, termed YRC-0, in which an agent must learn to collaborate with an expert in new environments in an unsupervised manner, that is, without interacting with the expert during training. This setting motivates the development of low-cost, robust approaches for training expert-leveraging agents. To support research in this area, we introduce YRC-Bench, an open-source benchmark that instantiates YRC-0 across diverse environments. YRC-Bench provides a standardized Gym-like API, simulated experts, an evaluation pipeline, and implementations of popular baselines. Toward tackling YRC-0, we propose a validation strategy and use a proposer-validator decomposition as a diagnostic framework to evaluate a range of learning methods, offering insights that can inform future research.
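To make the setting concrete, the sketch below illustrates the kind of Gym-like coordination loop the abstract describes: at each step, a coordination policy decides whether the novice agent acts on its own or yields control to the costly expert. This is a minimal sketch under assumed interfaces; the names (`run_episode`, `coord_policy`, `expert_cost`, the `AGENT`/`EXPERT` constants) are illustrative and are not the actual YRC-Bench API.

```python
# Hypothetical sketch of a YRC-style coordination loop (names are assumptions,
# not the YRC-Bench API). At each step a coordination policy chooses whether
# the novice acts alone or control is yielded to the costly expert.

AGENT, EXPERT = 0, 1

def run_episode(env, novice, expert, coord_policy, expert_cost=0.1):
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        who = coord_policy(obs)          # decide: act alone or ask the expert
        if who == EXPERT:
            action = expert.act(obs)     # expert help incurs a fixed cost
            total_reward -= expert_cost
        else:
            action = novice.act(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward
```

In YRC-0, the coordination policy must be learned without querying the expert during training, so `coord_policy` can only rely on signals available to the novice itself.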
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. **Title:** Updated to "YRC-Bench: A Benchmark for Learning to Coordinate with Experts" to better reflect the primary contribution.
2. **Problem Setting Motivation:** Expanded Section 3 to explicitly motivate the "frozen policy" constraint (no online learning) using real-world safety and regulatory examples (e.g., medical diagnosis, autonomous driving).
3. **Method Selection:** Added text to Section 5 justifying the choice of Deep SVDD as a representative baseline due to its efficiency in learning compact boundaries on latent representations (a minimal sketch of this scoring rule appears after this list).
4. **Diagnostic Framework:** Clarified in Section 3 that the "proposer-validator" decomposition is introduced primarily as a diagnostic framework for analyzing failure modes, rather than as a rigid algorithmic requirement.
5. **Simulated Novice Design:** Added justification in Section 5 for using "limited supervision" rather than held-out tasks, clarifying that this heuristic allows the primary agent to maximize its capabilities using the full training distribution.
6. **De-anonymization:** Added author names, affiliations, the correct OpenReview ID, and the public GitHub link.
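For context on item 3: Deep SVDD fits a compact hypersphere around the latent representations of in-distribution states, so that distance from the center serves as a novelty score a validator can threshold to decide when to consult the expert. The sketch below shows only the scoring rule on a frozen encoder (full Deep SVDD additionally trains the encoder to contract features around the center); the function names and threshold calibration are assumptions, not the benchmark's implementation.

```python
import torch

# Minimal Deep-SVDD-style validator sketch (assumed interface, not the
# YRC-Bench implementation). `encoder` is any frozen feature extractor,
# e.g. the latent trunk of the novice policy.

def fit_center(encoder, train_obs):
    """Hypersphere center c: mean latent feature over training states."""
    with torch.no_grad():
        z = encoder(train_obs)            # (N, d) latent features
    return z.mean(dim=0)

def svdd_score(encoder, obs, center):
    """One-class score: squared distance to c; higher means more novel."""
    with torch.no_grad():
        z = encoder(obs)
    return ((z - center) ** 2).sum(dim=-1)

def should_ask_expert(score, threshold):
    """Validator rule: consult the expert when the state looks novel.
    The threshold would be calibrated on held-out validation states
    (the calibration procedure here is an assumption)."""
    return score > threshold
```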
Code: https://github.com/modanesh/YRC-Bench
Assigned Action Editor: ~Martha_White1
Submission Number: 5993