Keywords: Machine Learning, Reinforcement Learning, Data Quality, Data Curation, Benchmarks
Abstract: When coding agents fail benchmark tasks, the failure is opaque: benchmarks report only that the agent failed, not why. This matters for two increasingly critical use cases. For evaluation, practitioners need to know whether failures reflect agent limitations or task defects—ambiguous specifications, flaky tests, broken environments—to diagnose and improve their systems. For RL training, failures serve as negative reward signal, but training on task defects as if they were agent errors introduces noise that corrupts learned policies. We formalize this problem with a failure attribution taxonomy and validate it through human annotation (κ > 0.84, N = 158) across two benchmarks, revealing significant variation in task defect rates. We then present AutoTriage, a system that automates failure attribution by deploying an agentic judge with sandboxed environment access to investigate trajectories—executing code, running tests, navigating file systems, and analyzing error logs. We evaluate nine configurations across three models and three access modes (text-only, read-only agent, full sandbox). On a software engineering benchmark, AutoTriage achieves κ = 0.83, reaching 90% of human inter-annotator agreement. To our knowledge, this is the first use of an agentic judge with full execution access for failure triage. Our framework provides a missing diagnostic layer in the agent development pipeline, transforming benchmarks from pass/fail scoreboards into tools for both targeted improvement and clean training data curation.
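The abstract reports annotator agreement as Cohen's kappa (κ), which corrects raw agreement for the agreement expected by chance. As a minimal illustration of how such a score is computed (the labels and annotator data below are hypothetical, not from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two parallel lists of categorical labels."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where annotators match.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: dot product of the two label distributions.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical failure-attribution labels from two annotators.
ann1 = ["agent_error", "task_defect", "agent_error", "agent_error"]
ann2 = ["agent_error", "task_defect", "agent_error", "task_defect"]
print(cohens_kappa(ann1, ann2))  # → 0.5
```

A κ above 0.8, as reported in the abstract, is conventionally read as near-perfect agreement; the value discounts cases where annotators would coincide by chance alone.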
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 172