Verifying the Verifiers: Failure Attribution for Agentic Benchmark Diagnostics and Training Data Curation

Published: 05 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop RSI Poster · CC BY 4.0
Keywords: benchmarking, evaluation, training data, failure modes, coding agents, LLM-as-judge, recursive self-improvement, data curation, reinforcement learning
TL;DR: We formalize failure attribution for coding agent benchmarks, validate it with humans, and automate it with AutoTriage, revealing that weaker critics systematically misattribute agent errors to task defects.
Abstract: When coding agents fail benchmark tasks, the failure is opaque: benchmarks report only that the agent failed, not why. This matters for two critical use cases. For evaluation, practitioners need to know whether failures reflect agent limitations or task defects—ambiguous specifications, flaky tests, broken environments—to diagnose and improve their systems. For recursive self-improvement, failures serve as reward signal for RL training, but training on task defects as if they were agent errors introduces noise that corrupts learned policies and prevents productive self-modification. We formalize this problem with a failure attribution taxonomy and validate it through a human annotation study on a software engineering benchmark, establishing high inter-annotator agreement ($\kappa = 0.929$). We then present AutoTriage, a system that automates failure attribution by deploying an agentic judge with sandboxed environment access to investigate trajectories—executing code, running tests, and analyzing error logs. We evaluate nine configurations across three models and three access modes (text-only, read-only agent, full sandbox). The best configuration—sandbox execution with GPT-5.2 Codex—achieves near-human agreement ($\kappa = 0.833$), though the benefit of execution access is model-dependent. Error analysis reveals that weaker triage models exhibit a systematic directional bias: they over-attribute agent failures to task defects, constructing plausible defenses of the agent rather than identifying root causes—a failure mode with direct implications for any system that uses its own critic to drive self-improvement.
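The agreement scores reported above (human κ = 0.929; best AutoTriage configuration κ = 0.833) are Cohen's kappa, which corrects raw label agreement for the agreement expected by chance. As an illustrative reference only (not the paper's code), a minimal stdlib-only implementation for two annotators' label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each annotator's label marginals.
    """
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-label marginal frequencies, summed.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, with attribution labels such as "agent_error" vs. "task_defect", two annotators who agree on 3 of 4 items but share a skewed marginal can score κ = 0.5 despite 75% raw agreement, which is why the paper's near-ceiling κ values are a stronger claim than raw accuracy.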
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 128