Verifying the Verifiers: Failure Attribution for Agentic Benchmark Diagnostics and Training Data Curation
Keywords: failure attribution, coding agents, benchmark evaluation, self-evolving agents, LLM-as-judge, training data curation, reward signal quality
TL;DR: We formalize failure attribution for coding agent benchmarks, validate it with high human agreement, and automate it with AutoTriage. We reveal that weaker critics systematically misattribute, corrupting self-improvement loops.
Abstract: When coding agents fail benchmark tasks, the failure is opaque: benchmarks report only that the agent failed, not why. This matters for two critical use cases. For evaluation, practitioners need to know whether failures reflect agent limitations or task defects—ambiguous specifications, flaky tests, broken environments—to diagnose and improve their systems. For self-evolving agents, failures serve as a reward signal for RL training, but training on task defects as if they were agent errors introduces noise that corrupts learned policies and prevents productive adaptation. We formalize this problem with a failure attribution taxonomy and validate it through a human annotation study on a software engineering benchmark, establishing high inter-annotator agreement (κ=0.929). We then present AutoTriage, a system that automates failure attribution by deploying an agentic judge with sandboxed environment access to investigate trajectories—executing code, running tests, and analyzing error logs. We evaluate nine configurations across three models and three access modes (text-only, read-only agent, full sandbox). The best configuration—sandbox execution with GPT-5.2 Codex—achieves near-human agreement (κ=0.833), though the benefit of execution access is model-dependent. Error analysis reveals that weaker triage models exhibit a systematic directional bias: they over-attribute agent failures to task defects, constructing plausible defenses of the agent rather than identifying root causes—a failure mode with direct implications for any agent that uses its own critic to drive lifelong self-improvement.
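The agreement figures quoted above (κ=0.929 for humans, κ=0.833 for the best automated configuration) are Cohen's kappa, which corrects raw label agreement for the agreement expected by chance. A minimal sketch of the computation, using hypothetical annotator labels over a two-category slice of a failure taxonomy (the label names and data here are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical attribution labels from two annotators.
ann1 = ["agent", "agent", "defect", "agent", "defect", "agent"]
ann2 = ["agent", "agent", "defect", "defect", "defect", "agent"]
print(round(cohens_kappa(ann1, ann2), 3))  # prints 0.667
```

A κ near 1 indicates near-perfect agreement beyond chance; values around 0.8–0.9, as reported in the abstract, are conventionally read as strong to near-perfect agreement.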
Submission Number: 192