Below the Reliability Floor: Recovering True Success from Judge-Gated Loops

TMLR Paper9786 Authors

16 Jun 2026 (modified: 20 Jun 2026). Under review for TMLR. CC BY 4.0

Abstract: LLM judges are increasingly placed inside an agent's loop, scoring the agent's own attempts and re-prompting until one passes. We show this quietly corrupts measurement: retry-until-PASS is optional stopping against a noisy classifier—it keeps drawing until the judge slips—so the reported pass rate is an upward-biased estimator of true success. We make this exact. The cap-$K$ gate is a binary classifier with closed-form sensitivity/specificity, and its bias is governed by one coefficient, the gate's Youden index $J$: as $J \to 0$ the gated rate becomes uninformative about $\pi$, so recovery must fall back to gold labels and no estimator beats the gold-only mean. Across $44$ capable-agent loops on GSM8K, MATH, and code with objective ground truth (no authored weakness; a separate terse-agent stress set is excluded here) the inflation is systematic (median slip $+0.16$; worst on code, where the judge cannot run the candidate, a true $0.74$ inflated to a reported $0.98$) and obeys a closed-form law predicting the slip from per-attempt statistics (pooled $r=0.95$; errors-in-variables slope $0.765\,[0.70,0.84]$, excluding $1$). To recover true success we benchmark Rogan–Gladen against prediction-powered inference: PPI++ dominates in aggregate (mean recovery MAE $0.050$ vs. $0.149$, and $0.081$ vs. $0.241$ as gold becomes scarce; the advantage concentrates in the high-bias regime, while per-gate differences on balanced gates are within noise), because it escapes the $1/J^2$ variance that makes the classical correction fragile. 
Beyond verifiable gold, we measure recovery on public, human-labeled non-verifiable gates (response safety and summary quality), where PPI++ recovers the true rate to mean MAE $0.043$ versus $0.165$ for the naive estimate ($\sim 4\times$ in aggregate, up to $>10\times$ on the most-biased gates). In the motivating regime, a non-verifiable safety judge inside a real retry loop ($n=400$, scored against a pre-registered 3-model strong-LLM panel, a disclosed proxy that is human-anchored at raw $0.90$ agreement), a lenient gate ships $6.8\%$ truly-unsafe responses (95% CI $[0.05,0.10]$), while the calibrated correction recovers the panel safe-rate $\sim 3.5\times$ more accurately than the naive estimate. The deliverable is a recipe: report a PPI++ estimate alongside $J$ as a reliability/identifiability diagnostic, measure at $K=1$, and, via a label-free drift detector (ROC-AUC $0.80$), de-bias only when calibration transfers. We release all code and content-free data.
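The inflation mechanism and the classical correction named in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: it assumes attempts are i.i.d. with a fixed per-attempt success probability `p` and a judge with fixed per-attempt sensitivity and specificity, which is a simplification made here for illustration. All function names are hypothetical. The Rogan–Gladen estimator is written in its standard form, $(\hat{q} + \mathrm{Sp} - 1)/J$ with $J = \mathrm{Se} + \mathrm{Sp} - 1$.

```python
import random


def youden_j(sensitivity, specificity):
    """Youden index J = Se + Sp - 1 of a binary judge."""
    return sensitivity + specificity - 1.0


def rogan_gladen(observed_rate, sensitivity, specificity):
    """Standard Rogan-Gladen correction of an observed positive rate.

    Dividing by J is what makes the estimate fragile as J -> 0.
    """
    j = youden_j(sensitivity, specificity)
    if abs(j) < 1e-12:
        raise ValueError("J is ~0: the judge carries no information")
    estimate = (observed_rate + specificity - 1.0) / j
    return min(1.0, max(0.0, estimate))  # clip to a valid probability


def gated_rates(p, sensitivity, specificity, k):
    """Closed form under the i.i.d. assumption made in this sketch.

    Returns (reported pass rate, fraction of tasks that pass AND are correct)
    for a retry-until-PASS loop capped at k attempts.
    """
    q = p * sensitivity + (1.0 - p) * (1.0 - specificity)  # per-attempt PASS prob
    if q == 0.0:
        return 0.0, 0.0
    reported = 1.0 - (1.0 - q) ** k
    correct_and_passed = p * sensitivity * reported / q
    return reported, correct_and_passed


def simulate(p, sensitivity, specificity, k, n, seed=0):
    """Monte Carlo check of gated_rates under the same assumptions."""
    rng = random.Random(seed)
    passed = correct_passed = 0
    for _ in range(n):
        for _ in range(k):
            correct = rng.random() < p
            pass_prob = sensitivity if correct else 1.0 - specificity
            if rng.random() < pass_prob:  # judge says PASS, loop stops
                passed += 1
                correct_passed += int(correct)
                break
    return passed / n, correct_passed / n


if __name__ == "__main__":
    for k in (1, 5):
        reported, truly_ok = gated_rates(0.5, 0.9, 0.8, k)
        print(f"K={k}: reported={reported:.3f}, passed and correct={truly_ok:.3f}")
    print("Rogan-Gladen at K=1:", rogan_gladen(0.55, 0.9, 0.8))
```

With these illustrative numbers, the reported rate at $K=5$ is well above the fraction of tasks whose passing output is actually correct. The per-attempt correction recovers `p` only from the $K=1$ rate, which is consistent with the recipe's advice to measure at $K=1$.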

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Peng_Li2

Submission Number: 9786
