Keywords: Benchmarking, Theory, Agentic AI, Uncertainty Quantification, Proper Scoring Rules
TL;DR: We show that standard agentic uncertainty metrics can hide miscalibrated confidence traces, especially when trajectories are collapsed or censored, and introduce a strictly proper trajectory score as a fix.
Abstract: Standard agentic UQ evaluations can hide trace-level failure modes. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized Trajectory Brier evaluate rankings, binwise calibration, or collapsed trajectory summaries, but none strictly elicit the prefix-conditioned success-probability process $q_t=\mathbb{P}^{\pi}(Y=1\mid\mathcal{H}_t)$. The result is a practical diagnostic failure: a confidence trace can appear acceptable under standard metrics while being badly mis-scaled for deferral, reflection, human handoff, or cost-weighted decisions. We characterize this failure mode theoretically and empirically. Theoretically, we show that Trajectory ECE is resolution-blind and that scalarized Trajectory Brier under common aggregators is not strictly proper for the trace. Empirically, on Tau2-Bench, Platt recalibration changes AUROC by only $\Delta/\mathrm{SE}\approx 0.3$ while changing a strictly proper trajectory score by $\Delta/\mathrm{SE}\approx 43$; on WebShop, complete-only evaluation drops 47.08% of the assumption-valid working sample, the dropped trajectories are roughly $3\times$ longer, and censored-aware scoring changes the reported score. As a fix, we introduce the Trajectory Proper Score (TPS), a strictly proper trajectory-level evaluator built from any strictly proper binary score and positive trajectory weights, with a conditional-projection extension for administratively censored prefixes. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that evaluator choice can shift benchmark conclusions by margins far larger than bootstrap uncertainty.
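Illustrative sketch (not the paper's implementation): the abstract does not give the exact TPS formula, so the code below assumes the simplest form consistent with its description, a positively weighted sum over prefixes of a strictly proper binary score (Brier here) applied to each confidence $p_t$ and the final outcome $Y$. The names `trajectory_score`, `mean_score`, `auroc`, and `squash`, along with the toy traces, are hypothetical. It shows how a strictly increasing mis-scaling of the confidences leaves a ranking metric such as AUROC unchanged while the trajectory-level proper score changes.

```python
def brier(p, y):
    """Strictly proper binary score (lower is better)."""
    return (p - y) ** 2


def trajectory_score(trace, y, weights=None, score=brier):
    """Assumed TPS-style form: sum_t w_t * S(p_t, y) with all w_t > 0."""
    if not trace:
        raise ValueError("trace must contain at least one confidence")
    if weights is None:
        # Uniform positive weights; any positive weights would do.
        weights = [1.0 / len(trace)] * len(trace)
    if len(weights) != len(trace) or any(w <= 0 for w in weights):
        raise ValueError("need one strictly positive weight per step")
    return sum(w * score(p, y) for w, p in zip(weights, trace))


def mean_score(traces, labels):
    """Average trajectory score over a set of trajectories."""
    total = sum(trajectory_score(t, y) for t, y in zip(traces, labels))
    return total / len(traces)


def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy confidence traces (made up for illustration) and final outcomes.
traces = [[0.6, 0.8, 0.9], [0.5, 0.3, 0.2], [0.7, 0.9, 0.95], [0.4, 0.4, 0.1]]
labels = [1, 0, 1, 0]


def squash(p):
    """Strictly increasing map that shrinks every confidence toward 0.5."""
    return 0.5 + 0.1 * (p - 0.5)


squashed = [[squash(p) for p in t] for t in traces]

# Ranking metric on final-step confidence: identical before and after squashing.
auroc_original = auroc([t[-1] for t in traces], labels)
auroc_squashed = auroc([t[-1] for t in squashed], labels)

# Trajectory-level proper score: penalizes the mis-scaled trace.
score_original = mean_score(traces, labels)
score_squashed = mean_score(squashed, labels)

print(auroc_original == auroc_squashed, score_original < score_squashed)
```

Any strictly increasing transform preserves the pairwise ordering that AUROC depends on, which is why the two AUROC values coincide here, whereas the weighted Brier sum depends on the actual probability values at every step.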
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 107