Keywords: tool use, agent evaluation, agent reliability, faithfulness, trace analysis, failure mode analysis, pre-registration, benchmarking, interpretability, evaluation methodology
Abstract: Tool-using language models can call a verifier and still emit final artifacts that contradict its output. End-to-end task reward cannot distinguish these artifact-faithfulness failures from honest errors.
Applying a deterministic trace-coder to 1,980 archived τ-bench (Yao et al., 2024) trajectories from gpt-4o and claude-3.5-sonnet, with no harness modification, we find the failure mode in 46–68% of reward-0 trajectories per cell, evidence that artifact-faithfulness failures are real and pervasive in third-party agent traces. The coder is the rubric component of a trace-local protocol (V, X, R) that separates artifact errors from verifier-handling errors over a replayable trajectory.
The protocol is designed to adjudicate its own claims. A pre-declared inferential family lets a single instrument return supported, scope-conditional, falsified, and reversed verdicts against fixed substrate evidence without rhetorical rescue. We exercise this on one within-cell mechanism question (which textual signal repairs a frozen contaminated artifact) and report what the protocol decided across four substrates: supported on the primary substrate and replicated on Ariane 5 (Lions et al., 1996) and τ-bench, falsified on a pre-registered SysML forward test, and reversed on Spider text-to-SQL (Yu et al., 2018). That the same instrument cleanly adjudicates its own falsifications is the contribution.
We release the protocol, the τ-bench external-validation harness, frozen artifacts, pre-registration commits at locked git tags, and DriftGuard, a ∼900-LoC library operationalizing the trace-local invariants as a drop-in guard for verifier-in-the-loop pipelines.
Paper Type: Long
Research Area: LLM agents
Research Area Keywords: tool use, agent evaluation, agent reliability, faithfulness, interpretability, failure mode analysis, benchmarking, evaluation methodology
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 17282