Adjudicating Artifact-Faithfulness Claims in Tool-Using LLM Agents: A Trace-Local Protocol

ACL ARR 2026 May Submission17282 Authors

26 May 2026 (modified: 19 Jun 2026) · ACL ARR 2026 May Submission · Readers: Everyone · CC BY 4.0
Keywords: tool use, agent evaluation, agent reliability, faithfulness, trace analysis, failure mode analysis, pre-registration, benchmarking, interpretability, evaluation methodology
Abstract: Tool-using language models can call a verifier and still emit final artifacts that contradict its output. End-to-end task reward cannot distinguish these artifact-faithfulness failures from honest errors. Applying a deterministic trace-coder to 1,980 archived τ-bench (Yao et al., 2024) trajectories from gpt-4o and claude-3.5-sonnet, with no harness modification, we find the failure mode on 46–68% of reward-0 trajectories per cell, evidence that artifact-faithfulness failures are real and pervasive in third-party agent traces. The coder is the rubric component of a trace-local protocol (V, X, R) that separates artifact errors from verifier-handling errors over a replayable trajectory. The protocol is designed to adjudicate its own claims. A pre-declared inferential family lets a single instrument return supported, scope-conditional, falsified, and reversed verdicts against fixed substrate evidence without rhetorical rescue. We exercise this on one within-cell mechanism question (which textual signal repairs a frozen contaminated artifact) and report what the protocol decided across four substrates: supported on the primary substrate and replicated on Ariane 5 (Lions et al., 1996) and τ-bench, falsified on a pre-registered SysML forward test, and reversed on Spider text-to-SQL (Yu et al., 2018). That the same instrument cleanly adjudicates its own falsifications is the contribution. We release the protocol, the τ-bench external-validation harness, frozen artifacts, pre-registration commits at locked git tags, and DriftGuard, a ~900-LoC library operationalizing the trace-local invariants as a drop-in guard for verifier-in-the-loop pipelines.
Paper Type: Long
Research Area: LLM agents
Research Area Keywords: tool use, agent evaluation, agent reliability, faithfulness, interpretability, failure mode analysis, benchmarking, evaluation methodology
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 17282