Keywords: benchmarking, explanation faithfulness, evaluation methodologies
Abstract: Benchmark items have several parts --- an instruction, source records, a
reference answer, and evaluator code --- and any of them can be wrong. A
natural response is to ask an LLM to audit the item, but this creates a
circular measurement problem, since evaluating the auditor seems to require
another auditor. We sidestep this by constructing audit tasks whose correct
verdict is known by design: $200$ web-agent-style benchmark items (each a
question over $\sim$800 structured records), rendered four ways (clean, or
with exactly one defect injected into the instruction, reference, or
evaluator), so an audit can be scored mechanically against where we put the
defect. Across $2{,}400$ audits from three production models, we find that
the auditor's check of the reference answer is only as reliable as its own
ability to compute that answer itself. When the benchmark requires tallying
hundreds of records, detection of a wrong reference falls from $68\%$ to
$9\%$ and false positives on clean items rise from $44\%$ to $88\%$;
detection of a buggy evaluator, found by reading code rather than
recomputing, stays at $80\%$, so the failure is not general difficulty.
Reasoning traces and an answer-supplied probe converge on the mechanism: when
checking a reference answer, the auditor often re-solves the task and trusts
its own answer over the reference.
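
The mechanical scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `score_audits`, the label strings, and the reduction of each audit to a single verdict label are assumptions made for illustration. The audit count assumes one audit per item, rendering, and model, which matches the stated total of 2,400.

```python
from collections import Counter

# Hypothetical labels for the four renderings named in the abstract.
CONDITIONS = ("clean", "instruction", "reference", "evaluator")

N_ITEMS, N_MODELS = 200, 3
# Assumed design: one audit per (item, rendering, model).
N_AUDITS = N_ITEMS * len(CONDITIONS) * N_MODELS


def score_audits(audits):
    """Score (injected, verdict) pairs against the known defect location.

    For defective renderings the returned rate is the detection rate;
    for "clean" it is the false-positive rate.
    """
    totals, hits = Counter(), Counter()
    for injected, verdict in audits:
        totals[injected] += 1
        if injected == "clean":
            # Any flagged component on a clean item is a false positive.
            hits[injected] += verdict != "clean"
        else:
            # A detection counts only if the auditor names the injected part.
            hits[injected] += verdict == injected
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Toy verdicts, not data from the paper.
example = [
    ("reference", "reference"),
    ("reference", "clean"),
    ("clean", "evaluator"),
    ("clean", "clean"),
    ("evaluator", "evaluator"),
]
rates = score_audits(example)
```

Because the defect location is fixed at construction time, no second auditor is needed to grade the first; the comparison is a plain label match.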
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: benchmarking, evaluation methodologies, agent evaluation, explanation faithfulness, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15335