Auditing by Re-Solving: LLM Benchmark Auditors Trust Their Own Answer Over the Reference

ACL ARR 2026 May Submission15335 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, Readers: Everyone, License: CC BY 4.0

Keywords: benchmarking, explanation faithfulness, evaluation methodologies

Abstract: Benchmark items have several parts (an instruction, source records, a reference answer, and evaluator code), and any of them can be wrong. A natural response is to ask an LLM to audit the item, but this creates a circular measurement problem, since evaluating the auditor seems to require another auditor. We sidestep this by constructing audit tasks whose correct verdict is known by design: $200$ web-agent-style benchmark items (each a question over $\sim$800 structured records), each rendered four ways (clean, or with exactly one defect injected into the instruction, the reference, or the evaluator), so that an audit can be scored mechanically against the location of the injected defect. Across $2{,}400$ audits from three production models, we find that the auditor's check of the reference answer is only as reliable as its own ability to compute that answer. When the benchmark requires tallying hundreds of records, detection of a wrong reference falls from $68\%$ to $9\%$ and false positives on clean items rise from $44\%$ to $88\%$; detection of a buggy evaluator, which is found by reading code rather than by recomputing, stays at $80\%$, so the failure is not one of general difficulty. Reasoning traces and an answer-supplied probe converge on the mechanism: when checking a reference answer, the auditor often re-solves the task and trusts its own answer over the reference.
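The mechanical scoring described in the abstract can be illustrated with a minimal sketch. The names `Audit` and `score_audits` are hypothetical, and the scoring rule is an assumption rather than the paper's exact procedure: here a defect counts as detected when the injected component is among the components the auditor flags, and any flag on a clean item counts as a false positive.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional

# Design arithmetic from the abstract: 200 items x 4 renderings x 3 models.
N_ITEMS, N_RENDERINGS, N_MODELS = 200, 4, 3
N_AUDITS = N_ITEMS * N_RENDERINGS * N_MODELS  # 2400


@dataclass(frozen=True)
class Audit:
    """One auditor verdict on one rendering (hypothetical record layout)."""
    injected: Optional[str]   # "instruction", "reference", "evaluator", or None if clean
    flagged: FrozenSet[str]   # components the auditor reported as defective


def score_audits(audits: Iterable[Audit]) -> Dict[str, float]:
    """Return the detection rate per injected component, plus the
    false-positive rate on clean items under the key "clean"."""
    hits: Counter = Counter()
    totals: Counter = Counter()
    for audit in audits:
        key = audit.injected or "clean"
        totals[key] += 1
        if audit.injected is None:
            # Assumed rule: any flag on a clean item is a false positive.
            hits[key] += bool(audit.flagged)
        else:
            # Assumed rule: detection means the injected component was flagged.
            hits[key] += audit.injected in audit.flagged
    return {key: hits[key] / totals[key] for key in totals}


example = [
    Audit(None, frozenset()),                       # clean, correctly passed
    Audit(None, frozenset({"reference"})),          # clean, falsely flagged
    Audit("reference", frozenset({"reference"})),   # wrong reference, detected
    Audit("evaluator", frozenset({"instruction"})), # buggy evaluator, missed
]
rates = score_audits(example)
```

Because the location of each defect is fixed at construction time, no second auditor is needed to grade the first: the rates follow from a direct comparison of verdicts with the injection labels.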

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: benchmarking, evaluation methodologies, agent evaluation, explanation faithfulness, probing

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 15335
