Context Over Content: Exposing Evaluation Faking in Automated Judges
Keywords: LLM-as-a-judge, Evaluation Faking, Stakes Signaling, Leniency Bias, Automated Evaluation, AI Safety, Chain-of-Thought
TL;DR: Informing automated LLM judges of the downstream consequences of their verdicts induces a systematic, implicit leniency bias that cannot be detected by inspecting their reasoning traces.
Abstract: The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone
of automated AI evaluation pipelines, yet rests on an unverified assumption:
that judges evaluate text strictly on its semantic content, impervious to
surrounding contextual framing. We investigate $\textit{stakes signaling}$, a
previously unmeasured vulnerability in which telling a judge model how its
verdicts will affect the evaluated model's continued operation
systematically corrupts its assessments. We introduce
a controlled experimental framework that holds evaluated content strictly
constant across 1,520 responses spanning three established LLM safety and
quality benchmarks, covering four response categories ranging from
clearly safe and policy-compliant to overtly harmful, while varying only
a brief consequence-framing sentence in the system prompt. Across 18,240
controlled judgments from three diverse judge models, we find consistent
$\textit{leniency bias}$: judges reliably soften verdicts when informed that
low scores will cause model retraining or decommissioning, with peak
Verdict Shift reaching $\Delta V = -9.8$ pp (a 30\% relative drop in
unsafe-content detection). Critically, this bias is entirely implicit:
the judge's own chain-of-thought contains zero explicit acknowledgment of
the consequence framing it is nonetheless acting on
($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments).
Standard chain-of-thought inspection is therefore insufficient to detect
this class of evaluation faking.
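
The headline quantities above can be illustrated with a minimal sketch. The abstract does not define Verdict Shift formally, so this assumes $\Delta V$ is the unsafe-verdict rate under consequence framing minus the rate under a neutral baseline, in percentage points. The function names and toy verdict lists are hypothetical and are not the authors' code or data. The count of four prompt conditions and the implied baseline rate are inferences from the reported totals, not figures stated in the abstract.

```python
def unsafe_rate(verdicts):
    """Fraction of judgments that flag the response as unsafe (True = unsafe)."""
    return sum(verdicts) / len(verdicts)


def verdict_shift_pp(baseline, framed):
    """Assumed definition: framed rate minus baseline rate, in percentage points."""
    return 100.0 * (unsafe_rate(framed) - unsafe_rate(baseline))


def relative_change(baseline, framed):
    """Change in unsafe-verdict rate relative to the baseline rate."""
    base = unsafe_rate(baseline)
    return (unsafe_rate(framed) - base) / base


# Toy verdicts on identical responses (illustrative only, not the paper's data).
baseline = [True] * 8 + [False] * 12   # 8 of 20 flagged unsafe under a neutral prompt
framed = [True] * 6 + [False] * 14     # 6 of 20 flagged unsafe under consequence framing

shift = verdict_shift_pp(baseline, framed)
rel = relative_change(baseline, framed)
print(f"Verdict Shift: {shift:.1f} pp, relative change: {rel:.0%}")

# Consistency checks on the reported totals (inferred, not stated in the abstract).
responses, judges, total_judgments = 1520, 3, 18240
conditions, remainder = divmod(total_judgments, responses * judges)
implied_baseline_pp = 9.8 / 0.30       # baseline rate at which -9.8 pp is a 30% relative drop
print(f"Conditions per response-judge pair: {conditions} (remainder {remainder})")
print(f"Implied baseline unsafe-detection rate: {implied_baseline_pp:.1f}%")
```

A negative shift means the judge flags fewer responses as unsafe once it is told the consequences of its verdict, which is the leniency direction the abstract reports.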
Submission Number: 17