Abstract: Large Language Models (LLMs) are increasingly used as judges to evaluate code artifacts when exhaustive
human review or executable test coverage is unavailable. Such judges are increasingly relevant in agentic
software engineering workflows, where they can help rank candidate solutions and guide patch selection. While
attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of
the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving,
human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code
through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code
repair, and test generation, and we systematically probe prompt-induced biases (e.g., position, verbosity,
authority/provenance, distraction, chain-of-thought, self-enhancement, and refined-version cues). Our study
combines repeated runs across difficulty levels with controlled prompt interventions that isolate one presentation
cue at a time, and it evaluates judges on consistency and sensitivity to bias.
We find that judge decisions are highly sensitive to prompt biases even when the underlying code snippet
is unchanged. Across all three tasks, several biases systematically shift preferences toward the option favored
by the prompt, improving accuracy when that option aligns with the gold answer but substantially reducing it
otherwise. In some settings, these effects are large enough to change task-level conclusions and alter relative
model rankings. These findings show that reported judge performance may reflect prompt artifacts rather
than stable assessment ability, posing a direct threat to the validity and reproducibility of code evaluation. We
therefore argue that LLM-as-a-Judge studies should report bias sensitivity alongside accuracy and incorporate
explicit controls, such as A/B order swapping and controlled prompt perturbations, to support more trustworthy
model comparison in software engineering.
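
As an illustration of the kind of control recommended here, the following minimal sketch computes a repeated-run consistency score and an A/B order-swap flip rate. It assumes a pairwise setup in which a judge sees two candidates and answers "A" or "B"; the `Judge` signature and the function names `consistency` and `order_swap_flip_rate` are hypothetical, introduced only for this example, and are not the paper's own metrics or implementation.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes the candidate shown first and the candidate shown second,
# and returns "A" if it prefers the first one, "B" otherwise.
Judge = Callable[[str, str], str]


def consistency(verdicts: Sequence[str]) -> float:
    """Fraction of repeated verdicts that agree with the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def order_swap_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred candidate changes when A/B order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for first, second in pairs:
        original = judge(first, second)   # "A" means `first` is preferred
        swapped = judge(second, first)    # "A" now means `second` is preferred
        # Translate the swapped verdict back into the original labelling.
        swapped_in_original_labels = "B" if swapped == "A" else "A"
        if original != swapped_in_original_labels:
            flips += 1
    return flips / len(pairs)
```

Under these assumptions, a judge that always picks the first-listed option has a flip rate of 1.0, while a judge whose preference depends only on candidate content has a flip rate of 0.0. Reporting such a rate next to accuracy is one way to expose position sensitivity.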