Keywords: LLM evaluation, LLM-as-a-Judge, bias detection, Bayesian modeling
TL;DR: A novel framework for evaluating LLMs-as-a-judge that allows researchers to systematically detect, quantify, and correct for various biases.
Abstract: The evaluation of large language models (LLMs) is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", with the judging models often called autograders. While autograders offer a scalable alternative to human evaluation, they are not free from biases (e.g., favouring longer outputs or generations from their own model family). Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to address their primary research questions (e.g., LLM capability or risk assessment) while simultaneously identifying, quantifying and mitigating various biases in their autograders. Our approach applies to a range of evaluation formats (e.g., absolute scores or pairwise preferences) and augments traditional metrics (e.g., inter-rater agreement) with precise uncertainty estimates and a clearer view of the sources of disagreement between graders. The framework also enables efficient counterfactual simulations without costly re-evaluation (e.g., assessing agreement after removing systematic biases). We demonstrate these capabilities through simulated examples, and all methods are available in an open-source software package. Overall, we introduce a novel framework for autograder evaluation that allows researchers to detect, quantify and correct for biases in a systematic way.
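To make the general idea concrete, below is a minimal sketch (not the authors' implementation, whose package and model specification are not shown here) of the kind of Bayesian GLM the abstract describes: a logistic model of pairwise autograder preferences with explicit terms for verbosity and self-preference bias, fit on simulated data. All variable names, priors, and coefficient values are illustrative assumptions.

```python
# Illustrative Bayesian GLM for pairwise autograder preferences (hypothetical setup).
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 500
quality_diff = rng.normal(0, 1, n)     # latent quality gap between responses A and B
length_diff = rng.normal(0, 1, n)      # standardized length gap (potential verbosity bias)
same_family = rng.integers(0, 2, n)    # 1 if response A comes from the judge's own model family
true_logit = 1.2 * quality_diff + 0.6 * length_diff + 0.8 * same_family
prefers_a = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # simulated judge verdicts

with pm.Model() as judge_model:
    beta_quality = pm.Normal("beta_quality", 0, 1)
    beta_length = pm.Normal("beta_length", 0, 1)   # verbosity (length) bias
    beta_family = pm.Normal("beta_family", 0, 1)   # self-preference bias
    eta = beta_quality * quality_diff + beta_length * length_diff + beta_family * same_family
    pm.Bernoulli("prefers_a", logit_p=eta, observed=prefers_a)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Counterfactual check in the spirit of the abstract: posterior preference probabilities
# with the bias terms zeroed out, i.e. without re-running the autograder.
post = idata.posterior
debiased_p = 1 / (1 + np.exp(-(post["beta_quality"].values[..., None] * quality_diff)))
```

Under this kind of specification, the posterior on the bias coefficients quantifies each bias with uncertainty, and setting those coefficients to zero gives the "agreement after removing systematic biases" counterfactual mentioned in the abstract.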
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19956