Keywords: LLM evaluation, LLM-as-a-Judge, bias detection, Bayesian modeling
TL;DR: A novel framework for evaluating LLMs-as-a-judge that allows researchers to systematically detect, quantify, and correct for various biases.
Abstract: The evaluation of large language models (LLMs) is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", with the judging models often called autograders. While autograders offer a scalable alternative to human evaluation, they are not free from biases (e.g., favouring longer outputs or generations from their own model family). Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to address their primary research questions (e.g., LLM capability or risk assessment) while simultaneously identifying, quantifying and mitigating various biases in their autograders. Our approach applies to a range of evaluation formats (e.g., absolute scores or pairwise preferences) and augments traditional metrics (e.g., inter-rater agreement) with precise uncertainty estimates and a clearer view of the sources of disagreement between graders. The framework also enables efficient counterfactual simulations without costly re-evaluation (e.g., assessing agreement after removing systematic biases). We demonstrate these capabilities through simulated examples, and all methods are available in an open-source software package. Overall, we introduce a novel framework for autograder evaluation that allows researchers to detect, quantify and correct for biases in a systematic way.
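To make the general idea concrete, below is a minimal sketch (not the authors' implementation, whose package and model specification are not shown here) of the kind of Bayesian GLM the abstract describes: a logistic model of pairwise autograder preferences with explicit terms for verbosity and self-preference bias, fit on simulated data. All variable names, priors, and coefficient values are illustrative assumptions.

```python
# Illustrative Bayesian GLM for pairwise autograder preferences (hypothetical setup).
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 500
quality_diff = rng.normal(0, 1, n)     # latent quality gap between responses A and B
length_diff = rng.normal(0, 1, n)      # standardized length gap (potential verbosity bias)
same_family = rng.integers(0, 2, n)    # 1 if response A comes from the judge's own model family
true_logit = 1.2 * quality_diff + 0.6 * length_diff + 0.8 * same_family
prefers_a = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # simulated judge verdicts

with pm.Model() as judge_model:
    beta_quality = pm.Normal("beta_quality", 0, 1)
    beta_length = pm.Normal("beta_length", 0, 1)   # verbosity (length) bias
    beta_family = pm.Normal("beta_family", 0, 1)   # self-preference bias
    eta = beta_quality * quality_diff + beta_length * length_diff + beta_family * same_family
    pm.Bernoulli("prefers_a", logit_p=eta, observed=prefers_a)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Counterfactual check in the spirit of the abstract: posterior preference probabilities
# with the bias terms zeroed out, i.e. without re-running the autograder.
post = idata.posterior
debiased_p = 1 / (1 + np.exp(-(post["beta_quality"].values[..., None] * quality_diff)))
```

Under this kind of specification, the posterior on the bias coefficients quantifies each bias with uncertainty, and setting those coefficients to zero gives the "agreement after removing systematic biases" counterfactual mentioned in the abstract.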
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19956