Keywords: LLM as a judge, graphical model
TL;DR: We introduce CARE, a confounder-aware LLM-as-a-judge aggregation framework that explicitly models latent biases and reduces aggregation error by up to 25% for more accurate, robust evaluation.
Abstract: LLM-as-a-judge, often with multiple judges, is now the standard paradigm for scalable model evaluation. This strategy is known to suffer from biases, spurious correlations, and confounding factors, and many heuristic approaches have been proposed to address them. We tackle this problem from the perspective of probabilistic graphical models, enabling us to \textbf{\emph{capture the challenges involved in using multiple judges in a principled way}}. By considering Markov random fields (MRFs) with multiple latent factors, we can model undesired correlations between judges, a latent, unknown true notion of quality, and one or more additional latent distractors (for example, generation length). The key technical challenge is to identify and learn a higher-rank latent-variable MRF, which we solve via a new approach that combines sparse-plus-low-rank and tensor decompositions. This lets us better understand the quality and behavior of judges, leading to improved evaluation capabilities. In addition, we show how to augment our approach with programmatic judges that can be cheaply constructed and added to standard model-based judges. Empirically, our framework, CARE (Confounder-Aware Aggregation for Reliable Evaluation), demonstrates consistent gains on diverse public benchmarks, reducing aggregation error by up to 25.15\% and integrating programmatic judges robustly. CARE also offers better performance and efficiency than individual-judge intervention strategies. These results underscore CARE's ability to model correlations between judges and mitigate biases, yielding more accurate and robust aggregation of LLM judge scores.
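The following toy simulation (not the CARE algorithm itself, and with the distractor observed rather than latent as in the paper) illustrates the failure mode the abstract describes: a shared distractor such as generation length shifts every judge's score, so naive averaging misattributes the shift to quality, while removing the distractor's effect before aggregating recovers the true signal. All variable names and parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_judges = 5000, 5

quality = rng.integers(0, 2, size=n)      # latent true quality (0/1)
distractor = rng.integers(0, 2, size=n)   # e.g., a "long generation" flag

# Each judge's score mixes the true quality with the shared distractor,
# which induces spurious correlation between judges.
scores = (quality[:, None]
          + distractor[:, None]           # shared confounding shift
          + rng.normal(0, 0.5, size=(n, n_judges)))
mean_score = scores.mean(axis=1)

# Naive aggregation: threshold the averaged score at its overall mean.
naive_pred = (mean_score > mean_score.mean()).astype(int)
naive_acc = (naive_pred == quality).mean()

# Confounder-aware aggregation: center the averaged score within each
# distractor group, removing the shared shift before thresholding.
centered = mean_score.copy()
for d in (0, 1):
    centered[distractor == d] -= mean_score[distractor == d].mean()
aware_pred = (centered > 0).astype(int)
aware_acc = (aware_pred == quality).mean()

print(f"naive accuracy: {naive_acc:.3f}, confounder-aware: {aware_acc:.3f}")
```

In this toy setting the confounder is given to the corrected aggregator; the harder problem CARE addresses is identifying and learning such latent factors directly from the judges' scores.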
Submission Number: 53