Keywords: LLM evaluation, LLM-as-a-judge, graphical model, inference aggregation
TL;DR: We introduce CARE, a confounder-aware LLM-as-a-judge aggregation framework that explicitly models latent biases and reduces aggregation error by up to 25% for more accurate, robust evaluation.
Abstract: LLM-as-a-judge---often with multiple judges---is now the standard for scalable model evaluation, yet judge biases and correlations can amplify errors. We cast aggregation as inference in a latent-factor Markov random field that jointly models a latent true-quality variable, inter-judge correlations, and confounders (e.g., generation length). We address two key technical challenges---identifiability and learning a higher-rank latent structure---via **CARE**, a two-stage estimator that uses sparse+low-rank structure recovery and tensor decomposition to separate quality from spurious factors. This separation gives a clearer picture of each judge's quality and behavior, which in turn improves evaluation. Empirically, CARE reduces aggregation error by up to **25.15%** and seamlessly incorporates cheaply constructed programmatic judges, while matching or surpassing individual-judge intervention strategies.
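The sketch below is a rough illustration of the confounder-aware aggregation idea, not the paper's CARE estimator: it assumes a single observed confounder (generation length) and a one-factor quality model, regresses the confounder out of each judge's scores, and weights judges by the leading eigenvector of the residual covariance. All variable names, loadings, and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): N items scored by J judges.
N, J = 2000, 5
quality = rng.normal(size=N)             # latent true quality
length = rng.normal(size=N)              # observed confounder (e.g., generation length)

load_q = rng.uniform(0.7, 1.0, size=J)   # each judge's sensitivity to quality
load_c = rng.uniform(0.2, 0.6, size=J)   # each judge's bias toward the confounder
noise = rng.normal(scale=0.5, size=(N, J))
scores = quality[:, None] * load_q + length[:, None] * load_c + noise

# Step 1: regress the observed confounder out of each judge's scores.
X = np.column_stack([np.ones(N), length])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
residual = scores - X @ beta

# Step 2: one-factor approximation -- the leading eigenvector of the
# residual covariance serves as per-judge weights for the shared quality factor.
cov = np.cov(residual, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
weights = eigvecs[:, -1]
weights *= np.sign(weights.sum())        # fix the sign so weights are positive

estimate = residual @ weights
naive = scores.mean(axis=1)

print("corr(naive mean, quality):      ", np.corrcoef(naive, quality)[0, 1])
print("corr(confounder-aware, quality):", np.corrcoef(estimate, quality)[0, 1])
```

On this synthetic setup the confounder-aware estimate correlates more strongly with the latent quality than the naive judge average; the paper's method additionally recovers unobserved confounders and higher-rank latent structure via sparse+low-rank recovery and tensor decomposition, which this toy does not attempt.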
Submission Number: 37