Keywords: LLM evaluation, LLM-as-a-judge, graphical model, inference aggregation
TL;DR: We introduce CARE, a confounder-aware LLM-as-a-judge aggregation framework that explicitly models latent biases and reduces aggregation error by up to 25% for more accurate, robust evaluation.
Abstract: LLM-as-a-judge---often with multiple judges---is now the standard for scalable model evaluation, yet judge biases and correlations can amplify errors. We cast aggregation as inference in a latent-factor Markov random field that jointly models a latent true-quality variable, inter-judge correlations, and confounders (e.g., generation length). We address two key technical challenges---identifiability and learning a higher-rank latent structure---via **CARE**, a two-stage estimator that uses sparse+low-rank structure recovery and tensor decomposition to separate quality from spurious factors. This separation gives a clearer picture of each judge's quality and behavior, which in turn improves evaluation. Empirically, CARE reduces aggregation error by up to **25.15%** and seamlessly incorporates cheaply constructed programmatic judges, while matching or surpassing individual-judge intervention strategies.
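The sketch below is a rough illustration of the confounder-aware aggregation idea, not the paper's CARE estimator: it assumes a single observed confounder (generation length) and a one-factor quality model, regresses the confounder out of each judge's scores, and weights judges by the leading eigenvector of the residual covariance. All variable names, loadings, and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): N items scored by J judges.
N, J = 2000, 5
quality = rng.normal(size=N)             # latent true quality
length = rng.normal(size=N)              # observed confounder (e.g., generation length)

load_q = rng.uniform(0.7, 1.0, size=J)   # each judge's sensitivity to quality
load_c = rng.uniform(0.2, 0.6, size=J)   # each judge's bias toward the confounder
noise = rng.normal(scale=0.5, size=(N, J))
scores = quality[:, None] * load_q + length[:, None] * load_c + noise

# Step 1: regress the observed confounder out of each judge's scores.
X = np.column_stack([np.ones(N), length])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
residual = scores - X @ beta

# Step 2: one-factor approximation -- the leading eigenvector of the
# residual covariance serves as per-judge weights for the shared quality factor.
cov = np.cov(residual, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
weights = eigvecs[:, -1]
weights *= np.sign(weights.sum())        # fix the sign so weights are positive

estimate = residual @ weights
naive = scores.mean(axis=1)

print("corr(naive mean, quality):      ", np.corrcoef(naive, quality)[0, 1])
print("corr(confounder-aware, quality):", np.corrcoef(estimate, quality)[0, 1])
```

On this synthetic setup the confounder-aware estimate correlates more strongly with the latent quality than the naive judge average; the paper's method additionally recovers unobserved confounders and higher-rank latent structure via sparse+low-rank recovery and tensor decomposition, which this toy does not attempt.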
Submission Number: 37