$\texttt{BetaConform}$: Efficient MAP Estimation of LLM Ensemble Judgment Performance with Prior Transfer

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: LLM-as-a-Judge, Distribution Estimation, LLM Ensemble
TL;DR: We present an efficient judgment distribution estimation method for LLM ensembles.
Abstract: Ensembles of LLMs are widely used as judges. However, how to estimate their accuracy, especially efficiently, remains unclear. In this paper, we present a principled $\textit{maximum a posteriori}$ (MAP) framework for economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, in place of the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that uses distributions learned on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present $\texttt{BetaConform}$, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with a minimal number of labeled samples. $\texttt{BetaConform}$ is also validated empirically. For instance, with only $10$ samples from the TruthfulQA dataset, $\texttt{BetaConform}$ estimates the performance of a Llama ensemble judge with an error margin as small as $3.37\%$.
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 8307
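
To make the distributional assumption in the abstract concrete, here is a minimal sketch (not the authors' implementation) comparing a vanilla Binomial with a two-component Beta-Binomial mixture. It assumes the modeled quantity is the number of correct judgments among `k` ensemble members and that the ensemble decides by majority vote. The function names, the mixture weight, and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(x: int, k: int, p: float) -> float:
    """P(X = x) when each of k judges is correct independently with prob. p."""
    return math.comb(k, x) * p**x * (1.0 - p) ** (k - x)


def beta_binomial_pmf(x: int, k: int, a: float, b: float) -> float:
    """P(X = x) when the per-question success prob. is drawn from Beta(a, b)."""
    return math.comb(k, x) * math.exp(log_beta(x + a, k - x + b) - log_beta(a, b))


def mixture_pmf(x, k, w, comp1, comp2):
    """Two-component Beta-Binomial mixture; comp1/comp2 are (a, b) pairs."""
    return (w * beta_binomial_pmf(x, k, *comp1)
            + (1.0 - w) * beta_binomial_pmf(x, k, *comp2))


def majority_accuracy(pmf, k: int) -> float:
    """Probability that strictly more than half of the k judges are correct."""
    return sum(pmf(x) for x in range(k // 2 + 1, k + 1))


if __name__ == "__main__":
    k = 5  # hypothetical ensemble size (odd, so there are no ties)
    # Hypothetical parameters: one "easy" and one "hard" question component.
    w, easy, hard = 0.7, (8.0, 2.0), (2.0, 3.0)
    acc_binom = majority_accuracy(lambda x: binomial_pmf(x, k, 0.7), k)
    acc_mix = majority_accuracy(lambda x: mixture_pmf(x, k, w, easy, hard), k)
    print(f"Binomial majority accuracy:     {acc_binom:.4f}")
    print(f"Beta-Binomial mixture accuracy: {acc_mix:.4f}")
```

The mixture lets the per-question success probability vary instead of fixing a single value for every question, which is the kind of flexibility a single Binomial lacks. In a MAP setting, the mixture parameters would be estimated from labeled samples together with a prior; that step is omitted here.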