LLM-as-a-Judge on a Budget

Published: 03 Feb 2026, Last Modified: 02 May 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: Given a fixed budget for LLM evaluation queries, how should you allocate them? We present a theoretically grounded, variance-adaptive approach that significantly outperforms uniform sampling across diverse evaluation tasks.
Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach that leverages multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, focusing resources where uncertainty is highest. The algorithm achieves a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, where $\sigma_i^2$ is the unknown score variance of pair $i \in [K]$, with near-optimal budget allocation. Experiments on the HelpSteer2 dataset demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error under a fixed budget. Our work establishes a theoretical foundation for efficient LLM evaluation, with practical implications for AI safety, model alignment, and automated assessment at scale.
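To make the allocation idea concrete, the following is a minimal Python sketch of a variance-adaptive scheme in the spirit of the abstract; it is illustrative only, not the paper's algorithm. The function name `variance_adaptive_allocation`, the `query_fn` interface, and the `warmup` and `bonus` parameters are all assumptions introduced here.

```python
import numpy as np

def variance_adaptive_allocation(query_fn, K, B, warmup=2, bonus=1e-3):
    """Allocate a budget of B judge queries across K prompt-response pairs.

    Illustrative sketch (not the paper's exact method): after a short
    uniform warmup, each remaining query goes to the pair whose mean-score
    estimate currently has the largest estimated standard error,
    var_i / n_i, plus a small UCB-style exploration bonus so that no pair
    with a temporarily zero empirical variance is starved.
    `query_fn(i)` is an assumed interface returning one stochastic judge
    score for pair i.
    """
    assert B >= K * warmup, "budget must cover the warmup phase"
    scores = [[] for _ in range(K)]

    # Warmup: a few uniform queries per pair to seed variance estimates.
    for i in range(K):
        for _ in range(warmup):
            scores[i].append(query_fn(i))
    spent = K * warmup

    # Adaptive phase: spend each remaining query where uncertainty
    # about the mean score is currently highest.
    while spent < B:
        crit = [np.var(s, ddof=1) / len(s) + bonus * np.log(B) / len(s)
                for s in scores]
        i = int(np.argmax(crit))
        scores[i].append(query_fn(i))
        spent += 1

    # Return the per-pair mean-score estimates.
    return np.array([np.mean(s) for s in scores])


if __name__ == "__main__":
    # Toy usage with a simulated judge whose noise differs across pairs;
    # high-variance pairs should receive more of the budget.
    rng = np.random.default_rng(0)
    means, stds = np.array([0.2, 0.5, 0.8]), np.array([0.05, 0.3, 0.1])
    judge = lambda i: rng.normal(means[i], stds[i])
    print(variance_adaptive_allocation(judge, K=3, B=60))
```

Under this greedy criterion, the budget concentrates on pairs with large $\sigma_i^2$, which is the qualitative behavior the abstract's $\tilde{O}\big(\sqrt{\sum_i \sigma_i^2 / B}\big)$ bound rewards.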
Submission Number: 1912