LLM-as-a-Judge on a Budget

Published: 03 Feb 2026, Last Modified: 03 Feb 2026, AISTATS 2026 Poster, CC BY 4.0
TL;DR: Given a fixed budget for LLM evaluation queries, how should you allocate them? We present a theoretically-grounded, variance-adaptive approach that significantly outperforms uniform sampling across diverse evaluation tasks.
Abstract: LLM-as-a-judge has emerged as a cornerstone technique for automatic evaluation of large language models (LLMs). The key idea is to leverage the reasoning capabilities of LLMs to evaluate prompt-response pairs using rationales paired with numeric scores, thus combining the comprehensiveness of human evaluation with automated metrics. Because the rationales and scores are sampled from an LLM, they are inherently random. To obtain a more precise estimate of the mean score generated by the LLM, a common practice is to evaluate each prompt-response pair multiple times. Practitioners therefore face a critical challenge: given a fixed computational budget, how should LLM judgments be allocated across prompt-response pairs to estimate the mean scores as precisely as possible? We present a principled variance-adaptive approach that addresses this fundamental problem by leveraging insights from multi-armed bandit (MAB) theory and concentration inequalities. Our method dynamically allocates LLM judgments based on the estimated variance of LLM scores for each prompt-response pair, concentrating computational resources on the most uncertain scores. We prove that our algorithm achieves near-optimal sample complexity for minimizing the worst-case estimation error across prompt-response pairs, providing theoretical guarantees for practitioners working under budget constraints. Extensive experiments on two popular evaluation datasets, *Summarize from Feedback* and *HelpSteer2*, show that our method significantly reduces the worst-case error of the estimated mean scores under a fixed query budget. Our results establish a novel theoretical foundation for efficient LLM judges and provide practical guidance for deploying such evaluation pipelines at scale, with broad implications for AI safety, model development, and automated assessment in our increasingly LLM-driven world.
Submission Number: 1912
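
The variance-adaptive allocation idea described in the abstract can be illustrated with a short sketch. The snippet below is not the paper's algorithm; it is a minimal illustration, assuming a hypothetical `query_judge(pair)` callable that returns one noisy judge score per call, a small warm-up phase to seed variance estimates, and a greedy rule that always spends the next query on the pair whose mean-score estimate currently has the largest standard error.

```python
import math
import random


def variance_adaptive_allocation(pairs, query_judge, budget, warmup=2):
    """Illustrative sketch: spend a fixed budget of judge queries by repeatedly
    querying the prompt-response pair whose mean-score estimate is currently
    most uncertain (largest variance-based standard error).

    pairs       : list of prompt-response pairs to evaluate
    query_judge : hypothetical callable, query_judge(pair) -> float score
    budget      : total number of judge queries allowed
    warmup      : initial queries per pair so variances can be estimated
    """
    assert budget >= warmup * len(pairs), "budget must cover the warm-up phase"
    scores = {i: [] for i in range(len(pairs))}

    # Warm-up phase: a few queries per pair to get initial variance estimates.
    for i, pair in enumerate(pairs):
        for _ in range(warmup):
            scores[i].append(query_judge(pair))
    spent = warmup * len(pairs)

    def width(i):
        # Standard error of the mean-score estimate for pair i.
        s = scores[i]
        n = len(s)
        mean = sum(s) / n
        var = sum((x - mean) ** 2 for x in s) / max(n - 1, 1)
        return math.sqrt(var / n)

    # Adaptive phase: each remaining query goes to the most uncertain pair.
    while spent < budget:
        i_star = max(scores, key=width)
        scores[i_star].append(query_judge(pairs[i_star]))
        spent += 1

    # Return the mean-score estimate for every pair.
    return {i: sum(s) / len(s) for i, s in scores.items()}


# Example with a mock judge (for illustration only): pair_b has noisier scores,
# so the adaptive rule should allocate it more queries than the other pairs.
mock_pairs = ["pair_a", "pair_b", "pair_c"]
mock_judge = lambda pair: random.gauss(3.0, 1.0 if pair == "pair_b" else 0.2)
print(variance_adaptive_allocation(mock_pairs, mock_judge, budget=60))
```

Greedily querying the pair with the largest standard error is a standard bandit-style heuristic for reducing worst-case estimation error; the paper's exact allocation rule and its theoretical guarantees may differ.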