Smarter Sampling for LLM Judges: Reliable Evaluation on a Budget

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · License: CC BY 4.0
Keywords: Evaluation, LLM judge, Selection, Budget
TL;DR: We introduce a Chernoff-based lower bound on the number of human annotations needed to evaluate LLM judges and propose data-selection methods that improve ICC alignment by up to 41% using only a subset of labels.
Abstract: LLM-as-a-judge is increasingly dominant as a framework for scalable evaluation of artificial intelligence (AI) systems and agents. The technique prompts a large language model (LLM) to assess the capabilities of another AI model. Although this reduces human annotation requirements, human oversight is still needed to gauge the performance of the judge LLM itself. However, human annotations can be expensive to obtain, particularly in domains that require expert annotations, such as clinical text generation. This motivates two questions: (1) Can we bound the number of human annotations necessary to gauge the performance of our judge LLM? and (2) Can we curate the subset of data for human annotation in a principled way? In this paper, we answer (1) through a Chernoff bound for the intraclass correlation coefficient (ICC), the primary metric for measuring LLM-as-judge performance relative to human labels. To explore (2), we propose $7$ sampling methods and demonstrate their utility relative to random sampling on simulated and real-world data. We show tighter bounds on sampling requirements and up to a 41\% relative improvement in ICC precision compared to random baselines.
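To make the evaluation setup concrete, the sketch below illustrates the two ingredients the abstract refers to, under assumptions of our own: it computes ICC(2,1) (the specific ICC form is our choice; the paper does not state which variant it uses) between human and LLM-judge scores, and it includes a generic Hoeffding/Chernoff-style sample-size calculation for estimating a bounded mean score, which is only an illustration and not the paper's ICC-specific bound. The helper names `icc_2_1` and `hoeffding_sample_size` are hypothetical.

```python
# Minimal illustrative sketch (not the paper's method or code).
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_items, n_raters) matrix, e.g. column 0 = human, column 1 = LLM judge."""
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)
    msr = k * np.sum((item_means - grand) ** 2) / (n - 1)        # between-item mean square
    msc = n * np.sum((rater_means - grand) ** 2) / (k - 1)       # between-rater mean square
    resid = ratings - item_means[:, None] - rater_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))               # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def hoeffding_sample_size(eps: float, delta: float, score_range: float = 1.0) -> int:
    """Generic Chernoff/Hoeffding bound: number of annotated items so that an
    empirical mean of scores lying in an interval of width score_range is within
    eps of its expectation with probability >= 1 - delta.
    Illustrative only; the paper derives a bound tailored to the ICC."""
    return int(np.ceil(score_range ** 2 * np.log(2 / delta) / (2 * eps ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_quality = rng.normal(3.0, 1.0, size=200)                # latent item quality
    human = true_quality + rng.normal(0.0, 0.5, size=200)        # noisy human scores
    judge = true_quality + rng.normal(0.0, 0.7, size=200)        # noisier judge scores
    print(f"ICC(2,1) human vs. judge: {icc_2_1(np.column_stack([human, judge])):.3f}")
    print(f"Items for eps=0.05, delta=0.05: {hoeffding_sample_size(0.05, 0.05)}")
```

In this toy setting, the judge tracks the same latent quality as the human with extra noise, so the ICC comes out high; curating which items receive human labels (the paper's question (2)) aims to reach a reliable ICC estimate with fewer annotations than the worst-case bound suggests.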
Submission Number: 130