SMARTER SAMPLING FOR LLM JUDGES: RELIABLE EVALUATION ON A BUDGET

ICLR 2026 Conference Submission 20971 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Evaluation, data-centric, confidence, LLM-judge
Abstract: Large language models (LLMs) are increasingly employed as judges for scalable evaluation of AI systems, where an LLM is prompted to assess the outputs of another model. This approach is particularly valuable for tasks with non-verifiable answers, but its reliability ultimately depends on alignment with human judgments. Because human annotations are expensive and time-consuming, especially in domains that demand expert knowledge such as clinical text generation, it is essential to reduce annotation effort while maintaining accurate estimates of judge reliability. In this work, we study the problem of estimating the intraclass correlation coefficient (ICC) between LLM judges and humans under limited annotation budgets. We derive Chernoff bounds on the estimation error, providing theoretical guarantees on the number of annotations required and reducing sample-size requirements by an average of 18\% relative to the baseline. Building on this, we propose and evaluate $6$ sampling strategies designed to identify the most informative examples for annotation. Experiments on $4$ diverse real-world datasets demonstrate that our methods yield narrower confidence intervals and achieve relative improvements of 5.5\%–31\% in ICC precision over random sampling baselines.
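To make the quantity being estimated concrete, the sketch below computes a standard intraclass correlation coefficient, ICC(2,1) (two-way random effects, absolute agreement), from paired LLM-judge and human scores on a shared set of items. This is an illustrative assumption about the agreement statistic; the paper's exact estimator, its Chernoff-style error bounds, and the proposed sampling strategies are not reproduced here, and the toy scores are hypothetical.

```python
# Minimal sketch: ICC(2,1) between an LLM judge and a human rater.
# Assumes numeric ratings on the same items; not the paper's implementation.
import numpy as np


def icc_2_1(ratings: np.ndarray) -> float:
    """ratings: (n_items, n_raters) matrix of numeric scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-item means
    col_means = ratings.mean(axis=0)  # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-item variation
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater variation
    ss_err = ss_total - ss_rows - ss_cols           # residual variation

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement.
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy usage: column 0 = human scores, column 1 = LLM-judge scores (hypothetical).
scores = np.array([[4, 4], [2, 3], [5, 5], [1, 2], [3, 3]], dtype=float)
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```

Under a limited annotation budget, only a subset of items would receive the human column; the paper's contribution is choosing that subset so the resulting ICC estimate has tight confidence intervals.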
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20971