Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges

ICLR 2026 Conference Submission 19056 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM-as-a-Judge, Hypothesis testing, Finite-sample guarantees, Type I/II errors
TL;DR: A statistically grounded framework is proposed for evaluating LLMs as imperfect judges, offering finite-sample guarantees by modeling judge reliability through hypothesis testing.
Abstract: With the rapid proliferation of large language models (LLMs) across diverse applications, the need for evaluation procedures that provide statistical guarantees of reliability is becoming increasingly pressing. Yet current evaluation and monitoring systems, such as LLM judges validated on only a small number of human-annotated examples, suffer from poor calibration, leading to inadequate certification of LLM performance and, in turn, eroding trust in both evaluation frameworks and the models themselves. We address this challenge by introducing a principled evaluation framework for LLM-as-a-Judge settings that leverages hypothesis testing with finite-sample, population-level guarantees. Our approach reformulates standard hypothesis tests into proxy noisy tests that explicitly account for judge imperfections through two key parameters: the true positive rate (TPR) and false positive rate (FPR). These parameters are estimated using a small human-labeled dataset, while test statistics are computed on a large collection of noisy judge-labeled data. This design contrasts with prediction-powered inference (PPI) frameworks: here, human labels are used exclusively to model the judge rather than to correct predictions. We provide theoretical analysis, including a full characterization of type I and type II error probabilities and the conditions under which valid evaluation is possible, and empirical validation across multiple datasets including Jigsaw Comment, Hate Speech, and SafeRLHF. Our experiments show that while noise-aware LLM evaluation procedures (including ours) outperform direct hypothesis testing, a considerable performance gap remains relative to the setting in which the judge noise is fully observed. In doing so, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
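
The abstract describes the approach only at a high level; the snippet below is a minimal Python sketch of a noise-aware proxy test of this general kind, under the common mixing assumption that the judge's observed positive rate satisfies q = TPR·p + FPR·(1−p), where p is the true pass rate of the evaluated model. The function names, the exact binomial proxy test, and the plug-in use of estimated TPR/FPR (which ignores their own estimation uncertainty, something the paper's finite-sample analysis is concerned with) are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy import stats


def estimate_judge_rates(human_labels, judge_labels):
    """Estimate the judge's TPR and FPR on a small human-annotated set.

    Assumes both classes appear in the human labels.
    """
    human = np.asarray(human_labels, dtype=bool)
    judge = np.asarray(judge_labels, dtype=bool)
    tpr = judge[human].mean()    # P(judge says positive | truly positive)
    fpr = judge[~human].mean()   # P(judge says positive | truly negative)
    return tpr, fpr


def noisy_one_sided_test(judge_labels_large, p0, tpr, fpr, alpha=0.05):
    """Proxy test of H0: true pass rate <= p0 vs. H1: > p0 using judge labels.

    Under q = tpr*p + fpr*(1-p), the null boundary p = p0 maps to an
    observed-rate boundary q0 = tpr*p0 + fpr*(1-p0), so an exact one-sided
    binomial test can be run directly on the judge-labeled data.
    """
    y = np.asarray(judge_labels_large, dtype=bool)
    n = y.size
    k = int(y.sum())                      # number of judge-labeled "passes"
    q0 = tpr * p0 + fpr * (1.0 - p0)      # null boundary in judge-label space
    pval = stats.binomtest(k, n, q0, alternative="greater").pvalue
    return pval, pval < alpha
```

A typical usage would first call estimate_judge_rates on the small human-labeled set and then pass the estimated rates, together with a large judge-labeled sample and a certification threshold p0, to noisy_one_sided_test; the mapping is informative only when TPR exceeds FPR, which mirrors the paper's point that evaluation power depends on judge quality.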
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19056