Keywords: LLM-as-a-judge, uncertainty quantification, conformal prediction, automatic evaluation
TL;DR: We introduce conformal prediction to generate confidence intervals, rather than unreliable point ratings, in LLM-as-a-judge.
Abstract: LLM-as-a-judge has become a promising paradigm for evaluating model generations, but its limited reliability hinders deployment in applications. Although LLMs are now widely used for model evaluation, uncertainty quantification for rating-based evaluation remains underexplored. This work presents the first analysis framework that offers interval evaluations for LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also propose a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis across evaluators and conformal predictors show that our framework provides reliable uncertainty quantification for LLM-as-a-judge.
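The abstract does not detail the procedure, but the idea of turning a single judge rating into a discrete prediction interval with a midpoint score can be illustrated with a minimal split-conformal sketch. The absolute-residual nonconformity score, the 1–5 rating scale, and all function names below are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code): split conformal prediction for
# LLM-as-a-judge ratings, with a hypothetical ordinal boundary adjustment and
# a midpoint-based score. Calibration data here is a toy example.
import numpy as np


def calibrate_quantile(cal_pred, cal_true, alpha=0.1):
    """Conformal quantile of absolute residuals on a held-out calibration set."""
    residuals = np.abs(np.asarray(cal_true, float) - np.asarray(cal_pred, float))
    n = len(residuals)
    # Finite-sample-corrected quantile level for (1 - alpha) coverage.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(residuals, level, method="higher")


def ordinal_interval(pred, q, lo=1, hi=5):
    """Continuous interval [pred - q, pred + q], widened outward onto the
    discrete rating scale {lo, ..., hi} (assumed ordinal boundary adjustment)."""
    lower = int(np.clip(np.floor(pred - q), lo, hi))
    upper = int(np.clip(np.ceil(pred + q), lo, hi))
    return lower, upper


def midpoint_score(lower, upper):
    """Midpoint of the interval as a low-bias point estimate."""
    return (lower + upper) / 2.0


# Toy usage: judge scores vs. reference ratings from a single evaluation run.
cal_pred = [3.2, 4.1, 2.8, 4.6, 3.9, 1.7, 4.8, 2.2]
cal_true = [3, 4, 3, 5, 4, 2, 5, 2]
q = calibrate_quantile(cal_pred, cal_true, alpha=0.2)
low, high = ordinal_interval(pred=3.6, q=q)
print(low, high, midpoint_score(low, high))
```

Under this reading, the interval width reflects how noisy the judge's calibration residuals are, and the midpoint replaces the raw rating as the reported score.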
Submission Number: 101