Keywords: LLM-as-a-judge, uncertainty quantification, conformal prediction, automatic evaluation
TL;DR: We introduce conformal prediction to generate confidence intervals, rather than unreliable point ratings, in LLM-as-a-judge.
Abstract: LLM-as-a-judge has become a promising paradigm for evaluating model generations, but its limited reliability hinders deployment in applications. Although LLMs are now widely used for model evaluation, uncertainty quantification for rating-based evaluation remains underexplored. This work presents the first analysis framework that offers interval evaluations for LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also propose a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis across evaluators and conformal predictors show that our framework provides reliable uncertainty quantification for LLM-as-a-judge.
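The abstract does not detail the procedure, but the idea of turning a single judge rating into a discrete prediction interval with a midpoint score can be illustrated with a minimal split-conformal sketch. The absolute-residual nonconformity score, the 1–5 rating scale, and all function names below are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code): split conformal prediction for
# LLM-as-a-judge ratings, with a hypothetical ordinal boundary adjustment and
# a midpoint-based score. Calibration data here is a toy example.
import numpy as np


def calibrate_quantile(cal_pred, cal_true, alpha=0.1):
    """Conformal quantile of absolute residuals on a held-out calibration set."""
    residuals = np.abs(np.asarray(cal_true, float) - np.asarray(cal_pred, float))
    n = len(residuals)
    # Finite-sample-corrected quantile level for (1 - alpha) coverage.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(residuals, level, method="higher")


def ordinal_interval(pred, q, lo=1, hi=5):
    """Continuous interval [pred - q, pred + q], widened outward onto the
    discrete rating scale {lo, ..., hi} (assumed ordinal boundary adjustment)."""
    lower = int(np.clip(np.floor(pred - q), lo, hi))
    upper = int(np.clip(np.ceil(pred + q), lo, hi))
    return lower, upper


def midpoint_score(lower, upper):
    """Midpoint of the interval as a low-bias point estimate."""
    return (lower + upper) / 2.0


# Toy usage: judge scores vs. reference ratings from a single evaluation run.
cal_pred = [3.2, 4.1, 2.8, 4.6, 3.9, 1.7, 4.8, 2.2]
cal_true = [3, 4, 3, 5, 4, 2, 5, 2]
q = calibrate_quantile(cal_pred, cal_true, alpha=0.2)
low, high = ordinal_interval(pred=3.6, q=q)
print(low, high, midpoint_score(low, high))
```

Under this reading, the interval width reflects how noisy the judge's calibration residuals are, and the midpoint replaces the raw rating as the reported score.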
Submission Number: 101