Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

ACL ARR 2025 May Submission2264 Authors

19 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: LLM-as-a-judge has become a promising paradigm for evaluating natural language generation (NLG), but its lack of reliability limits deployment in high-risk applications. While LLMs are commonly used to score LLM-generated content directly, uncertainty quantification for such rating evaluation remains underexplored. This work presents the first analysis framework to offer interval evaluations in LLM-based scoring via conformal prediction. Conformal prediction constructs continuous confidence intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. Extensive experiments and analysis across evaluators and conformal prediction methods show that our framework yields narrow intervals with reliable coverage, enabling more trustworthy evaluation for downstream decision making.
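The abstract describes the pipeline only at a high level; the following is a minimal sketch of how such an interval evaluation could be produced, assuming split conformal prediction with absolute-residual nonconformity scores on a held-out calibration set. The function name, the outward-rounding ordinal adjustment, and the toy numbers are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def conformal_interval(judge_scores_cal, true_scores_cal, judge_score_test,
                       alpha=0.1, rating_min=1, rating_max=5):
    """Hypothetical split conformal interval for an LLM-judge rating.

    judge_scores_cal / true_scores_cal: calibration-set judge scores and
    reference (e.g., human) scores; judge_score_test: judge score for a new item.
    """
    # Nonconformity: absolute residual between judge score and reference score.
    residuals = np.abs(np.asarray(judge_scores_cal, dtype=float)
                       - np.asarray(true_scores_cal, dtype=float))
    n = len(residuals)
    # Finite-sample-corrected (1 - alpha) quantile of calibration residuals.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level, method="higher")

    # Continuous interval from a single evaluation run.
    lo, hi = judge_score_test - q, judge_score_test + q

    # Ordinal boundary adjustment (assumed here: snap outward to the discrete
    # rating grid, then clip to the valid rating range).
    lo = max(rating_min, int(np.floor(lo)))
    hi = min(rating_max, int(np.ceil(hi)))

    # Midpoint-based point estimate inside the interval.
    midpoint = (lo + hi) / 2
    return lo, hi, midpoint

# Toy usage (numbers are illustrative, not from the paper):
lo, hi, mid = conformal_interval(
    judge_scores_cal=[4.0, 3.0, 5.0, 2.0, 4.0],
    true_scores_cal=[4, 3, 4, 2, 5],
    judge_score_test=3.6,
    alpha=0.2,
)
print(lo, hi, mid)
```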
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: LLM-as-a-judge, uncertainty quantification, conformal prediction, automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2264