Keywords: LLM-as-a-judge, conformal prediction, transitivity violations, natural language generation evaluation, uncertainty quantification
TL;DR: We introduce transitivity analysis and conformal prediction sets to diagnose the per-instance reliability of LLM judges, finding that the evaluation criterion determines reliability more than the specific model used.
Abstract: LLM-as-a-judge frameworks are increasingly used for automatic NLG evaluation,
yet their per-instance reliability remains poorly understood.
We present a two-pronged diagnostic toolkit applied to SummEval:
$\textbf{(1)}$ a transitivity analysis that reveals widespread per-input
inconsistency masked by low aggregate violation rates
($\bar{\rho} = 0.8$--$4.1\%$), with $33$--$67\%$ of documents
exhibiting at least one directed 3-cycle; and
$\textbf{(2)}$ split conformal prediction sets over 1--5 Likert scores
that provide theoretically guaranteed $\geq(1{-}\alpha)$ coverage,
with set width serving as a per-instance reliability indicator
($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges).
Critically, prediction set width shows consistent cross-judge agreement
($\bar{r} = 0.32$--$0.38$), demonstrating it captures
document-level difficulty rather than judge-specific noise.
Across four judges and four criteria, both diagnostics converge:
criterion matters more than judge,
with relevance judged most reliably
(avg. set size $\approx 3.0$) and coherence moderately so
(avg. set size $\approx 3.9$), while fluency and consistency
remain unreliable (avg. set size $\approx 4.9$).
We release all code, prompts, and cached results.
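Illustrative sketch (not the released code): the two diagnostics named in the abstract can be approximated in a few lines of Python. Everything below is an assumption made for illustration: the `beats` preference matrix, the function names, and in particular the absolute-residual nonconformity score $|y - \hat{y}|$ between a human Likert label and the judge's score, since the abstract does not specify which score the authors use.

```python
import math
from itertools import combinations


def count_3cycles(beats):
    """Count directed 3-cycles among the candidates for one document.

    beats[i][j] is True when the judge preferred candidate i over j
    (hypothetical input format, assumed for this sketch).
    """
    cycles = 0
    for i, j, k in combinations(range(len(beats)), 3):
        # A triple is intransitive if it forms a cycle in either direction.
        forward = beats[i][j] and beats[j][k] and beats[k][i]
        backward = beats[i][k] and beats[k][j] and beats[j][i]
        if forward or backward:
            cycles += 1
    return cycles


def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores.

    Uses the ceil((n + 1)(1 - alpha))-th smallest score; returns infinity
    when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(judge_score, qhat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within qhat of the judge's score (assumed score)."""
    return [y for y in labels if abs(y - judge_score) <= qhat]


# Toy calibration residuals |human - judge| (made-up numbers).
qhat = conformal_quantile([0, 1, 1, 2], alpha=0.5)
print(prediction_set(3, qhat))
```

Under these assumptions, a wider returned set signals that the judge's score for that instance is less trustworthy, which is the sense in which set width acts as a per-instance reliability indicator.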
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 29