Keywords: Large language model, LLM-as-judge, LLM-as-jury, Context-aware evaluation
TL;DR: Adaptive, learning-based juries for building scalable, more reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains
Abstract: As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand, a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a two-tiered adaptation: first selecting an optimal jury per dataset, then assigning dynamic, instance-specific weights to its members. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains.
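A minimal sketch of the two-tiered adaptation described in the abstract, assuming scikit-learn-style reliability predictors. All names, feature choices, thresholds, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the two-tiered "jury-on-demand" idea: train per-judge
# reliability predictors, select a jury per dataset, then weight jury members
# per instance. Everything below is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: X holds per-instance input features (e.g., embeddings,
# token statistics, structural features); judge_scores_eval holds raw scores
# from each candidate judge; agree_train records whether each judge matched
# the human label on training instances.
n_train, n_eval, n_judges, d = 500, 200, 4, 16
X_train, X_eval = rng.normal(size=(n_train, d)), rng.normal(size=(n_eval, d))
judge_scores_eval = rng.uniform(size=(n_eval, n_judges))
agree_train = rng.integers(0, 2, size=(n_train, n_judges))

# Step 1: one reliability predictor per judge, estimating
# P(judge agrees with human | instance features).
predictors = [
    LogisticRegression(max_iter=1000).fit(X_train, agree_train[:, j])
    for j in range(n_judges)
]

# Tier 1: select an optimal jury for the dataset, keeping judges whose mean
# predicted reliability clears a (hypothetical) threshold.
reliability = np.column_stack(
    [clf.predict_proba(X_eval)[:, 1] for clf in predictors]
)  # shape: (n_eval, n_judges)
jury = np.where(reliability.mean(axis=0) >= 0.5)[0]
if jury.size == 0:  # fall back to the single most reliable judge
    jury = np.array([reliability.mean(axis=0).argmax()])

# Tier 2: assign dynamic, instance-specific weights to jury members by
# normalizing their predicted reliabilities, then aggregate their scores.
w = reliability[:, jury]
w = w / w.sum(axis=1, keepdims=True)
final_scores = (w * judge_scores_eval[:, jury]).sum(axis=1)
print("jury:", jury, "first 5 aggregated scores:", np.round(final_scores[:5], 3))
```

In this sketch the jury composition adapts per dataset and the weights adapt per instance; the actual predictor architecture, features, and aggregation rule would follow the paper.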
Primary Area: generative models
Submission Number: 12346