Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung; Faeze Brahman; Yejin Choi

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

Published: 22 Jan 2025, Last Modified: 24 Feb 2025ICLR 2025 OralEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Model, LLM, LLM Judge, Evaluation, Alignment

TL;DR: We propose Cascaded Selective Evaluation, an LLM-as-Judge framework that dynamically selects when to trust different judge models to reduce evaluation overhead, while providing a provable guarantee of human-judge agreement.

Abstract: We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust its judgement. We then show that under this *selective evaluation* framework, human agreement can be provably guaranteed---such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce *Simulated Annotators*, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose *Cascaded Selective Evaluation*, where we use cheaper models as initial judges and escalate to stronger models only when necessary---again, while still providing a provable guarantee of human agreement. Experimental results show that Cascaded Selective Evaluation guarantees strong alignment with humans, far beyond what LLM judges could achieve without selective evaluation. For example, on a subset of Chatbot Arena where GPT-4 almost never achieves 80% human agreement, our method, even while employing substantially cost-effective models such as Mistral-7B, *guarantees* over 80% human agreement with almost 80% test coverage.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2430

Loading