Towards Scalable Oversight: Meta-Evaluation of LLMs as Evaluators via Agent Debate

Published: 20 Dec 2024, Last Modified: 01 Mar 2025
Venue: AI4Research @ AAAI 2025 (Oral)
License: CC BY 4.0
Keywords: meta-evaluation, multi-agent debate, human annotation
TL;DR: We propose ScaleEval, an agent-debate-assisted meta-evaluation framework to assist human annotators in discerning the capabilities and limitations of LLMs as evaluators.
Abstract: Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts remains challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, existing meta-evaluation methods for assessing the effectiveness of LLMs as evaluators are typically constrained by the coverage of existing benchmarks or require extensive human annotation. This underscores the need for scalable meta-evaluation methods that can effectively, reliably, and efficiently assess the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. The framework supports multi-round discussions to assist human annotators in discerning the capabilities and limitations of LLMs as evaluators, significantly reducing their workload in cases that previously required extensive supervision and large-scale annotation during meta-evaluation. We release the code for our framework, which is publicly available at: \url{https://github.com/GAIR-NLP/scaleeval}.
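To make the debate-based meta-evaluation idea concrete, below is a minimal Python sketch of a multi-round consensus loop of the kind the abstract describes. The function names (`debate_meta_eval`, `query_llm`), the prompt wording, and the consensus/escalation logic are illustrative assumptions, not the actual ScaleEval API; `query_llm` is a placeholder to be wired to any chat-completion backend.

```python
# Hypothetical sketch of an agent-debate meta-evaluation loop (not the ScaleEval API).
from typing import Callable, List

def query_llm(agent_name: str, prompt: str) -> str:
    """Placeholder: replace with a call to any chat-completion backend."""
    raise NotImplementedError

def debate_meta_eval(
    criteria: str,
    response_a: str,
    response_b: str,
    agent_names: List[str],
    ask: Callable[[str, str], str] = query_llm,
    max_rounds: int = 3,
) -> str:
    """Multi-round debate among evaluator agents over two candidate responses.

    Returns 'A' or 'B' once all agents agree, or 'HUMAN' to flag the
    instance for human annotation when no consensus is reached.
    """
    transcript: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        votes: List[str] = []
        for agent in agent_names:
            prompt = (
                f"You are evaluator agent {agent}.\n"
                f"Criteria: {criteria}\n"
                f"Response A: {response_a}\n"
                f"Response B: {response_b}\n"
                f"Debate so far:\n" + "\n".join(transcript) + "\n"
                "Argue briefly, then end with a line 'VERDICT: A' or 'VERDICT: B'."
            )
            reply = ask(agent, prompt)
            transcript.append(f"[round {round_idx}] {agent}: {reply}")
            # Crude verdict parsing; a real system would request structured output.
            votes.append("A" if reply.rstrip().upper().endswith("A") else "B")
        if len(set(votes)) == 1:  # unanimous agreement ends the debate early
            return votes[0]
    return "HUMAN"  # no consensus within the round budget: escalate to a human

```

In this sketch, cases where the agents fail to converge within the round budget are escalated to a human annotator, mirroring how an agent-debate framework can reserve human effort for contested instances.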
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 38