Can we employ LLM to meta-evaluate LLM-based evaluators? A Preliminary Study

ACL ARR 2024 June Submission 460 Authors

11 Jun 2024 (modified: 09 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large language models (LLMs) are frequently employed to evaluate the instruction-following abilities of other LLMs. A number of recent works focus on the meta-evaluation of LLM-based evaluation, aiming to understand the efficacy of LLMs as evaluators. However, these studies are limited by the scope of existing benchmarks and the extensive human annotation effort they require. Since previous studies show that strong LLMs can effectively evaluate the instruction-following abilities of other LLMs, a natural question is whether we can use LLMs to meta-evaluate the evaluation abilities of other LLMs by treating LLM-based evaluation as a special case of instruction following. In this work, we investigate the potential of LLMs to conduct meta-evaluation and examine the extent to which model proficiency and model scale affect this meta-evaluation capacity. To this end, we introduce four frameworks within the paradigms of pairwise comparison (JDEval and MDEval) and individual scoring (JDEval-i and BSMEval). Through our experiments, we find that the pairwise comparison paradigm is more suitable for meta-evaluation than the individual scoring paradigm. JDEval and MDEval demonstrate strong performance on meta-evaluation tasks, showing high agreement with human annotations.
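To make the pairwise-comparison meta-evaluation paradigm concrete, below is a minimal Python sketch of the general idea the abstract describes: a meta-evaluator LLM is shown two evaluator verdicts for the same pair of responses, asked which verdict is better, and its choices are checked against human annotations for agreement. This is not the paper's JDEval/MDEval implementation; the `query_llm` helper, the prompt wording, and the data fields are illustrative assumptions.

```python
# Hedged sketch of pairwise-comparison meta-evaluation.
# Assumption: query_llm is a placeholder for whatever API call
# reaches the meta-evaluator LLM; it is not a real library function.

from dataclasses import dataclass


@dataclass
class MetaEvalExample:
    instruction: str        # instruction given to the evaluated LLMs
    response_a: str         # response from model A
    response_b: str         # response from model B
    verdict_1: str          # evaluator 1's judgment of A vs. B (with rationale)
    verdict_2: str          # evaluator 2's judgment of A vs. B (with rationale)
    human_preference: str   # "1" or "2": which verdict humans consider better


def query_llm(prompt: str) -> str:
    """Placeholder for an API call to the meta-evaluator LLM."""
    raise NotImplementedError


def meta_evaluate_pairwise(example: MetaEvalExample) -> str:
    """Ask the meta-evaluator which evaluator verdict is better ('1' or '2')."""
    prompt = (
        "You are given an instruction, two model responses, and two "
        "evaluations of those responses. Decide which evaluation is more "
        "accurate and better justified. Answer with '1' or '2' only.\n\n"
        f"Instruction:\n{example.instruction}\n\n"
        f"Response A:\n{example.response_a}\n\n"
        f"Response B:\n{example.response_b}\n\n"
        f"Evaluation 1:\n{example.verdict_1}\n\n"
        f"Evaluation 2:\n{example.verdict_2}\n"
    )
    return query_llm(prompt).strip()


def agreement_with_humans(examples: list[MetaEvalExample]) -> float:
    """Fraction of examples where the meta-evaluator matches the human label."""
    matches = sum(
        meta_evaluate_pairwise(ex) == ex.human_preference for ex in examples
    )
    return matches / len(examples)
```

Under the individual-scoring paradigm the same loop would instead ask the meta-evaluator to score each verdict in isolation, which (per the abstract) yields lower agreement with human annotations than pairwise comparison.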
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies; evaluation
Contribution Types: Surveys
Languages Studied: English
Submission Number: 460