Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

ACL ARR 2024 August Submission 271 Authors

15 Aug 2024 (modified: 09 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: The rapid advancements in Large Language Models (LLMs) have highlighted the critical need for robust evaluation methods that can accurately assess the quality of generated text, particularly in open-ended tasks. Traditional metrics like BLEU and ROUGE, while useful, often fail to capture the semantic richness and contextual relevance of free-form outputs. In this study, we introduce a reference-guided verdict method that leverages multiple LLMs-as-judges to provide a more reliable and accurate evaluation of free-form outputs. By integrating diverse LLMs, our approach mitigates individual model biases and significantly improves alignment with human judgments, especially in challenging tasks where traditional metrics and single-model evaluations fall short. Through experiments across multiple QA tasks, we demonstrate that our method closely aligns with human evaluations, establishing it as a scalable, reproducible, and effective alternative to human evaluation. Our approach not only enhances evaluation reliability but also opens new avenues for refining automated assessment in NLP, emphasizing the importance of model diversity and task complexity.
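The abstract describes a reference-guided, multi-judge evaluation scheme. Below is a minimal sketch of one plausible reading of that setup: each judge LLM compares a candidate answer against the gold reference and returns a verdict, and the verdicts are aggregated by majority vote. The judge prompt, the `call_llm` helper, and the majority-vote aggregation are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a reference-guided verdict with multiple LLM judges.
# `call_llm(model, prompt) -> str` is a hypothetical helper for any LLM API.
from collections import Counter
from typing import Callable, List

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate answer convey the same meaning as the reference? "
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def judge_once(call_llm: Callable[[str, str], str], model: str,
               question: str, reference: str, candidate: str) -> str:
    """Ask a single judge model for a verdict on one candidate answer."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference,
                                 candidate=candidate)
    reply = call_llm(model, prompt).strip().upper()
    return "CORRECT" if reply.startswith("CORRECT") else "INCORRECT"


def reference_guided_verdict(call_llm: Callable[[str, str], str],
                             judge_models: List[str], question: str,
                             reference: str, candidate: str) -> str:
    """Aggregate verdicts from several diverse judge LLMs by majority vote."""
    verdicts = [judge_once(call_llm, m, question, reference, candidate)
                for m in judge_models]
    return Counter(verdicts).most_common(1)[0][0]
```

Under this reading, using several heterogeneous judges (rather than a single model) is what mitigates individual judge bias, and alignment with human annotators can then be measured as agreement between the aggregated verdicts and human labels on the same QA items.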
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, NLP Applications, Resources and Evaluation, Dialogue and Interactive Systems
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 271