Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

ACL ARR 2024 June Submission 1601 Authors

14 Jun 2024 (modified: 17 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate them. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is itself a challenge. To address this, many evaluations now rely on LLMs themselves to judge the quality of outputs from other LLMs, typically using a single large model such as GPT-4. While this method has grown in popularity, it is costly, introduces intra-model bias, and, as we show in this work, the largest models are often unnecessary. Instead, we propose evaluation with a Panel of LLM evaluators (PoLL) composed of a larger number of smaller models. Across three distinct judge settings and six different datasets, we find that PoLL outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and does so while being over seven times less expensive.
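To make the panel-of-judges idea concrete, below is a minimal sketch of how verdicts from several small judge models could be aggregated, assuming a simple majority vote over binary correctness judgments. The judge prompt, the `Judge` callable interface, and the stub models are illustrative assumptions, not the paper's exact panel composition or scoring scheme.

```python
from collections import Counter
from typing import Callable, Sequence

# A "judge" is any callable mapping an evaluation prompt to a reply string.
# In practice each judge would wrap an API call to a small model from a
# different model family (the stubs below are placeholders).
Judge = Callable[[str], str]

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct? Reply with exactly 'yes' or 'no'."
)


def poll_verdict(
    judges: Sequence[Judge],
    question: str,
    reference: str,
    candidate: str,
) -> bool:
    """Aggregate independent judge verdicts with a simple majority vote."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    votes = []
    for judge in judges:
        reply = judge(prompt).strip().lower()
        votes.append(reply.startswith("yes"))
    # Majority vote across the panel; ties are broken toward "incorrect".
    tally = Counter(votes)
    return tally[True] > tally[False]


if __name__ == "__main__":
    # Stub judges standing in for API calls to three smaller models
    # from disjoint families (hypothetical, for demonstration only).
    def make_stub(answer: str) -> Judge:
        return lambda prompt: answer

    panel = [make_stub("yes"), make_stub("yes"), make_stub("no")]
    print(poll_verdict(panel, "2+2?", "4", "4"))  # True
```

Because each judge issues its verdict independently and the final decision is pooled, no single model family dominates the evaluation, which is the intuition behind the reduced intra-model bias claimed in the abstract.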
Paper Type: Short
Research Area: Generation
Research Area Keywords: automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1601