DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

ACL ARR 2024 December Submission 1335 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Evaluating the free-form responses generated by Large Language Models (LLMs) remains a challenge due to their diverse and open-ended nature. Traditional automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative because of their strong language understanding and instruction-following capabilities. Building on these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLMs as judges and engages a third arbitrator only in cases of disagreement. This selective arbitration mechanism prioritizes evaluation reliability while reducing unnecessary computational demands. DAFE combines task-specific reference answers with dynamic arbitration to enhance judgment accuracy, yielding significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
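To make the selective arbitration mechanism concrete, the following minimal Python sketch shows how two primary judges can score a response against a task-specific reference and a third arbitrator is consulted only when they disagree. The judge interface, label set, and function names here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of DAFE-style selective arbitration.
# NOTE: the Judge signature, the "correct"/"incorrect" labels, and all names
# below are assumptions for illustration, not the authors' actual code.
from typing import Callable

# A judge maps (question, reference answer, model response) to a verdict label.
Judge = Callable[[str, str, str], str]

def dafe_verdict(question: str, reference: str, response: str,
                 judge_a: Judge, judge_b: Judge, arbitrator: Judge) -> str:
    """Two primary judges evaluate the response against the reference;
    the arbitrator is invoked only when their verdicts disagree."""
    verdict_a = judge_a(question, reference, response)
    verdict_b = judge_b(question, reference, response)
    if verdict_a == verdict_b:
        # Agreement between the primary judges: no arbitration cost incurred.
        return verdict_a
    # Disagreement: a third judge breaks the tie.
    return arbitrator(question, reference, response)

# Example usage with stub judges standing in for LLM calls:
if __name__ == "__main__":
    always_correct: Judge = lambda q, ref, resp: "correct"
    always_incorrect: Judge = lambda q, ref, resp: "incorrect"
    print(dafe_verdict("Who wrote Hamlet?", "William Shakespeare", "Shakespeare",
                       always_correct, always_incorrect, always_correct))
    # -> "correct" (arbitrator resolved the disagreement)
```

The design point this sketch captures is that the third (typically stronger or more expensive) model is queried only for the subset of items where the two primary judges diverge, which is how the framework trades a small amount of extra computation for higher agreement with human labels.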
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, automatic creation and evaluation of language resources, human evaluation, automatic evaluation, evaluation and metrics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 1335