SAGE: LLM-Based Evaluation Through Selective Aggregation for Free-Form Question-Answering

ACL ARR 2025 May Submission 3880 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Evaluating the free-form responses generated by Large Language Models (LLMs) remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or accommodate the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive at scale. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. To harness these strengths efficiently, we propose Selective Aggregation for Generative Evaluation (SAGE), which employs two primary LLMs as judges and engages a third judge only in cases of disagreement. This selective aggregation prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. SAGE incorporates task-specific reference answers to improve judgment accuracy, leading to substantial gains in evaluation metrics such as Macro F1 and Cohen’s Kappa. Through experiments, including human evaluation, we demonstrate SAGE’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
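The selective-aggregation protocol described in the abstract can be sketched in a few lines: two primary judges are always queried, and a third judge is consulted only when they disagree. The sketch below is an illustrative assumption, not the authors' implementation; the binary correct/incorrect verdict space, the judge call signature, and the placeholder judges are hypothetical, and the paper's actual prompts, rubric, and use of task-specific reference answers are not specified here.

```python
from typing import Callable, Literal

# Hypothetical verdict space and judge interface (assumed for illustration):
# a judge maps (question, reference_answer, candidate_answer) to a verdict.
Verdict = Literal["correct", "incorrect"]
Judge = Callable[[str, str, str], Verdict]


def sage_verdict(question: str, reference: str, candidate: str,
                 judge_a: Judge, judge_b: Judge, judge_c: Judge) -> Verdict:
    """Selective aggregation: query the two primary judges; call the
    third (tie-breaking) judge only if the first two disagree."""
    v_a = judge_a(question, reference, candidate)
    v_b = judge_b(question, reference, candidate)
    if v_a == v_b:
        # Agreement between the primary judges: no extra LLM call is made.
        return v_a
    # Disagreement: the third judge's verdict decides.
    return judge_c(question, reference, candidate)


if __name__ == "__main__":
    # Placeholder judges standing in for LLM calls (purely illustrative).
    always_correct: Judge = lambda q, r, c: "correct"
    always_incorrect: Judge = lambda q, r, c: "incorrect"

    # Primary judges disagree here, so the tie-breaker is invoked.
    print(sage_verdict("Q", "reference answer", "candidate answer",
                       always_correct, always_incorrect, always_correct))
```

Compared with three-way majority voting, this scheme issues the third LLM call only on the fraction of items where the primary judges disagree, which is the source of the compute savings the abstract claims.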
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, automatic creation and evaluation of language resources, human evaluation, automatic evaluation, evaluation and metrics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings - efficiency
Languages Studied: English
Submission Number: 3880