Abstract: Evaluating the performance of large language models (LLMs) across diverse domains remains a significant challenge due to the limitations of traditional evaluation metrics and the high cost of manual annotation. This paper introduces the Reference-based LLM-as-Evaluator (Ref-Eval) framework, which leverages the strengths of LLMs in text comprehension and instruction-following to assess model responses. Ref-Eval employs a multi-round dialogic evaluation process: it condenses extensive external references into distinct knowledge units, clusters them for efficient evaluation, and iteratively refines questions based on model responses. Experimental results on multiple domain-specific text datasets demonstrate that Ref-Eval achieves high consistency with human evaluation while reducing computational cost and improving evaluation accuracy. This approach not only addresses the limitations of existing LLM evaluation methods but also provides a scalable and efficient way to assess model performance in knowledge-intensive tasks.
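The abstract outlines a three-stage loop (condense references into knowledge units, cluster them, then question the target model iteratively). The sketch below is a minimal, hypothetical illustration of that loop, not the paper's implementation: the `llm` callables, prompt wording, round count, and the naive round-robin stand-in for clustering are all assumptions made for illustration.

```python
# Hypothetical sketch of a Ref-Eval-style multi-round dialogic evaluation loop.
# Prompts, cluster count, and the round-robin "clustering" are illustrative only.
from typing import Callable, List

def condense_to_knowledge_units(reference: str, evaluator: Callable[[str], str]) -> List[str]:
    """Ask the evaluator LLM to split a long reference into atomic knowledge units."""
    reply = evaluator(f"List the distinct factual points in this text, one per line:\n{reference}")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def cluster_units(units: List[str], n_clusters: int = 3) -> List[List[str]]:
    """Group knowledge units so each evaluation round covers one coherent subset.
    (Naive round-robin grouping stands in for a real embedding-based clustering.)"""
    clusters: List[List[str]] = [[] for _ in range(max(1, min(n_clusters, len(units))))]
    for i, unit in enumerate(units):
        clusters[i % len(clusters)].append(unit)
    return clusters

def dialogic_evaluation(reference: str,
                        target_model: Callable[[str], str],
                        evaluator: Callable[[str], str],
                        rounds: int = 2) -> List[str]:
    """Question the target model on each knowledge cluster, judging each answer
    and refining the next question from the previous response."""
    verdicts: List[str] = []
    for cluster in cluster_units(condense_to_knowledge_units(reference, evaluator)):
        question = evaluator("Write one question testing these facts:\n" + "\n".join(cluster))
        for _ in range(rounds):
            answer = target_model(question)
            verdicts.append(evaluator(
                f"Facts: {cluster}\nQuestion: {question}\nAnswer: {answer}\n"
                "Judge the answer as correct / partially correct / wrong."))
            # Refine the next question in light of the answer just given.
            question = evaluator(
                f"The model answered: {answer}\n"
                f"Write a follow-up question probing facts not yet verified from: {cluster}")
    return verdicts

if __name__ == "__main__":
    # Stub LLMs so the sketch runs without any API; replace with real model calls.
    echo = lambda prompt: prompt[:80]
    print(dialogic_evaluation("Fact A. Fact B. Fact C.", echo, echo)[:1])
```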
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation, automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 189