Co-Eval: Augmenting LLM-based Evaluation with Machine Metrics

ACL ARR 2025 February Submission 5013 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models are increasingly used as evaluators in natural language generation tasks, offering scalability and interpretability advantages over traditional evaluation methods. However, current LLM-based evaluations often suffer from biases and misalignment, particularly in domain-specific tasks, due to limited functional understanding and knowledge gaps. To address these challenges, we introduce the Co-Eval framework, which employs a criteria planner model and an optimized machine metric to improve the scalability and fairness of LLM-based evaluation. Experimental results on both general and domain-specific tasks show that Co-Eval reduces biases across LLMs by up to 0.4903 in self-preference bias and improves alignment with human preferences by up to 0.324 in Spearman correlation.
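The alignment gains reported in the abstract are measured with Spearman rank correlation between evaluator scores and human preference scores. Below is a minimal sketch of that measurement, assuming per-sample scores from an LLM judge and from human annotators are available as parallel lists; the variable names and example values are illustrative, not taken from the paper.

```python
# Minimal sketch: Spearman rank correlation between LLM-judge scores
# and human preference scores. The score lists below are hypothetical.
from scipy.stats import spearmanr

llm_judge_scores = [4.0, 3.5, 2.0, 5.0, 1.5]   # hypothetical judge ratings per sample
human_scores     = [4.5, 3.0, 2.5, 5.0, 1.0]   # hypothetical human ratings per sample

# spearmanr returns the rank-correlation coefficient and its p-value.
rho, pvalue = spearmanr(llm_judge_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3f})")
```

A higher rho indicates closer agreement between the automatic evaluator and human judgments, which is the quantity the paper reports improving by up to 0.324.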
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, evaluation, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 5013