Benchmarking Open LLMs for Automated Evaluation: Towards Reliable and Accessible Model Assessment

ACL ARR 2025 February Submission4963 Authors

16 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | License: CC BY 4.0
Abstract: Large language models (LLMs) have become prevalent in natural language processing, and researchers increasingly use them as automated evaluators through the LLM-as-a-judge paradigm. However, current implementations rely primarily on proprietary models, raising concerns about accessibility, cost, and data privacy. Additionally, existing LLM judges exhibit various biases that can compromise evaluation quality. We systematically investigate whether general-purpose open LLMs, without specific fine-tuning for evaluation tasks, can serve as reliable alternatives to proprietary models. We conduct comprehensive assessments across established benchmarks and analyze the models' susceptibility to different biases. Our findings demonstrate that certain open models can match or exceed the performance of proprietary alternatives, and we provide a systematic methodology for selecting appropriate open-source evaluators while maintaining high standards of assessment quality.
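For readers unfamiliar with the paradigm, the sketch below illustrates one common pairwise LLM-as-a-judge setup using a general-purpose open model. The model name, prompt template, and `judge_pair` helper are illustrative assumptions for this sketch, not the configuration evaluated in the paper.

```python
# Minimal sketch of pairwise LLM-as-a-judge with an open model via Hugging Face
# transformers. Model name and prompt are illustrative assumptions only.
from transformers import pipeline

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two candidate answers, "
    "decide which answer is better. Reply with 'A', 'B', or 'tie'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\nVerdict:"
)

def judge_pair(question: str, answer_a: str, answer_b: str,
               model_name: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Ask a general-purpose open LLM to pick the better of two answers."""
    generator = pipeline("text-generation", model=model_name)
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    # The verdict is whatever the model appends after the prompt.
    return output[len(prompt):].strip()
```

In practice, judgments of this kind are typically run twice with the answer order swapped to control for position bias, one of the biases the abstract alludes to.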
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 4963