Abstract: Large language models (LLMs) have become prevalent in natural language processing, and researchers increasingly use them as automated evaluators through the LLM-as-a-judge paradigm. However, current implementations primarily rely on proprietary models, raising concerns about accessibility, cost, and data privacy. Moreover, existing LLM judges exhibit various biases that can compromise evaluation quality. We systematically investigate whether general-purpose open LLMs, without specific fine-tuning for evaluation tasks, can serve as reliable alternatives to proprietary models. We comprehensively assess these models on established benchmarks and analyze their susceptibility to different biases. Our findings show that certain open models can match or exceed the performance of proprietary alternatives, and we provide a systematic methodology for selecting open-source evaluators without sacrificing assessment quality.
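For readers unfamiliar with the LLM-as-a-judge paradigm referenced in the abstract, the following is a minimal sketch of pairwise judging with a general-purpose open model. The model ID, prompt template, and helper function are illustrative assumptions, not the evaluation setup used in the paper.

```python
# Minimal LLM-as-a-judge sketch with an open instruction-tuned model.
# Model ID and prompt wording are hypothetical examples, not the paper's setup.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # any open instruction-tuned model
    device_map="auto",
)

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two responses to the question "
    "and answer with 'A' or 'B' for the better response, or 'Tie'.\n\n"
    "Question: {question}\n\nResponse A: {answer_a}\n\nResponse B: {answer_b}\n\n"
    "Verdict:"
)

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return the raw verdict string produced by the open judge model."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    out = judge(prompt, max_new_tokens=8, do_sample=False)
    # The pipeline returns prompt + continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

if __name__ == "__main__":
    print(judge_pair(
        "What is the capital of France?",
        "Paris is the capital of France.",
        "The capital of France is Lyon.",
    ))
```

In practice, judgments are typically run twice with the response order swapped and the verdicts aggregated, a standard mitigation for the position bias that the abstract alludes to.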
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 4963