VARCO Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models

ACL ARR 2025 February Submission 2434 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Most existing benchmarking approaches for evaluating the output quality of large language models (LLMs) rely on comparing LLM responses to predefined references. Such methods, based on static datasets, quickly become outdated as LLM capabilities and use cases evolve. In this work, we introduce VARCO Arena—a novel, cost-effective, and robust benchmarking approach that leverages a single-elimination tournament structure to minimize the number of required comparisons while eliminating the need for static references or costly human annotations. We validate our approach through two experiments: (i) a simulation study that examines its robustness under various conditions, and (ii) an empirical evaluation using publicly available benchmark prompts. In both experiments, VARCO Arena consistently outperforms current LLM benchmarking practices, achieving stronger correlations with human-established Elo ratings. Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation across diverse, customized use cases.
Paper Type: Long
Research Area: Generation
Research Area Keywords: Resources and Evaluation, Benchmarking, evaluation methodologies, evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2434
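
The abstract states that a single-elimination tournament keeps the number of required comparisons low. The sketch below illustrates only that counting argument with a generic bracket. It is not the authors' implementation: the `beats` judge, the `strength` scores, and the function names are hypothetical stand-ins for whatever pairwise judge a real setup would use.

```python
import random


def single_elimination(players, beats, rng):
    """Run one generic bracket; return (winner, number_of_matches).

    `beats(a, b)` is a hypothetical pairwise judge that returns True
    when `a` is preferred over `b`.
    """
    bracket = list(players)
    rng.shuffle(bracket)  # random seeding of the bracket
    matches = 0
    while len(bracket) > 1:
        next_round = []
        # Pair adjacent entrants; with an odd count the last one gets a bye.
        for i in range(0, len(bracket) - 1, 2):
            a, b = bracket[i], bracket[i + 1]
            next_round.append(a if beats(a, b) else b)
            matches += 1
        if len(bracket) % 2 == 1:
            next_round.append(bracket[-1])
        bracket = next_round
    return bracket[0], matches


def round_robin_matches(n):
    """Number of comparisons if every pair of n entrants is compared once."""
    return n * (n - 1) // 2


# Assumed toy data: eight models with made-up strength scores.
strength = {f"model_{i}": i for i in range(8)}


def judge(a, b):
    # Deterministic stand-in for a pairwise judge.
    return strength[a] > strength[b]


winner, n_matches = single_elimination(strength, judge, random.Random(0))
```

Every match eliminates exactly one entrant, so a bracket over n entrants needs n - 1 comparisons (7 for the eight toy models here), whereas comparing all pairs needs n(n - 1)/2 (28 for eight). With a noisy judge the bracket winner is no longer guaranteed to be the strongest entrant, which is the kind of condition the abstract's simulation study is described as examining.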