Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
Abstract: As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation.
Current benchmarks typically compare system outputs against reference outputs from an anchor model.
This anchor-mediated approach, though convenient, yields lower reliability than direct comparison between systems.
We propose Arena-lite, which combines direct head-to-head comparison of outputs from competing systems with a tournament structure, eliminating the need for anchor outputs, reducing the number of required comparisons, and achieving higher reliability in system rankings.
We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge.
These experiments collectively demonstrate that Arena-lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges.
We release an easy-to-use web demonstration and code to foster adoption of Arena-lite, streamlining model selection across research and industry communities.
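To make the tournament idea concrete, the sketch below (not the released Arena-lite code) ranks systems by running a single-elimination bracket of direct head-to-head judge comparisons per prompt and summing bracket wins; the judge signature, the function names `single_elimination_round_wins` and `rank_systems`, and the win-count aggregation are illustrative assumptions, not the paper's exact procedure.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Assumed judge signature (for illustration): given a prompt and two candidate
# outputs, return 0 if output_a is preferred, 1 if output_b is preferred.
# A real judge would wrap an LLM call.
Judge = Callable[[str, str, str], int]


def single_elimination_round_wins(
    prompt: str,
    outputs: dict[str, str],  # system name -> that system's output for this prompt
    judge: Judge,
    rng: random.Random,
) -> Counter:
    """Run one single-elimination bracket over all systems for a single prompt.

    Every match is a direct head-to-head comparison of two system outputs;
    a system's score is the number of matches it wins (deeper bracket runs
    score higher). No anchor/baseline outputs are involved.
    """
    wins = Counter({name: 0 for name in outputs})
    bracket = list(outputs)
    rng.shuffle(bracket)  # random seeding for this prompt's bracket
    while len(bracket) > 1:
        next_round = []
        if len(bracket) % 2 == 1:
            next_round.append(bracket.pop())  # odd bracket size: last system gets a bye
        for a, b in zip(bracket[0::2], bracket[1::2]):
            winner = a if judge(prompt, outputs[a], outputs[b]) == 0 else b
            wins[winner] += 1
            next_round.append(winner)
        bracket = next_round
    return wins


def rank_systems(
    prompts: Sequence[str],
    outputs_per_prompt: Sequence[dict[str, str]],
    judge: Judge,
    seed: int = 0,
) -> list[tuple[str, int]]:
    """Aggregate bracket wins across prompts and return systems ranked by total wins."""
    rng = random.Random(seed)
    total: Counter = Counter()
    for prompt, outputs in zip(prompts, outputs_per_prompt):
        total.update(single_elimination_round_wins(prompt, outputs, judge, rng))
    return total.most_common()


if __name__ == "__main__":
    # Toy judge that prefers the longer output (stand-in for an LLM judge).
    toy_judge: Judge = lambda prompt, a, b: 0 if len(a) >= len(b) else 1
    prompts = ["Summarize the report.", "Explain overfitting."]
    outputs = [
        {"sys_A": "short", "sys_B": "a bit longer", "sys_C": "the longest answer here"},
        {"sys_A": "terse", "sys_B": "medium length reply", "sys_C": "a more detailed reply"},
    ]
    print(rank_systems(prompts, outputs, toy_judge))
```

In this sketch each bracket costs n - 1 judge calls for n systems per prompt, versus n calls when every system is compared against an anchor output, which illustrates where the reduction in required comparisons can come from.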
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, NLP Applications, Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: Resources and Evaluation, NLP Applications, Efficient/Low-Resource Methods for NLP
Submission Number: 6634