Abstract: The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Today's standard practices face fundamental trade-offs: closed-ended question-based benchmarks (\eg MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (\eg Chatbot Arena) rely on costly and slow human judges.
Recently, automated methods (\eg LLM-as-a-judge) have improved scalability, but they risk bias by relying on one or a few ``authority'' models. To tackle these issues, we propose Decentralized Arena (\dearena), a fully automated framework that leverages the collective intelligence of all LLMs to evaluate one another. It mitigates single-model judge bias through democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for constructing new evaluation dimensions.
In extensive experiments across 66 LLMs, \dearena attains up to 97\% correlation with human judgements while significantly reducing evaluation cost.
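To make the coarse-to-fine insertion idea concrete, below is a minimal, hypothetical Python sketch (not the authors' implementation). It assumes a `judge_vote(model_a, model_b, jury)` helper, which is an assumption of this sketch, that aggregates pairwise preferences from a jury of already-ranked LLMs and returns a positive value when `model_a` is preferred. The coarse stage places a new model by binary search (logarithmic in the number of ranked models), and the fine stage settles the exact slot with a short local pass, keeping the total number of comparisons sub-quadratic.

```python
from typing import Callable, List

def insert_model(
    ranking: List[str],                              # models sorted best -> worst
    new_model: str,
    judge_vote: Callable[[str, str, List[str]], int],
) -> List[str]:
    """Insert new_model into an existing ranking using collective pairwise votes.

    Coarse stage: binary search for an approximate slot; the jury for each
    comparison is every already-ranked model except the current opponent.
    """
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        opponent = ranking[mid]
        jury = [m for m in ranking if m != opponent]
        if judge_vote(new_model, opponent, jury) > 0:  # new model preferred
            hi = mid
        else:
            lo = mid + 1

    # Fine stage: place the model, then re-compare it with immediate
    # neighbours so the final slot is settled by local pairwise votes.
    ranking = ranking[:lo] + [new_model] + ranking[lo:]
    i = lo
    while i > 0:
        a, b = ranking[i], ranking[i - 1]
        jury = [m for m in ranking if m not in (a, b)]
        if judge_vote(a, b, jury) <= 0:
            break
        ranking[i - 1], ranking[i] = a, b
        i -= 1
    return ranking
```

This is only an illustrative reading of "coarse-to-fine incremental insertion"; the paper's actual voting, tie-handling, and refinement window may differ.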
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 668