Abstract: The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Today's standard practices face fundamental trade-offs: closed-ended question-based benchmarks (\eg MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (\eg Chatbot Arena) rely on costly and slow human judges.
Recently, automated methods (\eg LLM-as-a-judge) have improved scalability, but they risk bias by relying on one or a few ``authority'' models. To tackle these issues, we propose Decentralized Arena (\dearena), a fully automated framework that leverages the collective intelligence of all LLMs to evaluate one another. It mitigates single-model judge bias through democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for constructing new evaluation dimensions.
In extensive experiments across 66 LLMs, \dearena attains up to 97\% correlation with human judgements while significantly reducing evaluation cost.
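To make the coarse-to-fine insertion idea concrete, below is a minimal, hypothetical Python sketch (not the authors' implementation). It assumes a `judge_vote(model_a, model_b, jury)` helper, which is an assumption of this sketch, that aggregates pairwise preferences from a jury of already-ranked LLMs and returns a positive value when `model_a` is preferred. The coarse stage places a new model by binary search (logarithmic in the number of ranked models), and the fine stage settles the exact slot with a short local pass, keeping the total number of comparisons sub-quadratic.

```python
from typing import Callable, List

def insert_model(
    ranking: List[str],                              # models sorted best -> worst
    new_model: str,
    judge_vote: Callable[[str, str, List[str]], int],
) -> List[str]:
    """Insert new_model into an existing ranking using collective pairwise votes.

    Coarse stage: binary search for an approximate slot; the jury for each
    comparison is every already-ranked model except the current opponent.
    """
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        opponent = ranking[mid]
        jury = [m for m in ranking if m != opponent]
        if judge_vote(new_model, opponent, jury) > 0:  # new model preferred
            hi = mid
        else:
            lo = mid + 1

    # Fine stage: place the model, then re-compare it with immediate
    # neighbours so the final slot is settled by local pairwise votes.
    ranking = ranking[:lo] + [new_model] + ranking[lo:]
    i = lo
    while i > 0:
        a, b = ranking[i], ranking[i - 1]
        jury = [m for m in ranking if m not in (a, b)]
        if judge_vote(a, b, jury) <= 0:
            break
        ranking[i - 1], ranking[i] = a, b
        i -= 1
    return ranking
```

This is only an illustrative reading of "coarse-to-fine incremental insertion"; the paper's actual voting, tie-handling, and refinement window may differ.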
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 668