TL;DR: Arena-Hard-Auto: An automatic LLM benchmark.
Abstract: The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark’s alignment with human preferences and its ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performance than MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
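To make the LLM-as-a-Judge step described above concrete, here is a minimal illustrative sketch of a pairwise judging call. This is not the authors' implementation (see the linked repository for the actual Arena-Hard-Auto code); it assumes the OpenAI chat completions API, and the judge instruction and the `judge_pair` helper are hypothetical.

```python
# Minimal sketch of LLM-as-a-Judge pairwise comparison (illustrative only;
# see the linked repository for the actual Arena-Hard-Auto implementation).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judge instruction, not the paper's exact template.
JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two assistant answers to the user "
    "prompt and reply with exactly one verdict: 'A', 'B', or 'tie'."
)

def judge_pair(user_prompt: str, answer_a: str, answer_b: str,
               model: str = "gpt-4o") -> str:
    """Ask a judge model which of two answers better addresses the prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"[User prompt]\n{user_prompt}\n\n"
                f"[Answer A]\n{answer_a}\n\n"
                f"[Answer B]\n{answer_b}"
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

In practice, verdicts over many prompt/model pairs are aggregated into a leaderboard; the sketch above only shows a single judging call.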
Lay Summary: This paper introduces BenchBuilder, an automated method for creating high-quality benchmarks for evaluating Large Language Models (LLMs) without costly human intervention. BenchBuilder takes crowdsourced queries and identifies challenging prompts that clearly differentiate model capabilities. The resulting benchmark, Arena-Hard-Auto, achieves greater accuracy in distinguishing model performance and closely matches human preference rankings, outperforming existing benchmarks such as MT-Bench at a significantly reduced cost. This method sets a new standard for scalable, reliable evaluation of advanced AI models.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/lmarena/arena-hard-auto
Primary Area: General Machine Learning->Evaluation
Keywords: llm, evaluations
Submission Number: 12667