Bench-O-Matic: Automating Benchmark Curation from Crowdsourced Data

Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica

Bench-O-Matic: Automating Benchmark Curation from Crowdsourced Data

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

28 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: LLM, Evaluation

TL;DR: Scalable curation of high-quality, automated benchmarks from extensive data without human in the loop.

Abstract: The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce Bench-O-Matic, an automated pipeline that leverages LLMs to curate high-quality, open- ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without human in the loop. We apply Bench-O-Matic to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark’s alignment with human preferences and ability to separate models. We release Eval-O-Matic, a benchmark consisting 500 challenging prompts curated by Bench-O-Matic. Eval-O-Matic provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 13585

Loading