Keywords: Evaluation, benchmarks, LLM
TL;DR: We propose three quantitative metrics for benchmark quality (hardness, separability, and diversity) and a difficulty-aware leaderboard index, enabling systematic comparison and selection of LLM benchmarks.
Abstract: The development of Large Language Models (LLMs) is advancing rapidly, and choosing the right benchmarks has become central to understanding and characterizing real progress. The community now faces an abundance of benchmarks, yet often lacks a systematic way to tell which benchmark is harder, which provides cleaner separations between models, or which offers sufficient topical and linguistic coverage for a developer’s use case. This paper proposes a principled and quantitative answer. We introduce three metrics for benchmark quality: hardness, separability, and diversity, each with an explicit mathematical definition suitable for automated evaluation pipelines. We further derive a difficulty-aware leaderboard index that rewards solving genuinely hard items. We instantiate the framework across math, coding, knowledge, instruction-following, and agentic evaluation suites. Together, these metrics enable systematic comparison and selection of the right benchmarks for model developers.
Submission Number: 166