Keywords: Evaluation, benchmarks, LLM
TL;DR: We propose three quantitative metrics for benchmark quality (hardness, separability, and diversity) and a difficulty-aware leaderboard index, enabling systematic comparison and selection of LLM benchmarks.
Abstract: The development of Large Language Models (LLMs) is advancing rapidly, and choosing the right benchmarks has become central to understanding and characterizing real progress. The community now faces an abundance of benchmarks, yet often lacks a systematic way to tell which benchmark is harder, which provides cleaner separations between models, or which offers sufficient topical and linguistic coverage for a developer’s use case. This paper proposes a principled and quantitative answer. We introduce three metrics for benchmark quality: hardness, separability, and diversity, each with an explicit mathematical definition suitable for automated evaluation pipelines. We further derive a difficulty-aware leaderboard index that rewards solving genuinely hard items. We instantiate the framework across math, coding, knowledge, instruction-following, and agentic evaluation suites. Together, these metrics enable systematic comparison and selection of the right benchmarks for model developers.
Submission Number: 166