Keywords: Evaluation, benchmarks, LLM
Abstract: The development of Large Language Models (LLMs) is advancing at a fast pace, and choosing the right benchmarks has become central to understanding and characterizing real progress. The community now faces an abundance of benchmarks, yet it often lacks a systematic way to tell which benchmark requires more advanced skills, which provides cleaner separation between models, and which offers sufficient topical and linguistic coverage for a developer's use case. This paper proposes a principled and quantitative answer. We introduce three metrics of benchmark quality: hardness, separability, and diversity, each with an explicit mathematical definition suitable for automated evaluation pipelines. We instantiate the framework across math, coding, knowledge, instruction-following, and agentic evaluation suites. We will also release the raw evaluation data to facilitate further studies. Together, these metrics and data enable systematic comparison and selection of the right benchmarks for model developers.
Submission Number: 166