Keywords: LLM Benchmark Evaluation, Capability Alignment, Benchmark Quality Metrics, Difficulty–Capability Interaction, Discriminability and Saturation
TL;DR: Benchmark quality metrics are not intrinsic dataset properties; they depend on the alignment between item difficulty and the capability distribution of evaluated models, a principle we formalize and operationalize with a new diagnostic score (CAS).
Abstract: Benchmark quality metrics such as discriminability and saturation are typically reported as stable properties of datasets. We argue they are not: these metrics are computed on specific model populations and vary substantially across them. This population-dependence is rarely acknowledged in benchmark reports or leaderboards, yet it is a fundamental source of variation in how benchmark quality should be interpreted.
We formalize this position as the Capability Alignment Hypothesis: benchmark informativeness depends on the alignment between item difficulty and the capability distribution of evaluated models. Empirically, we show that discriminability follows an inverted-U relationship with difficulty, where items that are too easy or too hard for a given population yield weak discrimination. We introduce the Capability Alignment Score (CAS), combining difficulty alignment and ability-consistent discrimination, as a complementary diagnostic signal alongside existing metrics. Experiments across math and reasoning benchmarks confirm that CAS captures alignment-related structure not fully reflected in current measures.
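The abstract describes CAS only at a high level, so the following minimal sketch is an illustration under stated assumptions, not the paper's definition. It shows how per-item difficulty, ability-consistent discrimination, and an alignment-weighted aggregate could be computed from a binary models-by-items correctness matrix; the function name `alignment_diagnostics`, the alignment formula, and the final aggregation are hypothetical stand-ins for the paper's actual CAS.

```python
# Minimal sketch (NOT the paper's exact CAS formulation): computes item
# difficulty, ability-consistent discrimination, and a hypothetical
# alignment-weighted score from a binary models x items matrix.
import numpy as np

def alignment_diagnostics(correct: np.ndarray):
    """correct: (n_models, n_items) array of 0/1 outcomes."""
    ability = correct.mean(axis=1)   # per-model accuracy (capability proxy)
    p = correct.mean(axis=0)         # per-item pass rate (1 - difficulty)

    # Ability-consistent discrimination: point-biserial correlation between
    # item outcomes and model ability; saturated items (all 0 or all 1) get 0.
    disc = np.zeros(correct.shape[1])
    for j in range(correct.shape[1]):
        col = correct[:, j]
        if col.std() > 0 and ability.std() > 0:
            disc[j] = np.corrcoef(col, ability)[0, 1]

    # Difficulty alignment (assumed form): items near a 50% pass rate for this
    # population score highest; too easy or too hard items score near 0,
    # mirroring the inverted-U relationship described in the abstract.
    align = 1.0 - 2.0 * np.abs(p - 0.5)

    # Hypothetical aggregate: alignment-weighted positive discrimination.
    cas_like = float(np.mean(align * np.clip(disc, 0.0, None)))
    return p, disc, align, cas_like

if __name__ == "__main__":
    # Synthetic example: 20 models of varying ability, 50 items of varying difficulty.
    rng = np.random.default_rng(0)
    ability_true = rng.uniform(0.2, 0.9, size=20)
    difficulty = rng.uniform(0.0, 1.0, size=50)
    prob = np.clip(ability_true[:, None] - difficulty[None, :] + 0.5, 0.02, 0.98)
    correct = rng.binomial(1, prob)
    *_, score = alignment_diagnostics(correct)
    print(f"alignment-weighted discrimination (CAS-like): {score:.3f}")
```

Rerunning the same sketch on a different model population (e.g., only weak or only strong models) would change both the discrimination and alignment terms, which is the population-dependence the abstract argues is usually left implicit.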
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Provocation
Archival Status: Archival
Submission Number: 89