Keywords: LLM Benchmark Evaluation, Capability Alignment, Benchmark Quality Metrics, Difficulty–Capability Interaction, Discriminability and Saturation
TL;DR: Benchmark quality metrics are not intrinsic dataset properties; they depend on the alignment between item difficulty and the capability distribution of evaluated models, a principle we formalize and operationalize with a new diagnostic score (CAS).
Abstract: Benchmark quality metrics such as discriminability and saturation are typically reported as stable properties of datasets. We argue they are not: these metrics are computed on specific model populations and vary substantially across them. This population-dependence is rarely acknowledged in benchmark reports or leaderboards, yet it is a fundamental source of variation in how benchmark quality should be interpreted.
We formalize this position as the Capability Alignment Hypothesis: benchmark informativeness depends on the alignment between item difficulty and the capability distribution of evaluated models. Empirically, we show that discriminability follows an inverted-U relationship with difficulty, where items that are too easy or too hard for a given population yield weak discrimination. We introduce the Capability Alignment Score (CAS), combining difficulty alignment and ability-consistent discrimination, as a complementary diagnostic signal alongside existing metrics. Experiments across math and reasoning benchmarks confirm that CAS captures alignment-related structure not fully reflected in current measures.
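The abstract describes CAS only at a high level, so the following minimal sketch is an illustration under stated assumptions, not the paper's definition. It shows how per-item difficulty, ability-consistent discrimination, and an alignment-weighted aggregate could be computed from a binary models-by-items correctness matrix; the function name `alignment_diagnostics`, the alignment formula, and the final aggregation are hypothetical stand-ins for the paper's actual CAS.

```python
# Minimal sketch (NOT the paper's exact CAS formulation): computes item
# difficulty, ability-consistent discrimination, and a hypothetical
# alignment-weighted score from a binary models x items matrix.
import numpy as np

def alignment_diagnostics(correct: np.ndarray):
    """correct: (n_models, n_items) array of 0/1 outcomes."""
    ability = correct.mean(axis=1)   # per-model accuracy (capability proxy)
    p = correct.mean(axis=0)         # per-item pass rate (1 - difficulty)

    # Ability-consistent discrimination: point-biserial correlation between
    # item outcomes and model ability; saturated items (all 0 or all 1) get 0.
    disc = np.zeros(correct.shape[1])
    for j in range(correct.shape[1]):
        col = correct[:, j]
        if col.std() > 0 and ability.std() > 0:
            disc[j] = np.corrcoef(col, ability)[0, 1]

    # Difficulty alignment (assumed form): items near a 50% pass rate for this
    # population score highest; too easy or too hard items score near 0,
    # mirroring the inverted-U relationship described in the abstract.
    align = 1.0 - 2.0 * np.abs(p - 0.5)

    # Hypothetical aggregate: alignment-weighted positive discrimination.
    cas_like = float(np.mean(align * np.clip(disc, 0.0, None)))
    return p, disc, align, cas_like

if __name__ == "__main__":
    # Synthetic example: 20 models of varying ability, 50 items of varying difficulty.
    rng = np.random.default_rng(0)
    ability_true = rng.uniform(0.2, 0.9, size=20)
    difficulty = rng.uniform(0.0, 1.0, size=50)
    prob = np.clip(ability_true[:, None] - difficulty[None, :] + 0.5, 0.02, 0.98)
    correct = rng.binomial(1, prob)
    *_, score = alignment_diagnostics(correct)
    print(f"alignment-weighted discrimination (CAS-like): {score:.3f}")
```

Rerunning the same sketch on a different model population (e.g., only weak or only strong models) would change both the discrimination and alignment terms, which is the population-dependence the abstract argues is usually left implicit.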
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Provocation
Archival Status: Archival
Submission Number: 89