Submission Track: Paper Track (Tiny Paper)
Submission Category: AI-Guided Design
Keywords: LLM, Benchmark, IRT, Evaluation
TL;DR: We reveal how hidden implementation choices in LLM benchmarks bias model rankings and propose item response theory (IRT) as a solution for more transparent and reliable evaluations.
Abstract: The evaluation of large language models (LLMs) through benchmarks has become a cornerstone of AI development, guiding critical decisions about model deployment and research directions.
However, as benchmarks evolve from narrow task-specific assessments to broad capability evaluations, they become more difficult to develop, understand, and analyze.
Here, we report a \enquote{benchmark iceberg} phenomenon, in which much of the variability in model rankings stems not from true capability differences, but from hidden implementation choices beneath the surface of reported scores. Our analysis demonstrates how minor changes to these implementation details can alter model rankings, a concerning finding given benchmarks' role in shaping the AI landscape.
To address this, we leverage psychometric principles from educational testing. By adapting item response theory (IRT), we transform benchmarks from opaque leaderboards into transparent measurement instruments, revealing how hidden implementation choices currently distort our perception of model capabilities.
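For illustration only, and assuming the widely used two-parameter logistic (2PL) formulation rather than whatever specific IRT variant the paper adopts, the probability that model $i$ answers benchmark item $j$ correctly could be modeled as follows, with $\theta_i$ a latent ability, $b_j$ an item difficulty, and $a_j$ an item discrimination parameter:
% Minimal 2PL IRT sketch (assumed formulation; the paper may use a different variant).
% \theta_i : latent ability of model i
% b_j     : difficulty of benchmark item j
% a_j     : discrimination of item j (how sharply it separates abilities)
\begin{equation}
  P(x_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
\end{equation}
Under such a model, item-level parameters are estimated jointly with abilities, which is what allows implementation choices affecting individual items to be separated from genuine differences in model capability.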
AI4Mat Journal Track: Yes
Submission Number: 27