Item Response Scaling Laws: A Measurement Theory Approach to Generalizable Neural Performance Prediction
Keywords: Item Response Theory, scaling law, LLM evaluation
TL;DR: We propose Item Response Scaling Laws, which yield efficient, interpretable, and generalizable estimates of scaling behavior, validated in large-scale pre-training and test-time studies.
Abstract: Classical neural scaling laws describe how the performance of large language models (LLMs) improves with increased compute, but they are typically estimated in aggregate across all questions in a benchmark, overlooking the information carried by individual questions. Item Response Theory (IRT) offers a principled way to address this by modeling per-question characteristics, though traditional IRT is limited to binary responses fit with a Bernoulli loss. In pre-training downstream scaling, the probability of producing the correct answer over the entire vocabulary yields more informative laws than binary correctness, while in test-time scaling, repeated sampling naturally gives rise to empirical probabilities. Such probability-valued responses do not arise in human testing or LLM leaderboard evaluations, the settings where IRT has proven successful. To bridge this gap, we extend IRT with a Beta loss on empirical probability responses, naturally yielding Item Response Scaling Laws. We validate our framework in two large-scale studies: (1) pre-training downstream scaling, using 25 models from 6 families with up to 359 checkpoints on 15 NLP datasets; and (2) test-time scaling, using 15 models on 10 NLP datasets with up to 10,000 samples per question. In both settings, IRT-based approaches provide reliable and efficient estimates of scaling behavior while remaining interpretable and generalizable.
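To make the central idea concrete, the following is a minimal sketch of one way a Beta-loss IRT scaling law could be parameterized, assuming a standard 2PL IRT link, the mean-precision Beta parameterization common in Beta regression, and a linear-in-log-compute ability curve; the symbols theta_i, a_j, b_j, phi, and C_i, and the specific functional forms, are illustrative assumptions rather than the paper's stated model.

% Illustrative sketch only; the paper's exact parameterization may differ.
% Model i has ability \theta_i; item j has discrimination a_j and difficulty b_j.
\mu_{ij} = \sigma\bigl(a_j(\theta_i - b_j)\bigr)
% An empirical probability response \hat{p}_{ij} \in (0,1) gets a Beta likelihood
% with mean \mu_{ij} and precision \phi, i.e. shape parameters \mu_{ij}\phi and (1-\mu_{ij})\phi:
\hat{p}_{ij} \sim \mathrm{Beta}\bigl(\mu_{ij}\,\phi,\ (1 - \mu_{ij})\,\phi\bigr)
% A scaling law arises by tying ability to compute C_i, e.g. linearly in log-compute:
\theta_i = \alpha + \beta \log C_i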
Primary Area: datasets and benchmarks
Submission Number: 23156