Keywords: evaluations, benchmarks, scaling laws, emergent abilities, capabilities, frontier models, foundation models
TL;DR: What makes predicting downstream capabilities of frontier AI models with scale difficult?
Abstract: Predictable behavior from scaling AI systems is a desirable property. While a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities change with scale is muddier: previous papers debated the origins of emergent abilities, and recent work claimed that specific downstream capabilities become predictable only beyond a specific pretraining loss or when aggregated across dozens of benchmarks. In this work, we ask: \textit{what makes predicting specific downstream capabilities with scale difficult?} We identify a critical factor contributing to this difficulty on multiple-choice benchmarks. Using five model families and twelve widely used benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We demonstrate that this degradation is caused by metrics that require comparing the correct choice against a small number of specific incorrect choices, meaning that predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct behavior with scale, but also how probability mass changes on specific incorrect behaviors with scale. We empirically study how probability mass on the correct choice covaries with mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work explains why pretraining scaling laws are regarded as more predictable and contributes towards establishing scaling-predictable evaluations of AI models.
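Illustrative sketch (not taken from the submission): the abstract states that multiple-choice performance is computed from negative log likelihoods through a sequence of transformations. The Python below shows one plausible chain, assumed here for illustration: NLL to probability mass on each choice, then renormalization over the listed choices, then accuracy. The function name `multiple_choice_metrics`, the dictionary keys, and the example NLL values are all hypothetical.

```python
import math


def multiple_choice_metrics(choice_nlls, correct_index):
    """Turn per-choice negative log likelihoods (nats) into downstream metrics.

    Assumed, illustrative pipeline; the exact transformations used in the
    paper are not specified in the abstract.
    """
    # Step 1: probability mass the model places on each choice.
    probs = [math.exp(-nll) for nll in choice_nlls]
    p_correct = probs[correct_index]

    # Step 2: renormalize over the listed choices only. From here on, the
    # value depends on the mass placed on the specific incorrect choices.
    total = sum(probs)
    p_choices_correct = p_correct / total

    # Step 3: accuracy is 1 only if the correct choice has the highest mass.
    best_index = max(range(len(probs)), key=probs.__getitem__)
    accuracy = 1.0 if best_index == correct_index else 0.0

    return {
        "p_correct": p_correct,
        "p_choices_correct": p_choices_correct,
        "accuracy": accuracy,
    }


# Two hypothetical models assign the same NLL to the correct choice (index 0)
# but different NLLs to one incorrect choice.
model_a = multiple_choice_metrics([1.0, 3.0, 3.0, 3.0], correct_index=0)
model_b = multiple_choice_metrics([1.0, 0.5, 3.0, 3.0], correct_index=0)
print(model_a)
print(model_b)
```

In this toy example both models place identical mass on the correct choice, yet their accuracies differ because one of them places more mass on a particular incorrect choice. This is the sense in which predicting the downstream metric also requires predicting how mass on incorrect choices changes.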
Submission Number: 25