Dimension-adapted Momentum Outscales SGD

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 Spotlight · CC BY 4.0
Keywords: power-law random features, scaling laws, momentum, random matrix theory, acceleration, stochastic optimization, compute-optimal
TL;DR: We prove that stochastic momentum can improve the scaling law exponents over SGD on power-law random features by choosing momentum hyperparameters that depend appropriately on data dimension or model size.
Abstract: We investigate scaling laws for stochastic momentum algorithms on the power-law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by the data and target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields scaling law exponents identical to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters with model size and data complexity. DANA achieves this outscaling phenomenon, which also improves compute-optimal scaling behavior, across a broad range of data and target complexities where traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions, and large-scale text experiments with LSTMs show that DANA's improved loss exponents over SGD hold in a practical setting.
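To make the core idea concrete, below is a minimal illustrative sketch (not the paper's actual DANA schedule or power-law random features setup): a stochastic Nesterov-style momentum loop on a synthetic power-law quadratic, comparing a fixed momentum parameter (SGD-M) against a momentum parameter that grows toward 1 with the dimension d. The specific schedule `1 - d**(-0.5)`, the exponents `alpha` and `beta`, the learning rate, and the masked-gradient noise model are all placeholder assumptions chosen only to show what "dimension-adapted momentum hyperparameters" could look like in code.

```python
# Illustrative sketch only. The momentum schedule, exponents, and noise model
# below are placeholder assumptions, NOT the DANA schedule from the paper.
import numpy as np

rng = np.random.default_rng(0)

d = 2000        # model size / dimension (synthetic example)
alpha = 1.2     # data complexity: eigenvalue decay lambda_j ~ j^{-alpha} (placeholder)
beta = 0.3      # target complexity: signal decay (placeholder)
steps = 20000

# Power-law quadratic: eigenvalues and target coefficients decay polynomially.
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
x_star = np.arange(1, d + 1, dtype=float) ** (-beta)
x_star /= np.linalg.norm(x_star)

def stochastic_grad(x):
    """Unbiased gradient estimate of f(x) = 0.5 * sum_j lam_j (x_j - x*_j)^2,
    using a random coordinate mask as a simple stand-in for minibatch noise."""
    mask = rng.random(d) < 0.5
    return 2.0 * mask * lam * (x - x_star)

def run(momentum_fn, lr):
    """Nesterov-style stochastic momentum loop; momentum may depend on d."""
    x = np.zeros(d)
    v = np.zeros(d)
    losses = []
    for t in range(steps):
        mu = momentum_fn(t, d)
        # Nesterov update: gradient evaluated at the look-ahead point x + mu*v.
        g = stochastic_grad(x + mu * v)
        v = mu * v - lr * g
        x = x + v
        if t % 1000 == 0:
            losses.append(0.5 * np.sum(lam * (x - x_star) ** 2))
    return losses

# Fixed momentum (SGD-M baseline) vs. a dimension-adapted momentum
# (placeholder schedule: momentum approaches 1 as d grows).
sgdm = run(lambda t, d: 0.9, lr=0.05)
dim_adapted = run(lambda t, d: 1.0 - d ** (-0.5), lr=0.05)

print("SGD-M        loss trace:", [f"{l:.3e}" for l in sgdm[-3:]])
print("dim-adapted  loss trace:", [f"{l:.3e}" for l in dim_adapted[-3:]])
```

The only structural difference between the two runs is the `momentum_fn`: the baseline uses a dimension-independent constant, while the second passes a momentum that scales with d, which is the kind of hyperparameter adaptation the abstract attributes to DANA.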
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 4253