Asymmetric Scaling Laws from Sparse Features

John Sous

Published: 29 May 2026, Last Modified: 29 May 2026. Venue: HiLD at ICML 2026 (Poster). License: CC BY 4.0

Keywords: Scaling laws

TL;DR: Sparse features lead to asymmetric scaling laws.

Abstract: We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents, one for the overparameterized regime and one for the underparameterized regime, with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations. Experiments validating the theory can be found at SparseScaling.
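
The bottleneck described in the abstract, rare coordinates that never appear in the training input, can be illustrated with a minimal sketch. It assumes that each coordinate is active independently with probability p in each sample. This is an illustrative assumption, not the paper's stated model, and the function names below are hypothetical. Under that assumption, a coordinate is absent from all n training samples with probability (1 - p)^n.

```python
import numpy as np

def prob_never_observed(p, n):
    """Probability that a coordinate, active with probability p per sample
    (illustrative assumption), is inactive in all n independent samples."""
    return (1.0 - p) ** n

def simulate_unseen_fraction(p, n, d, seed=0):
    """Monte Carlo estimate: fraction of d coordinates never active
    across n samples drawn under the same independence assumption."""
    rng = np.random.default_rng(seed)
    active = rng.random((n, d)) < p          # (n, d) boolean activation mask
    seen = active.any(axis=0)                # True if a coordinate appeared at least once
    return float(np.mean(~seen))

if __name__ == "__main__":
    p, n, d = 0.01, 100, 5000
    print("closed form:", prob_never_observed(p, n))
    print("simulated:  ", simulate_unseen_fraction(p, n, d))
```

In this toy setting the unseen fraction decays with n at a rate set by p, so sparser coordinates need more data before they are observed at all. This is consistent in spirit with a frontier that favors dataset size, though the paper's actual derivation may differ.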

Submission Number: 31