Scale-time Equivalence in Neural Network Training

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: generalization, neural network, scaling law, double descent
TL;DR: We show that scaling the size of a network is equivalent to increasing training time; using this, we provide a novel explanation for scale-wise double descent.
Abstract: Neural networks have demonstrated remarkable performance improvements as model size, training time, and data volume have increased, but the relationships among these factors remain poorly understood. We develop a theoretical framework to investigate the interplay between model size and training time, introducing the concept of scale-time equivalence: the idea that, in certain training regimes, a small model trained for a long time can achieve performance similar to that of a larger model trained for a shorter time. To analyze this, we model neural network training as gradient flow in random subspaces of larger non-linear models. In this setting, we show theoretically that network scale and training time can be traded off against each other. Empirically, we validate scale-time equivalence on MLPs and CNNs trained with gradient descent on standard vision benchmarks (CIFAR-10, SVHN, and MNIST). We then investigate the consequences of scale-time equivalence for double descent, the phenomenon in which model performance varies non-monotonically with training data volume, model scale, and training time. In regimes where scale-time equivalence holds, we show that double descent with respect to training time and with respect to model scale may share a common cause: overfitting to noise early in training. Through this, we provide potential explanations for several previously unexplained empirical phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance.
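To make the random-subspace setting concrete, the minimal sketch below (not the paper's construction; the architecture, shapes, data, and hyperparameters are illustrative assumptions) trains only a low-dimensional coordinate vector z whose image under a fixed random projection P perturbs the weights of a larger non-linear model, so gradient descent is restricted to a random subspace of the large model's parameter space. In the paper's framing, the subspace dimension k stands in for model scale, which can then be traded off against the number of training steps.

```python
# Illustrative sketch: gradient descent in a random subspace of a larger model.
# All dimensions, data, and hyperparameters are placeholders, not the paper's setup.
import torch

torch.manual_seed(0)

d_in, d_hidden, d_out = 32, 64, 10
D = d_in * d_hidden + d_hidden + d_hidden * d_out + d_out  # full parameter count
k = 50                                                     # subspace ("scale") dimension

theta0 = 0.1 * torch.randn(D)            # fixed initialization of the large model
P = torch.randn(D, k) / k ** 0.5         # fixed random projection defining the subspace
z = torch.zeros(k, requires_grad=True)   # trainable subspace coordinates

def forward(x, z):
    """Two-layer MLP whose full weight vector is theta0 + P @ z, sliced into layers."""
    theta = theta0 + P @ z
    i = 0
    W1 = theta[i:i + d_in * d_hidden].view(d_in, d_hidden); i += d_in * d_hidden
    b1 = theta[i:i + d_hidden]; i += d_hidden
    W2 = theta[i:i + d_hidden * d_out].view(d_hidden, d_out); i += d_hidden * d_out
    b2 = theta[i:i + d_out]
    return torch.relu(x @ W1 + b1) @ W2 + b2

# Synthetic data stands in for the vision benchmarks mentioned in the abstract.
x = torch.randn(256, d_in)
y = torch.randint(0, d_out, (256,))

opt = torch.optim.SGD([z], lr=0.1)       # gradient descent restricted to the subspace
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(forward(x, z), y)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.3f}")
```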
Primary Area: learning theory
Submission Number: 13481