Scaling Theory for SlowRunning: Model size, Ensembling, and Training Horizon in the Multi-Epoch Regime
Keywords: Scaling Law, SGD, Multi-Epoch Training, Generalization, Dynamical Mean Field Theory, Random Matrices
TL;DR: A theory of multi-pass SGD and ensembling for structured random features
Abstract: We study the learning dynamics of multi-epoch training both empirically and theoretically. Consistent with empirical work on language model training, naively scaling the number of training epochs and the model size fails to deliver monotonic improvements in performance under multiple passes over the data. However, denoising methods such as ensembling, regularization, and averaging over data shuffles can improve performance in the multi-epoch regime. We theoretically analyze multi-epoch training in a solvable power-law random feature model using dynamical mean field theory. This theory predicts how train and test losses evolve over SGD iterations, both within an epoch and across epochs. We show that SGD noise adds variance across steps within an epoch, while systematic overfitting arises from cross-epoch correlations in gradients, which build up as response functions in the theory. Using the model, we analyze the effects of ensembling, model size, and SGD noise. We then conduct language model pretraining experiments showing that, in regimes where learning curves are non-monotonic, increasing ensemble size can be preferable to increasing width at fixed compute. Using our model, we provide a theoretical argument for this finding.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 64