Understanding Language Model Scaling Laws in Terms of Training Dynamics via Loss Deceleration and Zero-Sum Learning
Abstract: This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements.
We find that these can be tied back to loss deceleration: an abrupt slowdown in the rate of loss improvement early in training, characterized by piece-wise linear behaviour in log-log space. Smoothly broken power laws allow us to parametrically measure this transition and to express scaling improvements as a function of (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We hypothesize and validate \textit{zero-sum learning} (ZSL) as a mechanistic explanation of loss deceleration that sheds new light on how scaling improves loss by mitigating this transition. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics in which loss cannot improve on one set of tokens without degrading on another, bottlenecking the rate at which overall loss can improve. In contrast to previous work on explaining scaling laws, ZSL is grounded in training dynamics and could be targeted directly to improve loss independently of scale.
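To make the deceleration measurement concrete, below is a minimal sketch of how a smoothly broken power law could be fit to a training-loss curve. It assumes one standard two-regime parameterization with a break at step t_d and loss L_d, pre- and post-break log-log slopes alpha_1 and alpha_2, and a smoothness parameter delta; the paper's exact functional form and fitting procedure may differ, and the names (`fit_deceleration`, `steps`, `losses`) are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_broken_power_law(log_t, log_Ld, log_td, alpha1, alpha2, delta):
    # Smoothly broken power law evaluated in log-log space:
    # slope -alpha1 for t << t_d, slope -alpha2 for t >> t_d,
    # with a transition of width ~delta around the break point (t_d, L_d).
    u = log_t - log_td
    transition = (alpha1 - alpha2) * delta * (np.logaddexp(0.0, u / delta) - np.log(2.0))
    return log_Ld - alpha1 * u + transition

def fit_deceleration(steps, losses):
    # Fit in log-log space so both regimes are weighted comparably.
    log_t, log_L = np.log(steps), np.log(losses)
    p0 = [log_L[len(log_L) // 2], log_t[len(log_t) // 2], 0.5, 0.05, 0.1]
    popt, _ = curve_fit(log_broken_power_law, log_t, log_L, p0=p0, maxfev=20000)
    log_Ld, log_td, alpha1, alpha2, delta = popt
    return {"L_d": np.exp(log_Ld), "t_d": np.exp(log_td),
            "alpha_pre": alpha1, "alpha_post": alpha2}
```

Under this reading, scaling improvements would show up as a lower fitted L_d and a steeper post-break slope alpha_post; the gradient opposition behind ZSL could be quantified separately, e.g. via cosine similarities between per-token gradients, which is not shown here.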
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, scaling, continual learning
Contribution Types: Model analysis & interpretability, Approaches for low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5861