Understanding Language Model Scaling Laws in Terms of Training Dynamics via Loss Deceleration and Zero-Sum Learning
Abstract: This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements.
We find that these can be tied back to loss deceleration: an abrupt slowdown in the rate of loss improvement early in training, characterized by piece-wise linear behaviour in log-log space. Smoothly broken power laws allow us to parametrically measure this transition and to express scaling improvements as a function of (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We hypothesize and validate \textit{zero-sum learning} (ZSL) as a mechanistic explanation of loss deceleration that sheds new light on how scaling improves loss by mitigating this transition. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics in which loss cannot improve on one set of tokens without degrading on another, bottlenecking the rate at which overall loss can improve. In contrast to previous work on explaining scaling laws, ZSL is grounded in training dynamics and could be targeted directly to improve loss independently of scale.
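To make the deceleration measurement concrete, below is a minimal sketch of how a smoothly broken power law could be fit to a training-loss curve. It assumes one standard two-regime parameterization with a break at step t_d and loss L_d, pre- and post-break log-log slopes alpha_1 and alpha_2, and a smoothness parameter delta; the paper's exact functional form and fitting procedure may differ, and the names (`fit_deceleration`, `steps`, `losses`) are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_broken_power_law(log_t, log_Ld, log_td, alpha1, alpha2, delta):
    # Smoothly broken power law evaluated in log-log space:
    # slope -alpha1 for t << t_d, slope -alpha2 for t >> t_d,
    # with a transition of width ~delta around the break point (t_d, L_d).
    u = log_t - log_td
    transition = (alpha1 - alpha2) * delta * (np.logaddexp(0.0, u / delta) - np.log(2.0))
    return log_Ld - alpha1 * u + transition

def fit_deceleration(steps, losses):
    # Fit in log-log space so both regimes are weighted comparably.
    log_t, log_L = np.log(steps), np.log(losses)
    p0 = [log_L[len(log_L) // 2], log_t[len(log_t) // 2], 0.5, 0.05, 0.1]
    popt, _ = curve_fit(log_broken_power_law, log_t, log_L, p0=p0, maxfev=20000)
    log_Ld, log_td, alpha1, alpha2, delta = popt
    return {"L_d": np.exp(log_Ld), "t_d": np.exp(log_td),
            "alpha_pre": alpha1, "alpha_post": alpha2}
```

Under this reading, scaling improvements would show up as a lower fitted L_d and a steeper post-break slope alpha_post; the gradient opposition behind ZSL could be quantified separately, e.g. via cosine similarities between per-token gradients, which is not shown here.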
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, scaling, continual learning
Contribution Types: Model analysis & interpretability, Approaches for low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5861