Gradient Dissent in Language Model Training and Saturation

Published: 16 Jun 2024, Last Modified: 19 Jul 2024 | HiLD at ICML 2024 Poster | CC BY 4.0
Keywords: language model, learning dynamics, opposing gradients, gradient starvation
Abstract: We seek to shed light on language model (LM) saturation from the perspective of learning dynamics. To this end, we define a decomposition of the cross-entropy gradient, which forms a shared low-dimensional basis for analyzing the training dynamics of models across scales. Intuitively, this decomposition consists of attractive and repulsive components that increase the logit of the correct class and decrease the logits of incorrect classes, respectively. Our analysis in this subspace reveals a phenomenon we term \textit{gradient dissent}, characterized by gradient components becoming systematically opposed such that the loss cannot be improved along one component without being degraded along the other. Notably, we find that complete opposition, which we term \textit{total dissent}, reliably occurs in tandem with the saturation of smaller LMs. Based on these results, we hypothesize that gradient dissent can provide a useful foundation for better understanding and mitigating saturation.
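
To make the decomposition concrete, here is a minimal sketch of one way the attractive/repulsive split could be computed for a softmax cross-entropy loss. The split of the logit gradient dL/dz = softmax(z) - onehot(y) into a correct-class term and an incorrect-class term is standard; the choice to measure opposition between the two components after pulling them back into representation space through an unembedding matrix (the `W_unembed` argument) is an illustrative assumption rather than the paper's exact construction, and `dissent_score` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def ce_gradient_components(logits, targets):
    """Split dL/dz = softmax(z) - onehot(y) into an attractive part
    (nonzero only at the correct-class logit, pushing it up under
    gradient descent) and a repulsive part (nonzero only at the
    incorrect-class logits, pushing them down)."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, num_classes=logits.shape[-1]).to(logits.dtype)
    g_attr = (probs - 1.0) * onehot   # negative entries at the correct class
    g_rep = probs * (1.0 - onehot)    # positive entries at incorrect classes
    return g_attr, g_rep              # g_attr + g_rep == probs - onehot

def dissent_score(g_attr, g_rep, W_unembed):
    """Cosine similarity between the two components after pulling them
    back to the hidden representation through the unembedding matrix
    (assuming logits = h @ W_unembed.T, W_unembed of shape (vocab, d)).
    Values near -1 indicate the components are fully opposed."""
    h_attr = g_attr @ W_unembed       # (batch, d)
    h_rep = g_rep @ W_unembed
    return F.cosine_similarity(h_attr.flatten(), h_rep.flatten(), dim=0)
```

Under this reading, tracking `dissent_score` over training would give a scalar trace of how opposed the two components become; the abstract's "total dissent" would correspond to the score approaching -1.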
Student Paper: Yes
Submission Number: 57