Abstract: Adaptive gradient methods, notably Adam~\citep{kingma2014adam, loshchilov2017decoupled}, have become indispensable for optimizing neural networks, particularly in conjunction with Transformers~\citep{vaswani2017attention, dosovitskiy2020an}. In this paper, we present a novel optimization anomaly called the \emph{Slingshot Effect}, which manifests during extremely late stages of training. A distinctive characteristic of this phenomenon is a cyclic phase transition between stable and unstable training regimes, evidenced by the cyclic behavior of the norm of the last layer's weights. Although the Slingshot Effect can be easily reproduced in more general settings, it does not align with any known optimization theory, emphasizing the need for in-depth examination.
Moreover, we observe that Grokking, as reported by~\citet{power2021grokking}, occurs predominantly at the onset of Slingshot Effects and is absent without them, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers at late training stages, urging a revised theoretical analysis of their origin.
Our study sheds light on an intriguing optimization behavior that has significant implications for understanding the inner workings of adaptive gradient methods.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Added Appendix D to address the revision comment made by the AE for this paper
Assigned Action Editor: ~Abhishek_Kumar1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1328