Large Learning Rates without the Agonizing Pain: Dispelling the Curse of Singularities in Deep Neural Networks
Keywords: learning rate, training stability, parametric singularity
Abstract: Employing large learning rates (LRs) in deep learning can accelerate convergence and improve generalization, but it can also cause training instability and loss explosions: determining an appropriate LR is often a laborious and painful art. Our study of the fine-grained behavior of parametric singularities, specifically the stable ranks of the weight matrices of network components, reveals a strong connection between these singularities and training instability. As training progresses, parametric singularities trend upward, a phenomenon directly aggravated by large LRs. Crucially, several training steps before prominent instabilities such as gradient explosions, we observe unusually high parametric singularities across the network components, leading to rank-deficient representations. These representations, in turn, amplify parametric singularities during backpropagation, creating a vicious cycle that eventually results in loss explosions. We refer to this phenomenon as \textit{the curse of singularities}.
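For reference, the stable rank tracked above has a standard closed form: the squared Frobenius norm of a weight matrix divided by its squared spectral norm. The minimal PyTorch sketch below computes it per weight matrix; the helper names and the monitoring loop are illustrative assumptions, not part of the paper's released code.

import torch

def stable_rank(weight: torch.Tensor) -> float:
    # Stable rank = ||W||_F^2 / ||W||_2^2, i.e. the sum of squared singular
    # values divided by the largest squared singular value.
    s = torch.linalg.svdvals(weight)  # singular values in descending order
    return float((s ** 2).sum() / (s[0] ** 2))

def log_stable_ranks(model: torch.nn.Module) -> dict:
    # Hypothetical monitoring helper: record the stable rank of every 2-D
    # weight matrix so a drift in the spectra can be spotted a few steps
    # before the loss explodes.
    with torch.no_grad():
        return {name: stable_rank(p) for name, p in model.named_parameters()
                if p.ndim == 2}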
Building on this understanding, we propose a lightweight and robust stabilization method called Parametric Singularity Smoothing (PSS), which allows for early intervention and mitigates impending instability by smoothing the singular spectra of weight matrices, thereby preventing the curse of singularities.
This approach is easy to implement, works at any stage of training (it can even restore stable training after instability has set in), has negligible computational overhead, and, most importantly, frees us from painstaking LR tuning to avoid instabilities. Experimental results across various datasets, networks, and optimizers demonstrate that our approach allows a 5-10$\times$ increase in LR without producing instability, yielding better training efficiency and generalization. Our code is available at https://anonymous.4open.science/r/ICLR_stability-C69C so that others can use our method and reproduce the experiments.
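The abstract does not spell out the exact rule PSS uses to smooth the singular spectra, so the sketch below is only one plausible illustration: each singular value is shrunk toward the mean of the spectrum, which flattens the spectrum (raising the stable rank) before the weight is rebuilt in place. The function names and the blending coefficient alpha are our own assumptions, not the authors' implementation; see the linked repository for the actual method.

import torch

@torch.no_grad()
def smooth_singular_spectrum(weight: torch.Tensor, alpha: float = 0.1) -> None:
    # Illustrative smoothing rule (not necessarily the paper's PSS): blend
    # every singular value toward the spectrum mean, then rebuild W in place.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    S_smoothed = (1.0 - alpha) * S + alpha * S.mean()
    weight.copy_(U @ torch.diag(S_smoothed) @ Vh)

def apply_spectrum_smoothing(model: torch.nn.Module, alpha: float = 0.1) -> None:
    # Hypothetical usage: smooth every 2-D weight matrix when the monitored
    # stable ranks signal impending instability, then resume training.
    for p in model.parameters():
        if p.ndim == 2:
            smooth_singular_spectrum(p, alpha)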
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7875