Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We prove that AdaGrad and Adam fail to achieve good high-probability convergence guarantees and show how gradient clipping fixes this issue under the heavy-tailed noise assumption.
Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. For the latter, the noise in the stochastic gradients is typically heavy-tailed. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods in this setting is limited. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence when the noise is heavy-tailed. We also show that gradient clipping fixes this issue: we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping, with and without delay, for smooth convex and non-convex stochastic optimization under heavy-tailed noise. We further extend our results to AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of the clipped versions of AdaGrad/Adam in handling heavy-tailed noise.
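To make the construction concrete, below is a minimal sketch of how gradient clipping composes with the AdaGrad-Norm update (a single scalar adaptive stepsize shared across coordinates). The function names, constants, and stepsize schedule here are illustrative assumptions, not the exact configuration analyzed in the paper; see the linked repository for the authors' implementation.

```python
import numpy as np

def clip(g, clip_level):
    """Clip the stochastic gradient g to Euclidean norm at most clip_level."""
    norm = np.linalg.norm(g)
    if norm > clip_level:
        return g * (clip_level / norm)
    return g

def clipped_adagrad_norm(grad_oracle, x0, eta=1.0, clip_level=1.0, b0=1e-8, n_steps=1000):
    """Sketch of AdaGrad-Norm with clipping: b_t accumulates the squared
    norms of the *clipped* stochastic gradients, and every step uses the
    scalar stepsize eta / b_t."""
    x = np.array(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(n_steps):
        g = clip(grad_oracle(x), clip_level)  # clipped stochastic gradient
        b_sq += np.dot(g, g)                  # accumulate squared norm
        x -= eta / np.sqrt(b_sq) * g          # scalar adaptive stepsize
    return x

# Toy usage with heavy-tailed (Student-t) gradient noise on a quadratic:
rng = np.random.default_rng(0)
oracle = lambda x: x + rng.standard_t(df=2, size=x.shape)
x_final = clipped_adagrad_norm(oracle, x0=5.0 * np.ones(10))
```

Clipping bounds each update's magnitude, which is what allows high-probability guarantees with only polylogarithmic dependence on the confidence level even when the noise has unbounded higher moments.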
Lay Summary: Modern AI models, including large language models like ChatGPT and BERT, are trained using optimization techniques that adjust the model based on noisy feedback. Two popular techniques for this are Adam and AdaGrad, which adapt their learning speed over time. However, in many real-world cases, the noise in the feedback is unpredictable and extreme — known as heavy-tailed noise — which can cause these methods to behave unreliably. In our study, we show that without any fixes, Adam and AdaGrad can perform poorly under such noisy conditions. To address this, we apply a technique called gradient clipping, which limits extreme updates and helps stabilize learning. We provide new mathematical guarantees that clipped versions of Adam and AdaGrad work reliably even when the noise is heavy and erratic. We also show through experiments — including fine-tuning real-world language models — that clipping consistently improves their performance. Our findings suggest that using clipped adaptive methods can make AI training not only faster but also more reliable when the training environment is noisy and unpredictable — a common situation in today’s large-scale AI systems.
Link To Code: https://github.com/yaroslavkliukin/Clipped-AdaGrad-and-Adam
Primary Area: Optimization->Stochastic
Keywords: stochastic optimization, heavy-tailed noise, adaptive methods, gradient clipping, high-probability convergence bounds
Submission Number: 964