Keywords: Nonconvex Optimization, Stochastic Optimization, Gradient Clipping, Normalization, Normalized SGD, NSGD, Momentum, High-Probability Guarantee, In-Probability Guarantee, Parameter-Free
Abstract: Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard bounded-variance assumption in stochastic optimization. Gradient clipping has emerged as a popular tool for handling this heavy-tailed noise, achieving good performance in this setting both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Third, even with this knowledge, current sample complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues, we study the convergence of Normalized SGD (NSGD). First, we establish a parameter-free sample complexity guarantee for NSGD of $\widetilde{\mathcal{O}}\left(\frac{\Delta_1^4 + L^4}{\varepsilon^4} + \left(\frac{\sigma}{\varepsilon}\right)^{\frac{2p}{p-1}}\right)$ for finding an $\varepsilon$-stationary point, where $p\in(1,2]$ is the tail index of the heavy-tailed noise distribution. In the setting where all problem parameters are known, we show that this complexity can be improved to $\mathcal{O}\left(\frac{\Delta_1 L}{\varepsilon^2} + \frac{\Delta_1 L}{\varepsilon^2}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{p}{p-1}}\right)$, matching the previously known lower bound for all first-order methods in all problem-dependent parameters. Finally, we establish high-probability convergence of NSGD with a mild logarithmic dependence on the failure probability.
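For readers unfamiliar with the method, below is a minimal sketch of a Normalized SGD (NSGD) update with momentum. The function name, step size, momentum parameter, and stopping rule are illustrative assumptions, not the submission's exact schedule; the key point is that the update direction is normalized, so step lengths stay bounded even when heavy-tailed noise produces occasional huge gradients.

```python
import numpy as np

def nsgd_momentum(grad_fn, x0, steps, lr=1e-2, beta=0.9, eps=1e-12):
    """Sketch of Normalized SGD with momentum (illustrative hyperparameters).

    grad_fn(x) should return a stochastic gradient estimate at x.
    Each step moves by exactly lr in the direction of the momentum buffer,
    regardless of the gradient magnitude.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # momentum buffer
    for _ in range(steps):
        g = grad_fn(x)                     # stochastic gradient sample
        m = beta * m + (1.0 - beta) * g    # exponential moving average
        x = x - lr * m / (np.linalg.norm(m) + eps)  # unit-norm (normalized) step
    return x
```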
Submission Number: 58