From Gradient Clipping to Normalization for Heavy Tailed SGD

Published: 22 Jan 2025, Last Modified: 11 Mar 2025 | AISTATS 2025 Poster | CC BY 4.0
Abstract: Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard bounded-variance assumption in stochastic optimization. Gradient clipping has emerged as a popular tool for handling this heavy-tailed noise, as it achieves good performance both theoretically and practically. However, the current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which stand in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sample complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues, and motivated by practical observations, we connect gradient clipping to its close relative, Normalized SGD (NSGD), and study its convergence properties. First, we establish a parameter-free sample complexity for NSGD of $\mathcal{O}\left(\varepsilon^{-\frac{2p}{p-1}}\right)$ to find an $\varepsilon$-stationary point, assuming only a finite $p$-th central moment of the noise, $p\in(1,2]$. Furthermore, we prove the tightness of this result by providing a matching algorithm-specific lower bound. In the setting where all problem parameters are known, we show this complexity improves to $\mathcal{O}\left(\varepsilon^{-\frac{3p-2}{p-1}}\right)$, matching the previously known lower bound for all first-order methods in all problem-dependent parameters. Finally, we establish high-probability convergence of NSGD with a mild logarithmic dependence on the failure probability. Our work complements the studies of gradient clipping under heavy-tailed noise, improving the sample complexities of existing algorithms and offering an alternative mechanism to achieve high-probability convergence.
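For context, the two updates discussed in the abstract are commonly written as follows; the step size $\gamma_t$ and clipping threshold $\tau$ here are generic placeholders, not the specific choices analyzed in the paper:
$$
\text{Clipped SGD:}\quad x_{t+1} = x_t - \gamma_t \min\left\{1, \tfrac{\tau}{\|g_t\|}\right\} g_t,
\qquad
\text{NSGD:}\quad x_{t+1} = x_t - \gamma_t \frac{g_t}{\|g_t\|},
$$
where $g_t$ is a stochastic gradient at $x_t$. In this standard formulation, clipping takes a plain SGD step when $\|g_t\| \le \tau$ and a normalized step otherwise, which is one common way to view NSGD as a close relative of gradient clipping.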
Submission Number: 832