Keywords: heavy-tailed noise, SignSGD, high-probability bounds, generalized smoothness
Abstract: In recent years, non-convex optimization problems have increasingly been described by the generalized $(L_0, L_1)$-smoothness assumption rather than the standard one. Meanwhile, the severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with bounded $\kappa$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions such as clipping or normalization. In theory, we prove the first known high-probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noise with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, $\texttt{SignSGD}$ with batching achieves sample complexity $\tilde{O}\left(\left(\frac{\Delta L_0}{\varepsilon^2} + \frac{\Delta L_1}{\varepsilon}\right)\left[1 + \left(\frac{\sigma}{\varepsilon}\right)^{\frac{\kappa}{\kappa-1}}\right]\right)$ for $\kappa \in (1,2]$. Under the assumption of symmetric noise, $\texttt{SignSGD}$ with Majority Voting works robustly over the whole range $\kappa \in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{\Delta L_0}{\varepsilon^2} + \frac{\Delta L_1}{\varepsilon}\right)\left[\frac{1}{\kappa^2} + \frac{\sigma^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-free methods, Polyak-Lojasiewicz functions, and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
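Since the abstract only names the method, the following is a minimal PyTorch sketch of the $\texttt{SignSGD}$-with-batching update it describes: average a batch of stochastic gradients, then step by the coordinate-wise sign. All names (`signsgd_with_batching`, `noisy_loss`) and hyperparameter choices here are hypothetical illustrations, not the paper's implementation.

```python
import torch

def signsgd_with_batching(loss_fn, params, lr, batch_size, steps):
    """Minimal SignSGD with batching: average `batch_size` stochastic
    gradients, then move each parameter by -lr * sign(averaged gradient)."""
    for _ in range(steps):
        # Averaging over a batch tames heavy-tailed gradient noise.
        avg_grads = [torch.zeros_like(p) for p in params]
        for _ in range(batch_size):
            grads = torch.autograd.grad(loss_fn(params), params)
            for a, g in zip(avg_grads, grads):
                a += g / batch_size
        # Sign-based update: only the sign of each coordinate is used,
        # so a single extreme noise realization cannot dominate the step.
        with torch.no_grad():
            for p, g in zip(params, avg_grads):
                p -= lr * torch.sign(g)

# Toy usage: minimize ||x||^2 with additive Student-t gradient noise
# (df=1.5, so only moments of order kappa < 1.5 are finite).
torch.manual_seed(0)
x = torch.randn(10, requires_grad=True)
noise = torch.distributions.StudentT(df=1.5)

def noisy_loss(params):
    (p,) = params
    return (p ** 2).sum() + (noise.sample(p.shape) * p).sum()

signsgd_with_batching(noisy_loss, [x], lr=0.01, batch_size=32, steps=200)
print(x.norm())  # should be small after optimization
```

In the Majority Voting variant mentioned above, each worker would instead transmit only the sign of its stochastic gradient, and the update would use the coordinate-wise majority of those signs (the sign of their sum).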
Primary Area: optimization
Submission Number: 20251