Gradient Clipping Helps in Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Eduard Gorbunov; Marina Danilova; Innokentiy Andreevich Shibaev; Pavel Dvurechensky; Alexander Gasnikov

21 May 2021 (modified: 05 May 2023) | NeurIPS 2021 Submitted | Readers: Everyone

Keywords: convex optimization, stochastic optimization, non-smooth optimization, heavy tails, clipping, SGD, high-probability bounds

TL;DR: The first high-probability complexity bounds with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization with heavy-tailed noise.

Abstract: Thanks to their practical efficiency and the random nature of the data, stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to return a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. It is therefore essential to guarantee theoretically that algorithms achieve a small objective residual with high probability. Existing complexity bounds for non-smooth stochastic convex optimization either depend on the confidence level as a negative power, or depend on it logarithmically but only under the additional assumption of a sub-Gaussian (light-tailed) noise distribution, which may not hold in practice, e.g., in several NLP tasks. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods we provide an extension to strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all regimes, and the second one is optimal in the non-smooth setting.
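
For readers unfamiliar with gradient clipping, the following is a minimal sketch of the general idea, not the authors' methods: a stochastic (sub)gradient is rescaled whenever its Euclidean norm exceeds a threshold, which limits how far a single heavy-tailed sample can move the iterate. The names `clip`, `clipped_sgd`, and `noisy_l1_subgrad`, the constant stepsize and clipping level, the iterate averaging, and the symmetric Pareto noise model are all illustrative assumptions; the paper's stepsize rules and its accelerated method are not reproduced here.

```python
import math
import random


def clip(g, lam):
    """Return g rescaled so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= lam:
        return list(g)
    scale = lam / norm
    return [scale * x for x in g]


def clipped_sgd(stoch_grad, x0, step, lam, iters):
    """Plain SGD with clipped stochastic (sub)gradients.

    Assumption: constant step and clipping level, chosen for illustration.
    Returns the running average of the iterates.
    """
    x = list(x0)
    avg = [0.0] * len(x)
    for k in range(iters):
        g = clip(stoch_grad(x), lam)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        # Incremental mean of the iterates x_1, ..., x_{k+1}.
        avg = [a + (xi - a) / (k + 1) for a, xi in zip(avg, x)]
    return avg


def noisy_l1_subgrad(x, rng, alpha=2.5):
    """Hypothetical oracle: subgradient of ||x||_1 plus symmetric Pareto noise.

    The noise has zero mean by symmetry and finite variance for alpha > 2,
    but its tails are much heavier than Gaussian.
    """
    out = []
    for xi in x:
        sign = (xi > 0) - (xi < 0)
        noise = rng.choice((-1.0, 1.0)) * (rng.paretovariate(alpha) - 1.0)
        out.append(sign + noise)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    x_avg = clipped_sgd(lambda x: noisy_l1_subgrad(x, rng),
                        x0=[1.0, -2.0], step=0.01, lam=2.0, iters=5000)
    print("averaged iterate:", x_avg)
```

Because the clipped direction has norm at most `lam`, each step moves the iterate by at most `step * lam` regardless of how extreme the noise sample is; this boundedness is the property that clipping-based analyses typically exploit.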

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: zip
