High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

Ashok Cutkosky; Harsh Mehta

Published: 09 Nov 2021; Last Modified: 05 May 2023; NeurIPS 2021 Oral; Readers: Everyone

Keywords: stochastic optimization, heavy tails, non-convex optimization, optimization for deep learning

TL;DR: We show that combining momentum, normalization, and gradient clipping allows for high-probability convergence guarantees in non-convex stochastic optimization even in the presence of heavy-tailed gradient noise.

Abstract: We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points with high probability, at the best-known rates for smooth losses, when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which to our knowledge have not been studied in this setting, and again obtain high-probability bounds for any $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis, which requires a Hilbert space norm. Further, we show that after a suitable "burn-in" period, the objective value will monotonically decrease at every iteration until a critical point is identified, which provides intuition behind the popular practice of learning rate "warm-up" and also yields a last-iterate guarantee.
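To make the combination named in the abstract concrete, the following is a minimal sketch of one update that clips the stochastic gradient, averages it into a momentum buffer, and then takes a fixed-length step along the normalized momentum. It is an illustration under stated assumptions, not the authors' implementation: the function names (`clip_by_norm`, `step`, `run`), the hyperparameter values, the Euclidean norm, and the toy quadratic objective with Student-t noise are all hypothetical choices made here.

```python
import numpy as np


def clip_by_norm(g, tau):
    """Rescale g so its Euclidean norm is at most tau (direction unchanged)."""
    norm = np.linalg.norm(g)
    if norm <= tau:
        return g
    return g * (tau / norm)


def step(x, m, grad, lr, beta, tau, eps=1e-12):
    """One update: clip, average into the momentum, move lr along m / ||m||.

    Assumption: Euclidean norm throughout; lr, beta, tau are illustrative.
    """
    g = clip_by_norm(grad, tau)
    m_new = beta * m + (1.0 - beta) * g
    m_norm = np.linalg.norm(m_new)
    if m_norm < eps:
        # Momentum is numerically zero: no direction to normalize, so stay put.
        return x, m_new
    x_new = x - lr * m_new / m_norm
    return x_new, m_new


def run(x0, n_steps=2000, lr=0.01, beta=0.9, tau=10.0, seed=0):
    """Toy run on f(x) = 0.5 * ||x||^2 with heavy-tailed gradient noise.

    Student-t noise with df=1.5 has infinite variance but a finite
    p-th moment for every p < 1.5 (hypothetical test setup).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(n_steps):
        noisy_grad = x + rng.standard_t(df=1.5, size=x.shape)
        x, m = step(x, m, noisy_grad, lr, beta, tau)
    return x
```

Because the step is normalized, every iterate moves exactly `lr` in norm regardless of how large a single noisy gradient is; clipping limits how far one outlier can pull the momentum direction. Calling `run([5.0, -5.0])` exercises the loop on the toy problem.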

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf
