Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

ICLR 2026 Conference Submission 22316 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Convex Optimization, Heavy-Tailed Noise, Gradient Clipping
Abstract: Optimization under heavy-tailed noise has recently attracted considerable attention, since it better matches many modern machine learning tasks, as supported by empirical observations. Concretely, instead of a finite second moment on the gradient noise, a bounded $\mathfrak{p}$-th moment with $\mathfrak{p}\in\left(1,2\right]$ (say, upper bounded by $\sigma_{\mathfrak{l}}^{\mathfrak{p}}$ for some $\sigma_{\mathfrak{l}}\geq0$) has been recognized as more realistic. A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate of $\mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{\frac{1}{\mathfrak{p}}-1})$ (resp. $\mathcal{O}(\sigma_{\mathfrak{l}}^{2}\ln^{2}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in\left(0,1\right]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis of Clipped SGD and establish two rates, $\mathcal{O}(\sigma_{\mathfrak{l}}d_{\mathrm{eff}}^{-\frac{1}{2\mathfrak{p}}}\ln^{1-\frac{1}{\mathfrak{p}}}(1/\delta)T^{\frac{1}{\mathfrak{p}}-1})$ and $\mathcal{O}(\sigma_{\mathfrak{l}}^{2}d_{\mathrm{eff}}^{-\frac{1}{\mathfrak{p}}}\ln^{2-\frac{2}{\mathfrak{p}}}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2})$, that are faster than the aforementioned best-known results, where $d_{\mathrm{eff}}\geq1$ is a quantity we call the generalized effective dimension. Our analysis improves upon the existing approach in two respects: a better utilization of Freedman's inequality and finer bounds on the clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds.
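For readers unfamiliar with the clipping operation the abstract refers to, below is a minimal sketch of a generic Clipped SGD loop. The objective, the Student-t gradient-noise oracle, and the step-size/threshold values are illustrative assumptions for this sketch, not the schedules analyzed in the paper.

```python
# Minimal sketch of Clipped SGD under heavy-tailed gradient noise.
# All concrete choices (oracle, eta, tau, averaging) are illustrative, not the paper's.
import numpy as np

def clip(g, tau):
    """Scale g so its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def clipped_sgd(grad_oracle, x0, eta, tau, T, rng):
    """Run T steps of SGD, clipping each stochastic gradient before the update."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(T):
        g = grad_oracle(x, rng)       # stochastic (possibly heavy-tailed) gradient
        x = x - eta * clip(g, tau)    # clipped descent step
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)  # averaged iterate, standard for convex rates

# Toy usage: minimize 0.5*||x||^2 with heavy-tailed (Student-t, df=1.5) gradient noise,
# so the noise has a bounded p-th moment only for p < 1.5.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    oracle = lambda x, rng: x + rng.standard_t(df=1.5, size=x.shape)
    x_bar = clipped_sgd(oracle, x0=np.ones(10), eta=0.01, tau=5.0, T=10_000, rng=rng)
    print(np.linalg.norm(x_bar))
```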
Primary Area: optimization
Submission Number: 22316