Exponential Objective Decrease in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Fast Initial Convergence Explained, Gradient Descent Method Variants, $(L_0, L_1)$-Smoothness
Abstract: The gradient descent (GD) method is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history tracing back to Cauchy (1847). It has been studied under various assumptions, including the so-called $(L_0,L_1)$-smoothness, which has recently received considerable attention in the ML community. In this paper, we provide a refined convergence analysis of gradient descent and its variants under generalized smoothness. In particular, we show that $(L_0,L_1)$-GD exhibits the following behavior in the _convex setup_: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$, the algorithm shows _exponential objective decrease_, and once $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$, it attains the standard sublinear rate. Moreover, we show that this behavior is shared by its variants with different types of oracle: _Normalized Gradient Descent_ and _Clipped Gradient Descent_ (when the full gradient $\nabla f(x)$ is available); _Random Coordinate Descent_ (when a gradient component $\nabla_{i} f(x)$ is available); and _Random Coordinate Descent with Order Oracle_ (when only $\text{sign} [f(y) - f(x)]$ is available). In addition, we extend our analysis of $(L_0,L_1)$-GD to the strongly convex case. We confirm our theoretical results through numerical experiments.
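To make the two-phase behavior described in the abstract concrete, here is a minimal one-dimensional sketch. It assumes the commonly used $(L_0,L_1)$-adaptive stepsize $1/(L_0 + L_1\|\nabla f(x)\|)$; the paper's exact update rule may differ, so treat this as an illustration rather than the authors' method.

```python
import math

def l0l1_gd(grad, x0, L0, L1, steps):
    """Gradient descent with the (L0, L1)-adaptive stepsize 1/(L0 + L1*|g|).

    While |grad| >= L0/L1, the update behaves like normalized GD
    (near constant-length steps, giving the fast initial phase);
    once |grad| < L0/L1, it approaches plain GD with stepsize ~1/L0.
    """
    x = x0
    for _ in range(steps):
        g = grad(x)
        x -= g / (L0 + L1 * abs(g))
    return x

# Example: f(x) = cosh(x) is convex and (1, 1)-smooth,
# since |f''(x)| = cosh(x) <= 1 + |sinh(x)| = L0 + L1*|f'(x)|.
x_min = l0l1_gd(grad=math.sinh, x0=5.0, L0=1.0, L1=1.0, steps=50)
```

Starting far from the minimizer, the iterates move in roughly unit-length steps while $|\sinh(x)| \geq L_0/L_1 = 1$, then converge rapidly to $x = 0$ in the small-gradient regime.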
Supplementary Material: pdf
Primary Area: optimization
Submission Number: 8926