Toward a Unified Theory of Gradient Descent under Generalized Smoothness

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
Abstract: We study the classical optimization problem $\min_{x \in \mathbb{R}^d} f(x)$ and analyze the gradient descent (GD) method in both nonconvex and convex settings. It is well known that, under the $L$-smoothness assumption ($\| \nabla^2 f(x) \| \leq L$), the minimizer of the quadratic upper bound $f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \| x_{k+1} - x_k \|^2$ is $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with step size $\gamma_k = \frac{1}{L}$. Surprisingly, a similar result can be derived under the $\ell$-generalized smoothness assumption ($\| \nabla^2 f(x) \| \leq \ell( \| \nabla f(x) \| )$). In this case, we derive the step size $$\gamma_k = \int_{0}^{1} \frac{d v}{\ell( \| \nabla f(x_k) \| + \| \nabla f(x_k) \| v)}.$$ Using this step-size rule, we improve upon existing theoretical convergence rates and obtain new results in several previously unexplored setups.
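The step-size rule above only requires evaluating a one-dimensional integral of $1/\ell$ at the current gradient norm. Below is a minimal sketch of GD with this rule, not the authors' implementation: the test function $f(x) = \sum_i \cosh(x_i)$ and the bound $\ell(s) = 1 + s$ are illustrative assumptions chosen here because the Hessian of $f$ is $\mathrm{diag}(\cosh(x_i))$ and $\cosh(t) \leq 1 + |\sinh(t)|$, so $\| \nabla^2 f(x) \| \leq 1 + \| \nabla f(x) \|$ holds.

```python
import numpy as np
from scipy.integrate import quad


def generalized_step_size(grad_norm, ell):
    """gamma_k = int_0^1 dv / ell(||g_k|| + ||g_k|| * v), evaluated by quadrature."""
    value, _ = quad(lambda v: 1.0 / ell(grad_norm + grad_norm * v), 0.0, 1.0)
    return value


def gd_generalized_smooth(grad, x0, ell, n_iters=200):
    """Gradient descent with the integral step-size rule from the abstract."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        gamma = generalized_step_size(np.linalg.norm(g), ell)
        x = x - gamma * g
    return x


if __name__ == "__main__":
    # Illustrative problem (an assumption of this sketch, not from the paper):
    # f(x) = sum_i cosh(x_i), whose Hessian is diag(cosh(x_i)), so
    # ||nabla^2 f(x)|| <= 1 + ||nabla f(x)||, i.e. ell(s) = 1 + s is valid.
    ell = lambda s: 1.0 + s
    grad = np.sinh                      # gradient of sum_i cosh(x_i)
    x = gd_generalized_smooth(grad, x0=np.full(3, 2.0), ell=ell)
    print("final gradient norm:", np.linalg.norm(grad(x)))  # should be near 0
```

For the common affine choice $\ell(s) = L_0 + L_1 s$, the integral has the closed form $\gamma_k = \frac{1}{L_1 \| \nabla f(x_k) \|} \ln\!\left(1 + \frac{L_1 \| \nabla f(x_k) \|}{L_0 + L_1 \| \nabla f(x_k) \|}\right)$, so no numerical quadrature is needed in that case.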
Lay Summary: We consider the most classical and fundamental problem in optimization: minimizing a function $f$. This problem arises across a wide range of domains, including physics, economics, engineering, and, notably, machine learning (ML) and artificial intelligence (AI), where training new models reduces to optimization problems. Under the classical assumption that $f$ is $L$-smooth, such problems have been extensively studied; numerous textbooks and thousands of research papers have been devoted to them. However, this assumption is often overly restrictive: even simple functions such as $- \log x$ and $- \sqrt{1 - x}$ violate it, and many modern ML and AI problems do not satisfy $L$-smoothness. In this work, we adopt a more general $\ell$-smoothness assumption, which covers a significantly broader class of functions. Under this assumption, we develop a new optimization method, establish novel theoretical guarantees, and derive state-of-the-art convergence rates that improve upon previous results.
Primary Area: Optimization
Keywords: generalized smoothness, first-order optimization, nonconvex optimization, convex optimization
Submission Number: 1073