Gradient Descent Converges Arbitrarily Fast for Logistic Regression via Large and Adaptive Stepsizes
TL;DR: We prove that gradient descent can converge arbitrarily fast for logistic regression on linearly separable data with large and adaptive stepsizes.
Abstract: We analyze the convergence of gradient descent (GD) with large, adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk, scaled by a fixed base stepsize \eta. We prove that once the number of iterations t surpasses a margin-dependent threshold, the averaged GD iterate achieves a risk upper bound of \exp(-\Theta(\eta t)), where \eta can be chosen arbitrarily large. This implies that GD attains \emph{arbitrarily fast} convergence rates via large stepsizes, although the risk evolution might not be monotonic. In contrast, prior adaptive stepsize GD analyses require a monotonic risk decrease, limiting their rates to \exp(-\Theta(t)). We further establish a margin-dependent lower bound on the iteration complexity for any first-order method to attain a small risk, justifying the necessity of the burn-in phase in our analysis. Our results generalize to a broad class of loss functions and two-layer networks under additional assumptions.
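As an illustration of the setup, the sketch below runs GD for logistic regression on linearly separable toy data with a risk-adaptive stepsize and returns the averaged iterate, the quantity the risk bound concerns. It assumes the adaptive rule \eta_t = \eta / L(w_t) (a loss-normalized stepsize); the paper's exact scheme, constants, and burn-in handling may differ, and all function names and the toy data below are illustrative.

```python
# Minimal sketch (assumptions, not the paper's exact procedure): GD for
# logistic regression on linearly separable toy data with a stepsize that
# adapts to the current risk.  The rule eta_t = eta / L(w_t) and all
# names/data below are illustrative.
import numpy as np

def logistic_risk(w, X, y):
    """Empirical logistic risk L(w) = mean_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def risk_gradient(w, X, y):
    """Gradient of the empirical logistic risk (numerically stable)."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)) = -exp(-logaddexp(0, m))
    coeffs = -y * np.exp(-np.logaddexp(0.0, margins))
    return (X * coeffs[:, None]).mean(axis=0)

def adaptive_gd(X, y, eta=2.0, T=200):
    """GD with risk-adaptive stepsize; returns the averaged iterate."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        risk = logistic_risk(w, X, y)
        step = eta / max(risk, 1e-300)   # stepsize grows as the risk shrinks
        w = w - step * risk_gradient(w, X, y)
        w_sum += w
    return w_sum / T                      # averaged GD iterate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    y = np.sign(X[:, 0] + 1e-12)
    X[:, 0] += y                          # enforce a positive margin along e_1
    w_avg = adaptive_gd(X, y, eta=2.0, T=200)
    print("risk of averaged iterate:", logistic_risk(w_avg, X, y))
```

In this toy run, increasing the base stepsize eta makes the averaged iterate's risk fall faster per iteration, consistent with the stated \exp(-\Theta(\eta t)) bound, even though the per-iterate risk need not decrease monotonically.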
Lay Summary: We study how gradient descent (GD) performs with large, adaptive stepsizes when training logistic regression models on linearly separable data. The stepsize changes based on the current loss, scaled by a fixed base stepsize \eta. We prove that after the number of iterations t exceeds a certain threshold, the averaged iterate from GD achieves a small error bounded by \exp(-\Theta(\eta t)), where the base stepsize \eta can be chosen arbitrarily large. This implies that GD attains \emph{arbitrarily fast} convergence rates via large and adaptive stepsizes. Additionally, we show that a minimum number of iterations is necessary for any first-order optimization method to reach a small risk. Our findings also extend to many other loss functions and two-layer neural networks under extra conditions.
Primary Area: Deep Learning->Theory
Keywords: gradient descent, logistic regression, convergence, optimization
Submission Number: 14362