TL;DR: We show the statistical benefits of early stopping of GD for high-dimensional logistic regression.
Abstract: In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution---a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and the $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
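The following is a minimal numerical sketch of the phenomena described in the abstract, not the paper's construction or analysis: it runs plain GD on a synthetic well-specified logistic model with $d > n$, records an early-stopped iterate, and compares it with an $\ell_2$-regularized solution. The dimensions, step size, stopping time, and the pairing $\lambda \approx 1/(\eta t)$ below are illustrative assumptions.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

# Well-specified logistic model in the overparameterized regime (d > n).
n, d = 100, 500
theta_star = np.zeros(d)
theta_star[:10] = 3.0 / np.sqrt(10)              # ground-truth parameter
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < expit(X @ theta_star), 1.0, -1.0)

def grad(theta):
    """Gradient of the empirical logistic loss with labels in {-1, +1}."""
    margins = y * (X @ theta)
    return -(X.T @ (y * expit(-margins))) / n

# Plain GD on the unregularized loss: record an early-stopped iterate,
# then keep running to approximate the (divergent-norm) limit direction.
eta, t_early, t_long = 0.1, 200, 20_000
theta = np.zeros(d)
for t in range(t_long):
    if t == t_early:
        theta_early = theta.copy()
    theta -= eta * grad(theta)
theta_late = theta

# Explicit l2-regularized ERM, with the heuristic pairing lambda ~ 1/(eta * t).
lam = 1.0 / (eta * t_early)
theta_reg = np.zeros(d)
for _ in range(20_000):
    theta_reg -= eta * (grad(theta_reg) + lam * theta_reg)

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

print(f"||theta_early|| = {np.linalg.norm(theta_early):6.2f}   "
      f"||theta_late|| = {np.linalg.norm(theta_late):6.2f}")   # late norm keeps growing
print(f"angle(theta_early, theta_reg) = {angle(theta_early, theta_reg):.3f} rad")
print(f"angle(theta_late,  theta_reg) = {angle(theta_late,  theta_reg):.3f} rad")
```

On separable data (generic when $d > n$) the late iterate's norm grows without bound while the early-stopped iterate stays close, in norm and angle, to the explicitly regularized solution, mirroring the abstract's final claim.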
Lay Summary: In machine learning problems like overparameterized logistic regression, running the standard training method (gradient descent, or GD) for too long can lead to unreliable predictions and growing errors. This paper demonstrates that simply stopping GD training early offers significant advantages. Early-stopped GD achieves vanishing errors and good calibration, unlike models trained to completion, which can be statistically inconsistent. Furthermore, early stopping allows models to reach high accuracy with a manageable (polynomial) amount of data, whereas fully trained or interpolating models often require impractical (exponential) amounts. This research also reveals that early stopping acts as an effective form of implicit regularization, with the path taken by early-stopped GD closely mirroring models trained with the well-known $\ell_2$-regularization technique.
Primary Area: Theory->Deep Learning
Keywords: Implicit Regularization, GD, Early Stopping, Logistic Regression, Overparameterization
Submission Number: 12743