The Implicit Bias of Gradient Descent on Separable Data

Daniel Soudry; Elad Hoffer; Mor Shpigel Nacson; Nathan Srebro

15 Feb 2018 (modified: 22 Jun 2025) · ICLR 2018 Conference Blind Submission · Readers: Everyone

Abstract: We show that gradient descent on an unregularized logistic regression problem, for almost all separable datasets, converges to the same direction as the max-margin solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, and we also discuss a multi-class generalization to the cross-entropy loss. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.

TL;DR: The normalized solution of gradient descent on logistic regression (or a similarly decaying loss) slowly converges to the L2 max margin solution on separable data.

Keywords: gradient descent, implicit regularization, generalization, margin, logistic regression, loss functions, optimization, exponential tail, cross-entropy

Code: [paper-submissions/MaxMargin](https://github.com/paper-submissions/MaxMargin) + [1 community implementation](https://paperswithcode.com/paper/?openreview=r1q7n9gAb)

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/the-implicit-bias-of-gradient-descent-on/code)
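
Illustration: the abstract's main claim can be checked numerically on a toy problem. The sketch below is not taken from the linked repositories; it is a minimal NumPy example under stated assumptions. The three-point dataset, the step size `lr=0.1`, the step count, and the names `run_gd`, `gradient`, and `w_hat` are all hypothetical choices made for illustration. The data are chosen so that the hard-margin SVM solution through the origin can be verified by hand to be `w_hat = (1, 0)`, which lets us compare the direction of the gradient descent iterate against it.

```python
import numpy as np

# Hypothetical toy dataset: linearly separable through the origin (no bias term).
X = np.array([[1.0, 2.0], [-1.0, 1.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
Z = X * y[:, None]  # rows are y_i * x_i, so the margin of point i is Z[i] @ w

# Hard-margin SVM solution for this dataset, derived by hand:
# minimize ||w||^2 subject to Z @ w >= 1 gives w_hat = (1, 0),
# with the first two points as support vectors (margins exactly 1).
w_hat = np.array([1.0, 0.0])


def logistic_loss(w):
    """Unregularized logistic loss, summed over the dataset."""
    return np.sum(np.logaddexp(0.0, -Z @ w))


def gradient(w):
    """Gradient of the summed logistic loss with respect to w."""
    margins = Z @ w
    weights = 1.0 / (1.0 + np.exp(margins))  # sigmoid(-margin)
    return -(Z * weights[:, None]).sum(axis=0)


def run_gd(steps, lr=0.1):
    """Plain gradient descent from w = 0 with a fixed step size."""
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * gradient(w)
    return w


w = run_gd(50_000)
cosine = float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))

print("loss:", logistic_loss(w))
print("norm of w:", np.linalg.norm(w))
print("cosine with max-margin direction:", cosine)
```

On separable data the loss has no finite minimizer, so the norm of `w` keeps growing while the loss shrinks; what settles is the direction `w / ||w||`. Rerunning `run_gd` with more steps should move `cosine` closer to 1, but only slowly, which matches the logarithmic rate described in the abstract.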
