Keywords: Implicit bias, line-search, Polyak step size
TL;DR: Max-margin convergence rate of gradient descent with Polyak and line-search step sizes on separable data
Abstract: Recent works have shown that Polyak and line-search step sizes are effective for training deep neural networks. However, a theoretical understanding of their generalization performance is lacking. For overparameterized models, multiple solutions can generalize differently to unseen data despite all achieving zero training error. Given this, a natural question is whether an algorithm inherently prefers (without explicit regularization) certain simple solutions over others upon convergence, a phenomenon known as implicit bias/regularization. In this work, we characterize the implicit bias of gradient descent with Polyak and line-search step sizes in linear classification with the logistic or cross-entropy loss. Since these step sizes adapt to the local smoothness of the loss, we prove that the margin of their iterates converges to the maximum $l_2$-norm margin at an $\tilde{O}(\frac{1}{T})$ rate. In contrast to other adaptive step sizes that achieve the same rate [7] (also known as normalized gradient descent, NGD), line-search and Polyak step sizes do not depend on problem-specific constants that may not be accessible. Another subtle issue is that NGD can diverge on common losses with non-separable data, whereas line-search converges because it guarantees descent on the function value at each iteration. Finally, our analysis extends the game framework of Wang et al. [26] to logistic/cross-entropy losses.
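As a rough illustration only (not code from the paper), the sketch below runs gradient descent with the Polyak step size on the logistic loss for a synthetic linearly separable dataset, using the fact that the loss infimum is 0 in the separable case, and tracks the normalized $l_2$ margin of the iterates. The dataset, dimensions, and iteration count are arbitrary choices made for this example.

```python
# Minimal sketch (assumption: toy data, not the authors' setup) of gradient descent
# with the Polyak step size on the logistic loss, tracking the normalized l2 margin.
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: labels y_i = sign(<w*, x_i>) in {-1, +1}.
n, d = 100, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

def logistic_loss(w):
    # mean_i log(1 + exp(-y_i <w, x_i>)), computed stably via logaddexp.
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def logistic_grad(w):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_i)) = -y_i x_i * sigmoid(-m_i); stable sigmoid via logaddexp.
    coef = -y * np.exp(-np.logaddexp(0.0, margins))
    return (coef[:, None] * X).mean(axis=0)

def normalized_margin(w):
    return np.min(y * (X @ w)) / (np.linalg.norm(w) + 1e-12)

w = np.zeros(d)
for t in range(2000):
    g = logistic_grad(w)
    # Polyak step size: (f(w_t) - f*) / ||grad f(w_t)||^2, with f* = 0 on separable data.
    eta = logistic_loss(w) / (g @ g + 1e-12)
    w = w - eta * g

print("normalized l2 margin of the final iterate:", normalized_margin(w))
```

One could track `normalized_margin(w)` across iterations to observe its approach to the maximum $l_2$-norm margin; a backtracking line search (e.g., Armijo) could replace the Polyak step size in the same loop.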
Submission Number: 26