Large Learning Rate Tames Homogeneity: Convergence and Balancing EffectDownload PDF

Published: 28 Jan 2022, Last Modified: 13 Feb 2023ICLR 2022 PosterReaders: Everyone
Keywords: large learning rate, gradient descent, matrix factorization, implicit regularization, convergence, balancing, alignment
Abstract: Recent empirical advances show that training deep models with large learning rate often improves generalization performance. However, theoretical justifications on the benefits of large learning rate are highly limited, due to challenges in analysis. In this paper, we consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem, i.e., $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$. We prove a convergence theory for constant large learning rates well beyond $2/L$, where $L$ is the largest eigenvalue of Hessian at the initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed `balancing', meaning that magnitudes of $X$ and $Y$ at the limit of GD iterations will be close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory.
One-sentence Summary: Large learning rate well beyond 2/L provably induces an implicit regularization effect of balancing in gradient descent for matrix factorization.
Supplementary Material: zip
18 Replies