On the Role of Momentum in the Implicit Bias of Gradient Descent for Diagonal Linear Networks

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: GD, momentum, implicit bias, linear networks
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We investigate the implicit bias of momentum-based methods for diagonal linear networks.
Abstract: Momentum is a widely adopted and crucial modification to gradient descent when training modern deep neural networks. In this paper, we focus on the regularization effect of momentum-based methods in regression settings and analyze a popular proxy model, diagonal linear networks, to precisely characterize the implicit bias of heavy-ball momentum (HB) and Nesterov's accelerated gradient method (NAG). We show that HB and NAG exhibit a different implicit bias from GD for diagonal linear networks, in contrast to the classical linear regression problem, where momentum-based methods share the same implicit bias as GD. Specifically, the role of momentum in the implicit bias of GD is twofold. On the one hand, HB and NAG induce an extra initialization-mitigation effect, similar to that of SGD, which benefits generalization in sparse regression. On the other hand, beyond the initialization of the parameters, the implicit regularization of HB and NAG also depends explicitly on the initialization of the gradients, which may not be benign for generalization. Consequently, whether HB and NAG generalize better than GD depends jointly on these two effects, which are determined by quantities such as the learning rate, the momentum factor, the data matrix, and the integral of the gradients. In particular, the difference between the implicit bias of GD and that of HB and NAG vanishes for small learning rates. Our findings highlight the potentially beneficial role of momentum and help explain its practical advantages from the perspective of generalization.
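To make the setting concrete, the sketch below implements the standard HB and NAG updates for a diagonal linear network (beta = u * v) on a toy sparse-regression problem. This is an illustrative example, not the paper's code: the hyperparameters (eta, gamma, the init scale alpha, and the problem sizes) are arbitrary choices for the demo, and setting gamma = 0 recovers plain GD.

```python
import numpy as np

def loss_and_grads(u, v, X, y):
    """Squared loss of the diagonal linear network beta = u * v, and its gradients."""
    n = X.shape[0]
    r = X @ (u * v) - y          # residuals
    g = X.T @ r / n              # gradient w.r.t. beta
    return 0.5 * np.mean(r**2), g * v, g * u   # chain rule: dL/du = g*v, dL/dv = g*u

def train(X, y, alpha=0.5, eta=0.02, gamma=0.9, steps=5000, nesterov=False):
    """Heavy-ball if nesterov=False, NAG if nesterov=True; gamma=0 gives GD."""
    d = X.shape[1]
    u, v = alpha * np.ones(d), alpha * np.ones(d)
    du, dv = np.zeros(d), np.zeros(d)            # previous displacements
    for _ in range(steps):
        if nesterov:
            # NAG: gradient is evaluated at the look-ahead point
            _, gu, gv = loss_and_grads(u + gamma * du, v + gamma * dv, X, y)
        else:
            # HB: gradient is evaluated at the current point
            _, gu, gv = loss_and_grads(u, v, X, y)
        du, dv = gamma * du - eta * gu, gamma * dv - eta * gv
        u, v = u + du, v + dv
    return u * v

# Toy noiseless sparse-regression instance (demo data, fixed seed)
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
beta_true = np.zeros(10)
beta_true[:2] = 1.0
y = X @ beta_true
```

With this setup one can compare `train(X, y)` (HB), `train(X, y, nesterov=True)` (NAG), and `train(X, y, gamma=0.0)` (GD) and inspect how close each recovered `u * v` is to the sparse ground truth.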
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2200