Adagrad Promotes Diffuse Solutions In Overparameterized Regimes

Published: 26 Oct 2023, Last Modified: 13 Dec 2023, NeurIPS 2023 Workshop Poster
Keywords: Optimization, Adagrad, Over-parameterization
TL;DR: We present numerical evidence that the solutions produced by Adagrad in over-parameterized least squares, with a sufficiently small step size, have entries that are close in magnitude.
Abstract: With the widespread use of over-parameterized models in deep learning, the choice of optimizer during training plays a significant role in a model's generalization ability due to its solution selection bias. This work focuses on the adaptive gradient optimizer Adagrad in the over-parameterized least-squares regime. We empirically find that, with a sufficiently small step size, Adagrad promotes diffuse solutions, in the sense of uniformity among the coordinates of the solution. Additionally, we show theoretically that under the same conditions, Adagrad's solution is more diffuse than the solution obtained by gradient descent (GD), by analyzing the ratio of their updates. Lastly, we empirically compare the performance of Adagrad and GD on generated datasets and observe a consistent trend: Adagrad promotes more diffuse solutions, in line with our theoretical analysis.
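The effect described in the abstract can be reproduced in a small illustrative experiment. The sketch below (an assumption about the setup, not the paper's actual code) runs plain GD and Adagrad with a small step size on a randomly generated over-parameterized least-squares problem, then compares a simple diffuseness proxy, the ratio $\|x\|_1 / (\sqrt{d}\,\|x\|_2)$, which equals 1 when all coordinates have identical magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # over-parameterized: more unknowns (d) than equations (n)
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad(x):
    # Gradient of the least-squares loss (1/2n) * ||Ax - b||^2
    return A.T @ (A @ x - b) / n

def run(optimizer, steps=50_000, lr=1e-2):
    x = np.zeros(d)
    accum = np.zeros(d)  # Adagrad's per-coordinate squared-gradient accumulator
    for _ in range(steps):
        g = grad(x)
        if optimizer == "adagrad":
            accum += g**2
            x -= lr * g / (np.sqrt(accum) + 1e-10)
        else:  # plain gradient descent
            x -= lr * g
    return x

def diffuseness(x):
    # 1.0 means all coordinates share the same magnitude; smaller = spikier
    return np.linalg.norm(x, 1) / (np.sqrt(len(x)) * np.linalg.norm(x, 2))

x_gd = run("gd")
x_ada = run("adagrad")
print(f"GD diffuseness:      {diffuseness(x_gd):.3f}")
print(f"Adagrad diffuseness: {diffuseness(x_ada):.3f}")
```

Both optimizers start from zero and interpolate the data; the diffuseness score then isolates the implicit bias of the update rule rather than fit quality.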
Submission Number: 37