Keywords: Machine Learning, Optimization, Adaptive Gradient Descent, Initialization
TL;DR: In this work, we identify the standard zero initialization of the second-order moment as a key limitation and propose simple non-zero initialization strategies.
Abstract: Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models.
In this work, we identify the standard initialization of the second-order moment estimate ($v_0 = 0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimate with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also improves the final performance of adaptive gradient optimizers.
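The abstract does not spell out the update rule, so the sketch below is a minimal, hypothetical illustration of an Adam-style loop whose second-moment buffer starts from either a data-driven estimate (squared gradients at the initial point) or small positive random values instead of zero. The function name, hyperparameters, and the handling of bias correction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adam_with_nonzero_v0(grad_fn, w0, steps=1000, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8,
                         v0_mode="data"):
    """Minimal Adam loop with a non-zero second-moment initialization (sketch).

    grad_fn(w) -> gradient of the loss at w (NumPy array).
    v0_mode: "data"   -> v0 from squared gradients at the initial point
             "random" -> v0 from small positive random values
             "zero"   -> standard Adam initialization (baseline)
    """
    w = w0.copy()
    m = np.zeros_like(w)                      # first-moment estimate (standard)

    g0 = grad_fn(w)
    if v0_mode == "data":
        v = g0 ** 2                           # data-driven: initial gradient statistics
    elif v0_mode == "random":
        rng = np.random.default_rng(0)
        v = rng.uniform(1e-6, 1e-2, size=w.shape)  # small positive random values
    else:
        v = np.zeros_like(w)                  # standard (zero) initialization

    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction as in standard Adam
        # With a non-zero v0 the buffer is not biased toward zero, so the v
        # bias correction is skipped here; this choice is an assumption and
        # may differ from the paper's formulation.
        v_hat = v / (1 - beta2 ** t) if v0_mode == "zero" else v
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
if __name__ == "__main__":
    w_final = adam_with_nonzero_v0(grad_fn=lambda w: w,
                                   w0=np.ones(4), v0_mode="data")
    print(w_final)
```

The only change relative to standard Adam in this sketch is how `v` is seeded before the first update; the first- and second-moment recursions themselves are unchanged.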
Submission Number: 74