Asymmetric Momentum: A Rethinking of Gradient Descent

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: optimization
Keywords: Gradient Descent, Optimizer, Machine Learning
TL;DR: A Revolution of Gradient Descent.
Abstract: All existing adaptive methods, such as Adam, penalize frequently-changing parameters and are applicable only to sparse gradients. In contrast, and supported by theoretical and experimental validation, we propose the simplest enhancement of SGD, Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the training process into different loss phases and use a different momentum in each. LCAM can accelerate slow-changing parameters for sparse gradients, as adaptive optimizers do, but it can also accelerate frequently-changing parameters for non-sparse gradients, which makes it adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and we experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset. Interestingly, we observe that for non-sparse gradients, frequently-changing parameters should actually be accelerated, which is the opposite of the traditional adaptive perspective. Compared to traditional SGD with momentum, this algorithm separates the weights without additional computational cost. Notably, the method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks (WRN) for our research, employing the classic CIFAR-10 and CIFAR-100 datasets to test the ability for feature separation, and we draw conclusions about phenomena that are much more important than accuracy alone. Finally, compared to classic SGD tuning methods, using WRN on these two datasets with nearly half the training epochs, we achieve equal or better test accuracy.
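The abstract does not spell out the update rule, so the following is only a minimal sketch of the general idea under stated assumptions: heavy-ball SGD in which the momentum coefficient is chosen at each step by comparing the current loss with the running average of the loss. The function name `lcam_sgd`, the parameters `mu_high` and `mu_low`, the choice of which phase receives the larger momentum, and the toy quadratic objective are all hypothetical and are not taken from the paper.

```python
def quad_loss(w):
    # Toy objective (assumption): an ill-conditioned quadratic bowl.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def quad_grad(w):
    # Exact gradient of quad_loss.
    return [w[0], 10.0 * w[1]]


def lcam_sgd(loss_fn, grad_fn, w, lr=0.05, mu_high=0.95, mu_low=0.9, steps=200):
    """Heavy-ball SGD whose momentum switches on loss vs. its running mean."""
    velocity = [0.0] * len(w)
    loss_sum = 0.0
    history = []  # (loss, momentum used) for each step
    for t in range(1, steps + 1):
        loss = loss_fn(w)
        loss_sum += loss
        mean_loss = loss_sum / t  # running average of all losses so far
        # Assumption: the above-average loss phase gets mu_high and the
        # other phase gets mu_low; the paper may assign them differently.
        mu = mu_high if loss > mean_loss else mu_low
        g = grad_fn(w)
        velocity = [mu * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
        history.append((loss, mu))
    return w, history


w_final, history = lcam_sgd(quad_loss, quad_grad, [1.0, 1.0])
print(quad_loss(w_final), sorted({mu for _, mu in history}))
```

Because the only extra state is a scalar running sum of the loss, a scheme of this shape adds essentially no cost over plain SGD with momentum, which is consistent with the abstract's claim of no additional computational cost.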
Supplementary Material: zip
Submission Number: 5817