Efficient Gradient-Based Algorithm for Training Deep Learning Models With Many Nonlinear Activations
Keywords: deep learning, optimization, deep learning theory, deep neural network
TL;DR: This paper presents a novel gradient-descent-based approach for training deep learning models with many nonlinear activations, providing both theoretical guarantees and promising experimental results.
Abstract: This paper presents a novel algorithm for training deep neural networks with many nonlinear layers (e.g., 30). The method is based on backpropagation of an approximated gradient averaged over the range of a weight update. We prove that, unlike the ordinary gradient, the average gradient of the loss function provides more accurate information about the change in loss caused by the associated parameter update, and may therefore be exploited to improve learning. In our implementation, an efficient approximation of the average gradient is paired with RMSProp and compared with the standard gradient-based approach. For the tested deep model with many stacked fully connected layers and nonlinear activations, trained on MNIST and Fashion MNIST, the presented algorithm: $\quad$ (a) generalizes better, at least within a reasonable number of epochs,$\quad$ (b) given an optimal implementation, would require less training computation time than gradient-based RMSProp, with the memory requirement of the Adam optimizer,$\quad$ (c) performs well over a broader range of learning rates, and may therefore save time and energy by reducing hyperparameter searches,$\quad$ (d) improves sample efficiency roughly threefold as measured by median training losses. By comparison, for a deep sequential convolutional model trained on the IMDB dataset, sample efficiency improves by about 55%. For the tested shallow model, however, the method performs approximately the same as gradient-based RMSProp in terms of both training and test loss. The source code is provided at [...].
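As a brief reading aid for the claim above (our notation, not reproduced from the paper): the gradient averaged over the update segment captures the realized loss change exactly, whereas the ordinary gradient only yields a first-order estimate,
$$L(w+\Delta w)-L(w)\;=\;\Big(\int_0^1 \nabla L(w+t\,\Delta w)\,\mathrm{d}t\Big)^{\!\top}\Delta w\;=\;\bar{g}^{\top}\Delta w, \qquad \text{whereas} \qquad L(w+\Delta w)-L(w)\;\approx\;\nabla L(w)^{\top}\Delta w .$$
Per the abstract, the algorithm backpropagates an efficient approximation of this average gradient $\bar{g}$ and supplies it to RMSProp in place of $\nabla L(w)$.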
Supplementary Material: zip
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3633