Reducing the Need for Backpropagation and Discovering Better Optima With Explicit Optimizations of Neural Networks

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: optimization, neural network training, deep learning, backpropagation, natural language processing, image classification, computer vision
TL;DR: Optimizes single- and multi-layer neural networks for both language modeling and image classification without the use of iterative, gradient-based approximation.
Abstract: Iterative differential approximation methods that rely upon backpropagation have enabled the optimization of neural networks; however, at present, they remain computationally expensive, especially when training models at scale. In this paper, we propose a computationally efficient alternative for optimizing neural networks that can both reduce the costs of scaling neural networks and provide high-efficiency optimizations for low-resource applications. We derive an explicit solution to a simple feed-forward language model by analyzing the continuous bag-of-words (CBOW) form of word2vec. We conjecture that CBOW's optimized parameters are proportional to an existing explicit solution to word2vec's \textit{skip-gram} form, and find that only a slight modification is needed to convert the skip-gram optimization into the CBOW optimization. This solution generalizes from CBOW to the class of \textit{all} single-layer feed-forward softmax-activated neural models trained on positive-valued features, as demonstrated by our application of this solution, without additional analysis, across both MNIST digit classification and language modeling. We perform computational experiments that demonstrate the near-optimality of our solution by showing that 1) iterative optimization only marginally improves the explicit solution's parameters and 2) randomly initialized parameters iteratively optimize towards the explicit solution's loss. We finally apply the explicit solution locally to multi-layer networks and discuss how the solution's computational savings increase with model complexity and, moreover, how better optima that \textit{cannot} be achieved by backpropagation alone are discovered \textit{only after} the multi-layer warm-start is applied.
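
The abstract does not state the closed form itself, so the following is only a minimal sketch of the general idea under our own assumptions: for a single-layer softmax classifier trained on positive-valued (count-like) features, one well-known explicit solution is a multinomial naive-Bayes / PMI-style log-frequency weight matrix, which can be compared against gradient descent from a random initialization in the spirit of the paper's experiments 1) and 2). The synthetic data, the particular closed form, and all hyperparameters below are illustrative assumptions, not the paper's derivation.

```python
# Sketch: explicit (closed-form) warm start vs. iterative gradient descent
# for a single-layer softmax model on positive-valued features.
# The closed form used here (log class-conditional feature frequencies,
# i.e., a naive-Bayes / PMI-style estimate) is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_feats, n_samples = 5, 50, 2000

# Synthetic positive-valued features: each class has its own feature profile.
profiles = rng.gamma(2.0, 1.0, size=(n_classes, n_feats))
y = rng.integers(0, n_classes, size=n_samples)
X = rng.poisson(profiles[y]).astype(float)  # nonnegative counts

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, b):
    p = softmax(X @ W.T + b)
    return -np.log(p[np.arange(n_samples), y] + 1e-12).mean()

# Explicit warm start: PMI-style log-ratio of class-conditional feature mass.
eps = 1e-3
class_mass = np.stack([X[y == c].sum(axis=0) for c in range(n_classes)]) + eps
p_feat_given_class = class_mass / class_mass.sum(axis=1, keepdims=True)
p_feat = X.sum(axis=0) + eps
p_feat = p_feat / p_feat.sum()
W_explicit = np.log(p_feat_given_class) - np.log(p_feat)
b_explicit = np.log(np.bincount(y, minlength=n_classes) / n_samples)

# Baseline: plain gradient descent on the cross-entropy from random weights.
W = 0.01 * rng.standard_normal((n_classes, n_feats))
b = np.zeros(n_classes)
lr = 0.05
for _ in range(300):
    p = softmax(X @ W.T + b)
    grad_logits = p
    grad_logits[np.arange(n_samples), y] -= 1.0
    grad_logits /= n_samples
    W -= lr * grad_logits.T @ X
    b -= lr * grad_logits.sum(axis=0)

print(f"explicit warm-start loss : {cross_entropy(W_explicit, b_explicit):.4f}")
print(f"300-step GD loss         : {cross_entropy(W, b):.4f}")
```

Printing both losses mirrors the paper's two checks on a toy scale: how much iterative optimization improves on the explicit parameters, and how close random-initialization training gets to the explicit solution's loss. The same matrix could also serve as a warm start for further fine-tuning, analogous to the multi-layer warm-start discussed in the abstract.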
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7828