Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: optimization, neural network, language model, self-attention mechanism, backpropagation
TL;DR: We mathematically derive an explicit solution for training transformers with "plug and chug" computations.
Abstract: Iterative differential approximation methods that rely upon backpropagation have enabled the optimization of neural networks; however, they remain computationally expensive, especially when training models at scale. In this paper, we present a computationally efficient alternative for optimizing neural networks that can both reduce the costs of scaling neural networks and provide high-efficiency optimizations for low-resource applications. We first derive a general result about feed-forward neural networks, then extend this solution to compositional (multi-layer) networks, and finally apply it to a simplified transformer block containing both feed-forward and self-attention layers. These developments allow us to train highly specified and complex multi-layer architectures built from what we refer to descriptively as self-attentive feed-forward unit (SAFFU) layers, which we use to develop a hyper-efficient single-block transformer that generalizes from only a small, cognitively feasible volume of data. Our results demonstrate that explicitly solved models substantially outperform those optimized by backpropagation alone, and moreover that applying backpropagation after an explicit solution discovers better optima from smaller scales of data; i.e., warm starting models with their explicit solutions enables training highly performant models from much less data. Using the efficiency and consistency of the SAFFU's explicit solution, we carry out ablation experiments training a roadmap of approximately 500 single-block transformer models over $1$ million tokens each, to determine ideal hyperparameterizations for the SAFFU-based transformer. We find that multiple architectural variants of the SAFFU transformer yield highly performant and well-generalized models. Most critically, this ablation reveals that some of the most performant models are not the most highly parameterized. In other words, we find not only that generalized models can be reached more efficiently (using less data) with explicit solutions, but also that the architectural exploration afforded by their efficiency pays dividends in the search for leaner architectures, containing fewer parameters, which could be incorporated into low-resource hardware where AI might be embodied.
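To make the warm-start idea concrete, the following is a minimal sketch, not taken from the submission, of solving a single feed-forward (linear) layer in closed form and then refining it with a few gradient steps. The toy data, shapes, ridge term `lam`, and squared-error objective are all illustrative assumptions; the paper's actual SAFFU derivation is not reproduced here.

```python
# Minimal sketch (not the paper's derivation): warm-start a single feed-forward
# layer with an explicit least-squares solution, then refine it with a few
# gradient-descent ("backpropagation") steps on the same loss.
# All shapes, the ridge term `lam`, and the MSE objective are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 512, 32, 8                      # toy data sizes (assumed)
X = rng.normal(size=(n, d_in))
W_true = rng.normal(size=(d_in, d_out))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_out))

# --- explicit ("plug and chug") solution: ridge-regularized least squares ---
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(d_in), X.T @ Y)

def mse(W):
    return float(np.mean((X @ W - Y) ** 2))

print("loss after explicit solve:", mse(W))

# --- optional backprop-style refinement starting from the warm start ---
lr = 1e-3
for _ in range(100):
    grad = (2.0 / n) * X.T @ (X @ W - Y)         # gradient of the MSE loss
    W -= lr * grad

print("loss after fine-tuning:   ", mse(W))
```

In this toy setting the closed-form solve already lands near an optimum, so the subsequent gradient steps only polish it, which mirrors the abstract's claim that backpropagation applied after an explicit solution finds better optima from less data.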
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6161