Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0.
Keywords: Transformer; Associative Memory; Energy-Based Model
TL;DR: We introduce a novel approach to examining the behavior of Transformers by leveraging the framework of associative memory.
Abstract: Increasing the size of a Transformer does not always lead to better performance, a phenomenon that the empirical scaling laws cannot explain. Moreover, a model's improved performance is closely tied to its memorization of the training samples. We present a theoretical framework that sheds light on memorization during the pre-training of transformer-based language models. We model the behavior of Transformers as associative memories using Hopfield networks, so that each transformer block effectively performs an approximate nearest-neighbor search. Building on this, we use a distance-based energy function to approximate the one in the modern continuous Hopfield network, which yields an insightful interpretation of the attention mechanism. Since the softmax function corresponds to the gradient of the LogSumExp function in the energy, we apply the majorization-minimization technique to construct a global energy function that captures the layered architecture. We show a dependency between the model size and the dataset size for the model to attain optimal performance, and that the achievable cross-entropy loss is bounded from below.
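For context, the following sketch recalls the standard energy of the modern continuous Hopfield network (as in Ramsauer et al., 2020) and why its minimization reproduces attention; the notation here (X the matrix of stored patterns, \xi the query state, \beta the inverse temperature) is ours, not necessarily the paper's:

E(\xi) = -\mathrm{lse}\bigl(\beta, X^{\top}\xi\bigr) + \tfrac{1}{2}\,\xi^{\top}\xi + \mathrm{const},
\qquad
\mathrm{lse}(\beta, z) = \beta^{-1}\log\sum_{i=1}^{N}\exp(\beta z_i).

Because \nabla_{z}\,\mathrm{lse}(\beta, z) = \mathrm{softmax}(\beta z), one concave-convex (majorization-minimization) step gives the retrieval update \xi^{\mathrm{new}} = X\,\mathrm{softmax}\bigl(\beta X^{\top}\xi\bigr), which has the same form as a single attention head with the stored patterns serving as keys and values and \xi as the query. The paper's distance-based energy and its global layered energy build on this correspondence.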
Supplementary Material: zip
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5492