Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

ICLR 2025 Conference Submission 5492 Authors

26 Sept 2024 (modified: 02 Dec 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Transformer; Associative Memory; Energy-Based Model
TL;DR: We introduce a novel approach to examining the behavior of Transformers by leveraging the framework of associative memory.
Abstract: Increasing the size of a Transformer does not always lead to better performance, a phenomenon that the empirical scaling laws cannot explain. Moreover, a model's improved performance is closely associated with its memorization of the training samples. We present a theoretical framework that sheds light on memorization during the pre-training of transformer-based language models. We model the behavior of Transformers as associative memories using Hopfield networks, such that each transformer block effectively performs an approximate nearest-neighbor search. Building on this, we use a distance-based energy function to approximate the energy of the modern continuous Hopfield network, which provides an insightful explanation of the attention mechanism. Since the softmax function corresponds to the gradient of the LogSumExp term in this energy, we apply the majorization-minimization technique to construct a global energy function that captures the layered architecture. We show a dependency between the model size and the dataset size required for the model to attain optimal performance, and we show that the achievable cross-entropy loss is bounded from below.
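Background note (not taken from the submission itself): the softmax–LogSumExp connection referenced in the abstract is the standard identity below, and the energy/update pair follows the modern continuous Hopfield network formulation popularized by Ramsauer et al. (2020). The symbols $\beta$, $\xi$, and $X$ are notation introduced here for illustration and need not match the paper's exact construction.

\[
\mathrm{lse}(\beta, z) = \beta^{-1} \log \sum_{i=1}^{N} \exp(\beta z_i),
\qquad
\nabla_z\, \mathrm{lse}(\beta, z) = \mathrm{softmax}(\beta z).
\]

\[
E(\xi) = -\,\mathrm{lse}\!\big(\beta, X^{\top}\xi\big) + \tfrac{1}{2}\,\xi^{\top}\xi + \mathrm{const},
\qquad
\xi^{\mathrm{new}} = X\,\mathrm{softmax}\!\big(\beta X^{\top}\xi\big),
\]

where $\xi$ is the query (state) vector and the columns of $X$ are the stored patterns; one update step of this energy coincides with a single attention operation, which is the sense in which an energy function "explains" attention.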
Supplementary Material: zip
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5492