Memory Layers at Scale

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Provides scaling laws for LLMs with memory layers, showing improved performance and inference speed at similar training FLOPs and memory cost compared to MoE layers.
Abstract: Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-experts models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained on 1 trillion tokens, comparing to base models with up to 8B parameters.
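To make the mechanism in the abstract concrete, the sketch below shows one way a sparsely activated key-value memory layer can work: each token's hidden state is projected to a query, scored against a large table of trainable keys, and only the top-k matching value rows are gathered and mixed back in. This is a minimal illustrative sketch, not the authors' implementation; the class name, hyperparameters, and the brute-force scoring over all keys are assumptions (the released code at https://github.com/facebookresearch/memory may instead use a factorized lookup to avoid scoring every key).

```python
# Minimal sketch of a sparsely activated key-value memory layer.
# Hypothetical names and defaults; not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model: int, num_keys: int = 65536, topk: int = 32):
        super().__init__()
        self.topk = topk
        self.query_proj = nn.Linear(d_model, d_model)
        # Trainable keys; at very large num_keys the real system would need a
        # factorized (e.g. product-key) lookup instead of this full score matrix.
        self.keys = nn.Parameter(torch.randn(num_keys, d_model) * d_model ** -0.5)
        # Values hold most of the extra parameters; only `topk` rows are read
        # per token, so per-token FLOPs stay nearly flat as num_keys grows.
        self.values = nn.Embedding(num_keys, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)                          # (B, S, D)
        scores = q @ self.keys.t()                      # (B, S, num_keys)
        topk_scores, topk_idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)        # (B, S, k)
        selected = self.values(topk_idx)                # (B, S, k, D)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)
```

The key property this illustrates is the one the abstract claims: parameter count scales with `num_keys`, while compute per token is dominated by the query projection and the top-k gather rather than by the size of the value table.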
Lay Summary: Modern language models are getting larger and more expensive to run, as they rely heavily on complex computations to generate accurate responses. We asked whether there’s a smarter way to improve performance—without making models significantly slower or more costly. We introduced a new type of model component called a memory layer, which helps the model “remember” useful information using a lookup system, similar to how a person might check notes instead of trying to recall everything from memory. These layers don’t require much extra computation and can be natively integrated with existing parts of the model. We tested memory layers at a much larger scale than before and found they worked especially well on fact-based tasks, like recalling knowledge or answering trivia. In many cases, models with memory layers outperformed others that used over twice the computational power. To help the community, we’ve released an efficient version of this memory system and show how it can scale to billions of memory entries—paving the way for faster, smarter AI systems.
Link To Code: https://github.com/facebookresearch/memory
Primary Area: Deep Learning->Large Language Models
Keywords: memory, scaling laws, llm, moe, factuality
Submission Number: 7494