In-Context Denoising with One-Layer Transformers: Connections between Attention and Associative Memory Retrieval
TL;DR: We show that one-layer transformers perform optimal in-context denoising through a single step of context-dependent associative memory inference.
Abstract: We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
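The core mechanism referenced in the abstract builds on the known equivalence from Ramsauer et al.: one gradient-descent step (with unit step size) on the log-sum-exp dense associative memory energy, starting from the query token, reproduces the softmax attention readout over the context tokens. The sketch below illustrates that equivalence numerically; the function names and the inverse-temperature parameter `beta` are illustrative choices, not the paper's exact trained model.

```python
import numpy as np

def dam_energy(xi, X, beta):
    """Log-sum-exp DAM energy (Ramsauer et al. form).

    Rows of X are the stored memories (context tokens); xi is the state
    (query token); beta is an inverse temperature.
    """
    s = beta * (X @ xi)
    lse = (s.max() + np.log(np.exp(s - s.max()).sum())) / beta  # stable log-sum-exp
    return -lse + 0.5 * xi @ xi

def one_step_update(xi, X, beta):
    """One gradient-descent step (step size 1) on dam_energy, from state xi."""
    s = beta * (X @ xi)
    p = np.exp(s - s.max())
    p /= p.sum()               # softmax(beta * X @ xi): attention weights
    grad = xi - X.T @ p        # gradient of the DAM energy at xi
    return xi - grad           # equals X.T @ p, i.e. the attention readout
```

With unit step size the update collapses to `X.T @ p`, exactly a single softmax attention operation with the context tokens as keys and values, and it is guaranteed not to increase the DAM energy (the update is a concave-convex procedure step).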
Lay Summary: Large language models can often solve new tasks after seeing only a few examples in their prompt, a skill known as *in-context learning*. However, researchers still don’t fully understand how this process works under the hood.
In our work, we studied a version of this problem where a model is asked to recover a clean example from a corrupted one, using similar, uncorrupted examples as context. We showed that even a very simple model—a transformer with just one attention layer—can do this remarkably well. In fact, its behavior closely matches that of certain mathematical models of memory retrieval, revealing a surprising connection between in-context learning and associative memory.
This connection helps explain why transformer models are so adaptable. It builds on earlier work linking attention and memory, and shows that the same ideas account for how pre-trained models adapt to new tasks using only examples in their input. Our results offer a clearer picture of how these systems generalize, and could inform the design of simpler, more interpretable machine learning systems.
Link To Code: https://github.com/mattsmart/in-context-denoising
Primary Area: Deep Learning->Attention Mechanisms
Keywords: attention, in-context learning, denoising, associative memory, Hopfield network, transformers
Submission Number: 9353