Keywords: Attention-Based Models, In-Context Learning, Continual Learning, Bayesian Inference, Metaplasticity.
TL;DR: A Bayesian continual-learning framework for attention-based models balances plasticity and stability to achieve superior in-context learning.
Abstract: The ability to learn during inference, i.e., in-context learning (ICL), is a core feature of self-attention in transformers. ICL acts as an online associative memory and is believed to underpin transformers' capabilities in complex sequence-processing tasks.
In some cases, ICL has been shown to simulate online gradient descent on a local loss function over the input sequence.
In this work, we view ICL as a continual learning problem that may suffer from memory interference and therefore requires a solution to the plasticity--stability dilemma.
We examine the memory consolidation properties of ICL and propose a Bayesian continual learning framework that resolves this dilemma, leading to a new attention model.
Our framework builds on the idea of metaplasticity from neuroscience, in which the plasticity of each synapse is tied to an importance measure grounded in a Bayesian prior distribution that captures previously learned knowledge.
Our approach explains several gated linear attention models in the literature, identifying their respective assumptions from a Bayesian learning perspective.
Furthermore, our Bayesian continual learning approach provides a principled treatment of forgetting, enabling the design of attention layers with a desired memory horizon.
Our experiments achieve competitive performance on synthetic benchmarks. Additionally, we evaluate on several commonsense reasoning benchmarks, where small models benefit from consolidated synapses and outperform strong baselines such as Gated Delta Networks.
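As an illustration of the abstract's claim that ICL can simulate online gradient descent on a local loss, here is a minimal sketch under the standard fast-weight view of linear attention; the state $S_t$, keys/values/queries $k_t, v_t, q_t$, and learning rate $\beta_t$ are illustrative notation, not taken from the paper:

$$
\ell_t(S) = \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2,
\qquad
S_t = S_{t-1} - \beta_t \nabla_S \ell_t(S_{t-1})
    = S_{t-1}\bigl(I - \beta_t k_t k_t^{\top}\bigr) + \beta_t v_t k_t^{\top},
\qquad
o_t = S_t q_t .
$$

A metaplastic variant in the spirit of the abstract would replace the scalar rate $\beta_t$ with a per-synapse (elementwise) rate $B_t$ derived from a Bayesian importance estimate, e.g. $S_t = S_{t-1} - B_t \odot \nabla_S \ell_t(S_{t-1})$, so that entries deemed important to previously seen context update more slowly; this is a sketch of the general idea, not the paper's exact update rule.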
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 19853