Remembering Transformer for Continual Learning

Published: 2025, Last Modified: 07 Jan 2026, Venue: IJCNN 2025, License: CC BY-SA 4.0
Abstract: Conventional neural networks, including Transformers, suffer from catastrophic forgetting during sequential task learning: learning new tasks interferes with previously learned knowledge. Existing memory replay and regularization methods cannot effectively eliminate interference among tasks. Soft parameter sharing methods usually require a large number of additional parameters for each task, and they rely on task identity information to select task-specific parameters. To address these limitations, we propose Remembering Transformer, which uses a mixture-of-adapters architecture enhanced by a generative routing mechanism for efficient task retention. In generative routing, input samples are allocated to the most relevant expert adapters based on a reconstruction loss. Moreover, unlike existing studies on soft parameter sharing, which do not consider model capacity limitations, we investigate a challenging setting where the number of task-specific parameters is constrained. In particular, we devise an adapter fusion strategy that merges similar experts through similarity matching and knowledge distillation. Extensive empirical results, measured by task accuracy, forgetting rate, and memory footprint, demonstrate that Remembering Transformer significantly enhances knowledge retention without task identity information. The proposed method surpasses various conventional methods, with better parameter efficiency, across a broad range of incremental learning tasks.
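The generative routing idea in the abstract (send each input to the expert whose generative model reconstructs it best) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the autoencoder, so a simple PCA-style linear autoencoder stands in for it, and the names `LinearAutoencoder` and `route` are hypothetical rather than taken from the paper.

```python
import numpy as np


class LinearAutoencoder:
    """Hypothetical per-task autoencoder: a PCA projection fitted on one task's features.

    Assumption: stands in for whatever generative model the paper actually uses.
    """

    def __init__(self, features, latent_dim):
        self.mean = features.mean(axis=0)
        # Rows of vt are the principal directions of the centered task data.
        _, _, vt = np.linalg.svd(features - self.mean, full_matrices=False)
        self.basis = vt[:latent_dim]  # shape: (latent_dim, feature_dim)

    def reconstruction_loss(self, x):
        """Mean squared reconstruction error for each row of x."""
        centered = x - self.mean
        recon = centered @ self.basis.T @ self.basis
        return np.mean((centered - recon) ** 2, axis=-1)


def route(x, autoencoders):
    """Return, for each row of x, the index of the expert with the lowest reconstruction loss."""
    losses = np.stack([ae.reconstruction_loss(x) for ae in autoencoders], axis=0)
    return np.argmin(losses, axis=0)  # shape: (batch,)


# Toy data (assumed): two "tasks" whose features vary along different axes.
rng = np.random.default_rng(0)
scale_a = np.array([3.0, 3.0] + [0.01] * 6)
scale_b = np.array([0.01] * 6 + [3.0, 3.0])
task_a = rng.normal(size=(200, 8)) * scale_a
task_b = rng.normal(size=(200, 8)) * scale_b

autoencoders = [LinearAutoencoder(task_a, 2), LinearAutoencoder(task_b, 2)]

# One query resembling each task; no task identity is passed to the router.
queries = np.array([
    [2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, -2.0],
])
routed = route(queries, autoencoders)
print(routed)
```

In a full system, the selected index would pick which expert adapter processes the sample; the sketch covers only the routing decision and omits the adapters, the fusion step, and knowledge distillation.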