Keywords: Continual Learning, Catastrophic Forgetting, LLMs, MLP, Feed-Forward Layers
Abstract: Large Language Models (LLMs) face the challenge of catastrophic forgetting in continual learning scenarios, where learning new tasks often overwrites previously acquired knowledge, degrading performance and limiting their applicability in dynamic task environments. Existing approaches can be categorized into rehearsal-based, regularization-based, and architecture-based methods. Among these, architecture-based methods are better suited to LLMs because they dynamically adjust model structure to cope with large-scale parameters and task interference. However, existing methods often struggle with parameter efficiency and fail to fully exploit the structural properties of the Transformer architecture.
In this work, we propose Branching Memory, a novel method that leverages how knowledge is organized within Transformer models. By modeling knowledge as key-value (KV) representations within the feed-forward network (FFN) layers, our approach dynamically allocates dedicated capacity for new tasks, allowing the model to store and integrate task-specific knowledge without overwriting existing information. To further improve knowledge retention and reduce task interference, we employ an orthogonality-based regularization strategy that stabilizes training and minimizes parameter conflicts.
Experimental results on standard continual learning benchmarks demonstrate that Branching Memory achieves superior performance with enhanced parameter efficiency. On short-sequence tasks with T5-Large, Branching Memory with regularization achieves 76.6% average accuracy, outperforming baseline methods. Extended evaluations on LLaMA2-7B and on long 15-task sequences validate the method's scalability and effectiveness across model architectures and task-sequence lengths. The method's practical advantage lies in its balanced trade-off between performance, parameter efficiency, and inference simplicity in continual learning scenarios.
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 5073