Keywords: Continual learning, fine-tuning
Abstract: Continual fine-tuning involves incrementally training a language model to acquire knowledge of new tasks. This learning paradigm introduces the challenge of catastrophic forgetting, where models tend to forget previously learned tasks as they adapt to new ones. Several techniques have been proposed to address this issue, including regularization, parameter-isolation, and replay-based approaches. Among these, replay-based methods have gained wider adoption due to their less invasive nature and ease of integration into existing continual learning pipelines. However, in real-world settings, curating ideal replay samples is a practical challenge for replay-based methods. This leads to the use of noisy replay data (either synthetically generated from exemplars or provided by the task owner), which results in suboptimal task performance. To address this crucial real-world challenge, we introduce Teacher-Forced Selective Self-Distillation (TF-SSD), a novel method that employs self-distillation of labels from the task-stage model and refines less effective samples using a mixture-of-teachers framework. Our experiments in a challenging 16-task continual learning setting demonstrate that TF-SSD outperforms the best-performing baseline by $\sim$2.7 points in task performance and $\sim$2.8 points in mitigating catastrophic forgetting across two model families: Llama2 7B and Granite3.3 2B. We plan to open-source the code of TF-SSD.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Continual learning, fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7939