Keywords: Continual learning, fine-tuning
Abstract: Continual fine-tuning involves incrementally training a language model to acquire knowledge of new tasks. This learning paradigm introduces the challenge of catastrophic forgetting, where models tend to forget previously learned tasks as they adapt to new ones. Several techniques have been proposed to address this issue, including regularization, parameter-isolation, and replay-based approaches. Among these, replay-based methods have gained wider adoption due to their less invasive nature and ease of integration into existing continual learning pipelines. However, in real-world settings, curating ideal replay samples is a practical challenge for replay-based methods. This leads to the use of noisy replay data (either synthetically generated from exemplars or provided by the task owner), which results in suboptimal task performance. To address this crucial real-world challenge, we introduce Teacher-Forced Selective Self-Distillation (TF-SSD), a novel method that employs self-distillation of labels from the task-stage model and refines less effective samples using a mixture-of-teachers framework. Our experiments in a challenging 16-task continual learning setting demonstrate that TF-SSD outperforms the best-performing baseline by $\sim$2.7 points in task performance and $\sim$2.8 points in mitigating catastrophic forgetting across two model families: Llama2 7B and Granite3.3 2B. We plan to open-source the code of TF-SSD.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Continual learning, fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7939