Exploring Continual Distillation of Teachers from Different Domains

Published: 23 Sept 2025, Last Modified: 11 Nov 2025, CCFM Poster, CC BY 4.0
Keywords: Continual Distillation, Domain Shift, Continual Learning, Domain Forgetting
TL;DR: We propose Continual Distillation, where a student learns from a sequence of teachers without access to prior ones, and highlight the resulting forgetting on domains unseen by the student.
Abstract: With Foundation Model (FM) training costs rising to unprecedented heights, Continual Learning (CL) is a particularly compelling training paradigm: it reduces the cost of maintaining an FM by incorporating new data incrementally instead of re-training from scratch. Instead of learning from a sequence of data affected by domain shift, we propose \textbf{Continual Distillation (CD)}, a new paradigm in which a single student model learns continually from a sequence of teachers. As in CL, re-distilling from all previous teachers whenever a new teacher is introduced is unsustainable, or even impossible when teachers are provided by a third party. Therefore, while learning from one teacher, all other teachers are considered unavailable. We show that CD naturally suffers from catastrophic forgetting: knowledge distilled from earlier teachers is forgotten while learning from subsequent ones. Moreover, we find that the choice of distillation data plays a central role, and that even data unrelated to a teacher's original training domain can serve as an effective medium of knowledge transfer. This property has significant implications for FMs, whose original training data are often unavailable, undisclosed, or prohibitively large to reuse. While CD alleviates the dependence on original data, it also raises security concerns, as a student may inadvertently absorb undesired knowledge without conscious control. Our study establishes CD as a new direction for leveraging foundation models in resource-constrained environments.
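The sketch below illustrates the CD protocol as described in the abstract: a single student is distilled from a sequence of teachers, one at a time, with no access to earlier teachers once a new one arrives, and with distillation data that need not come from any teacher's original training domain. This is a minimal PyTorch illustration under our own assumptions, not the authors' implementation; all function and variable names (e.g. `continual_distillation`, `distill_from_teacher`) are hypothetical.

```python
# Minimal sketch of Continual Distillation (CD), assuming standard logit
# distillation (KL between temperature-softened outputs). Illustrative only.
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def distill_from_teacher(student, teacher, loader, epochs=1, temperature=2.0, lr=1e-3):
    """Distill one teacher into the student on (possibly unrelated) unlabeled data."""
    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:  # unlabeled distillation batch
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = F.kl_div(
                F.log_softmax(s_logits / temperature, dim=-1),
                F.softmax(t_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


def continual_distillation(student, teachers, loaders, **kw):
    """CD loop: at each step only the current teacher is available,
    so knowledge from earlier teachers can be forgotten."""
    for teacher, loader in zip(teachers, loaders):
        distill_from_teacher(student, teacher, loader, **kw)
    return student


if __name__ == "__main__":
    # Toy example: two teachers standing in for models from different domains.
    dim, n_classes = 16, 4
    make_net = lambda: nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
    teachers = [make_net(), make_net()]
    # Random inputs stand in for distillation data unrelated to the teachers'
    # original training domains.
    loaders = [
        DataLoader(TensorDataset(torch.randn(256, dim)), batch_size=32, shuffle=True)
        for _ in teachers
    ]
    student = make_net()
    continual_distillation(student, teachers, loaders, epochs=1)
```

In this setup, evaluating the student on the first teacher's domain after distilling from the second would expose the forgetting effect the paper studies.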
Serve As Reviewer: ~Nicolas_Michel1
Submission Number: 34