Keywords: continual learning, synthetic data, knowledge distillation, visual-language model
TL;DR: We present a caption-guided replay paradigm and a distribution-based distillation method for continual learning, which together effectively mitigate forgetting and consistently surpass previous state-of-the-art approaches.
Abstract: Continual learning with vision-language models is challenged by catastrophic forgetting, where the acquisition of new knowledge compromises previously learned information. Generative replay synthesizes past samples to mitigate forgetting while avoiding the data-privacy risks and heavy storage overhead of directly replaying historical data. However, existing methods often rely on simple class-level prompts, such as class names combined with templates, resulting in synthetic images that poorly capture the semantics of the original images. To address this, we propose a \textit{caption-guided replay paradigm} that stores instance-level captions generated by a multi-modal LLM as memory and reconstructs past images using a LoRA-adapted text-to-image model. This approach enables high-fidelity, instance-aware synthetic replay while remaining storage-efficient. Beyond improving replay fidelity, we observe the phenomenon of \textit{feature drift} in continual learning: pervasive shifts in intermediate representations during sequential training that logit distillation only partially corrects. To counteract this, we introduce a distribution-based distillation method that aligns feature distributions at multiple intermediate layers, effectively suppressing feature drift and enhancing model stability. Extensive experiments under various settings demonstrate that our proposed method consistently outperforms state-of-the-art approaches.
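The abstract does not specify the exact form of the distribution-based distillation loss. Below is a minimal, hypothetical sketch of the general idea: at each intermediate layer, match the per-channel feature statistics (here, first and second moments) of the current model against a frozen copy of the previous-task model. The function name and the choice of moment matching are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def distribution_distillation_loss(student_feats, teacher_feats):
    """Hypothetical distribution-alignment loss for continual learning.

    student_feats / teacher_feats: lists of per-layer feature matrices,
    each of shape (batch, channels), from the current model and a frozen
    snapshot of the previous-task model respectively.

    Aligns per-channel mean and variance at every intermediate layer,
    penalizing feature drift during sequential training.
    """
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        mu_s, mu_t = s.mean(axis=0), t.mean(axis=0)      # per-channel means
        var_s, var_t = s.var(axis=0), t.var(axis=0)      # per-channel variances
        loss += np.mean((mu_s - mu_t) ** 2) + np.mean((var_s - var_t) ** 2)
    return loss / len(student_feats)
```

In practice this term would be added to the task loss (and any logit-distillation term) with a weighting coefficient; richer alignments (e.g., full covariance or distributional divergences) fit the same multi-layer structure.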
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 5155