Abstract: Current deep learning models often catastrophically forget the knowledge of old classes when continually learning new ones. State-of-the-art approaches to continual learning of image classes often require retaining a small subset of old data to partly alleviate the catastrophic forgetting issue, and their performance degrades sharply when no old data can be stored due to privacy or safety concerns. In this study, inspired by how humans learn visual knowledge with the effective help of language, we propose a novel continual learning framework based on a pre-trained vision-language model (VLM) that retains no old data. Rich prior knowledge of each new image class is effectively encoded by the frozen text encoder of the VLM, which is then used to guide the learning of new image classes. The output space of the frozen text encoder remains unchanged over the whole process of continual learning, so that image representations of different classes stay comparable during model inference even when the image classes are learned at different times. Extensive empirical evaluations on multiple image classification datasets under various settings confirm the superior performance of our method over existing ones. The source code is available at https://github.com/Fatflower/CIL_LG_VLM/.
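The sketch below illustrates, in broad strokes, the idea the abstract describes: a frozen text encoder provides fixed class anchors that guide image-feature learning for each new task, with no old data retained, and inference compares image features against the anchors of all classes seen so far. This is a minimal illustration under our own assumptions, not the authors' exact method; the module names, the random stand-in for the VLM text encoder, and all hyperparameters are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's implementation): language-guided
# class-incremental learning with a frozen text space that fixes the output space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageEncoder(nn.Module):
    """Stand-in for a trainable vision backbone (e.g., the VLM's image encoder)."""

    def __init__(self, in_dim=3 * 32 * 32, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm image features


@torch.no_grad()
def encode_class_names(class_names, embed_dim=512):
    """Placeholder for the frozen VLM text encoder. In practice these anchors would
    be text embeddings of prompts such as 'a photo of a {class}'; here we use fixed
    random vectors so the sketch runs without a pre-trained model."""
    g = torch.Generator().manual_seed(0)
    anchors = torch.randn(len(class_names), embed_dim, generator=g)
    return F.normalize(anchors, dim=-1)


def train_task(image_encoder, task_anchors, loader, epochs=1, lr=1e-3, tau=0.07):
    """Align image features with the frozen text anchors of the current task's classes.
    Only new-task data is used; no samples from previously learned classes are stored."""
    opt = torch.optim.Adam(image_encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:  # labels index into task_anchors
            logits = image_encoder(images) @ task_anchors.T / tau  # scaled cosine sim
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()


@torch.no_grad()
def predict(image_encoder, all_seen_anchors, images):
    """Inference scores images against anchors of every class seen so far; because the
    text space is frozen, classes learned at different times remain directly comparable."""
    return (image_encoder(images) @ all_seen_anchors.T).argmax(dim=-1)
```

In this reading, the frozen anchors play the role of a fixed classifier head shared across tasks, which is what keeps representations of old and new classes in a common, comparable space without replaying old data.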