A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Multi-Adapters

Haihua Luo; Xuming Ran; Dianbo Liu; Jiangrong Shen; Qi Xu; Fengyu Cong

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Continual learning; CLIP; Adapter

Abstract: Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot capabilities of pre-trained vision-language models into IL methods has been a significant advance. However, these methods face three primary challenges: 1) limited training efficiency; 2) reliance on a memory bank that stores previous data; and 3) the need for a strong backbone to augment the model's capabilities. In this paper, we propose $\textbf{SimE}$, a $\textbf{Sim}$ple and $\textbf{E}$fficient framework that equips a vision-language model with an adapter designed for IL tasks. We report a notable phenomenon: the number of adaptive adapter connections is not always positively correlated with the model's IL capability. Increasing the number of adapter connections between transformer blocks improves performance, whereas adding more adaptive connections within transformer blocks does not help, and may even degrade IL ability, in smaller incremental stages; it yields an improvement only at advanced incremental stages. Extensive experiments show that SimE surpasses traditional methods by 9.6\% on TinyImageNet and outperforms other CLIP-based methods by 5.3\% on CIFAR-100. Notably, SimE, with only thousands of parameters and a memory bank of size 0, exceeds ZSCL, which has 140 million parameters, and also beats CoOp, which uses a memory bank of size 1000. We also conduct a systematic study of how to better exploit the zero-shot capabilities of CLIP. Based on it, we suggest that for IL tasks the encoder backbone in SimE use a CLIP image encoder pre-trained on a large dataset, such as LAION-2B, with a larger model size, such as ViT-L/14.
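To make the idea of a lightweight adapter on top of a frozen encoder concrete, the following is a minimal sketch of a generic residual bottleneck adapter. It is not taken from the submission: the class name `BottleneckAdapter`, the dimensions, the ReLU nonlinearity, and the zero-initialized up-projection are all assumptions chosen for illustration, and SimE's actual adapter design and connection pattern may differ.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + scale * W_up(relu(W_down x))."""

    def __init__(self, dim, bottleneck, scale=1.0, seed=0):
        rng = random.Random(seed)
        # Down-projection, shape (bottleneck, dim), small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection, shape (dim, bottleneck), zero-initialized (an
        # assumption) so the adapter starts out as the identity map.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.scale = scale

    def num_parameters(self):
        """Count the adapter's weights; the frozen encoder is not included."""
        down = len(self.w_down) * len(self.w_down[0])
        up = len(self.w_up) * len(self.w_up[0])
        return down + up

    def __call__(self, features):
        hidden = [max(0.0, h) for h in matvec(self.w_down, features)]  # ReLU
        update = matvec(self.w_up, hidden)
        # Residual connection keeps the frozen features and adds a correction.
        return [f + self.scale * u for f, u in zip(features, update)]


adapter = BottleneckAdapter(dim=8, bottleneck=2)
frozen_features = [0.5] * 8  # stand-in for the output of a frozen encoder
adapted = adapter(frozen_features)
```

In this sketch only the two small projection matrices would be trained, which is why the parameter count scales with `2 * dim * bottleneck` rather than with the size of the backbone.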

Primary Area: applications to neuroscience & cognitive science

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 6939
