TaCA: Hot-Plugging Upgrades for Foundation Model with Task-agnostic Compatible Adapter

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Visual Foundation Model, Compatible Representation Learning, Parameter-Efficient Transfer Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Visual foundation models, such as CLIP, excel at learning feature representations from large-scale datasets via self-supervised techniques, and show strong transfer learning and generalization. A growing number of applications build on visual foundation models, including innovative solutions such as BLIP-2. These applications employ a pre-trained CLIP model as an upstream feature extractor and train various downstream modules to accomplish diverse tasks. However, system upgrades that replace the foundation model pose a challenge: every downstream module must be retrained to align with the new model's feature space, an inefficient and inflexible process. In this paper, we propose a new and practical task, Hot-Plugging Upgrades for visual foundation models, whose aim is to seamlessly integrate a superior-performing foundation model into downstream applications without adjusting any downstream module. To realize this objective, we introduce a parameter-efficient and Task-agnostic Compatible Adapter, referred to as TaCA, which makes the new foundation model's representations compatible with the old model's while preserving the new model's performance gains. We validate TaCA extensively on models of different scales, with up to one billion parameters, across tasks such as video-text retrieval, video recognition, and visual question answering. The results consistently affirm the efficacy of TaCA in enabling hot-plugging upgrades for visual foundation models. Code and models will be made available.
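To make the compatibility idea in the abstract concrete, below is a minimal PyTorch sketch of a TaCA-style adapter. All names here (CompatibleAdapter, compatibility_loss, the bottleneck width, and the contrastive alignment objective) are illustrative assumptions, not the paper's exact architecture or losses; TaCA itself may place adapters inside the backbone and combine several training signals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompatibleAdapter(nn.Module):
    """Lightweight bottleneck adapter (hypothetical design) that maps the
    new foundation model's features into the old model's embedding space."""
    def __init__(self, new_dim: int, old_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(new_dim, bottleneck)
        self.up = nn.Linear(bottleneck, old_dim)

    def forward(self, feat_new: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(feat_new)))

def compatibility_loss(adapted: torch.Tensor, feat_old: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """One plausible training signal: contrastive alignment, where each
    adapted new feature should match the frozen old feature of the same
    input within the batch."""
    a = F.normalize(adapted, dim=-1)
    o = F.normalize(feat_old, dim=-1)
    logits = a @ o.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Only the adapter receives gradients; the old model, the new model, and
# all downstream modules stay frozen. Feature tensors here are stand-ins.
adapter = CompatibleAdapter(new_dim=1024, old_dim=768)
feat_new = torch.randn(8, 1024)  # features from the upgraded model
feat_old = torch.randn(8, 768)   # features from the deployed old model
loss = compatibility_loss(adapter(feat_new), feat_old)
loss.backward()
```

Because the downstream modules keep consuming features in the old model's embedding space, only the small adapter needs training, which is what makes such an upgrade parameter-efficient and task-agnostic under these assumptions.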
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1059