Modality-Aware Adaptation of Contrastive Language-Image Models

Published: 04 Mar 2023, Last Modified: 16 May 2023 · ME-FoMo 2023 Poster
Keywords: Foundation Models, Vision-Language, CLIP, Few Shot Learning, CLIP Adapter
TL;DR: By considering the structure of the CLIP latent space, we can efficiently adapt CLIP to few-shot downstream tasks, in some cases without any labelled samples.
Abstract: Despite their high levels of robustness, Contrastive Language-Image Models (CLIP) still require some form of downstream adaptation when applied to tasks sufficiently out-of-domain with respect to their training set. Recent methods propose lightweight adapters on the model features, primarily focused on the few-shot domain. All such approaches, however, require per-task hyperparameter tuning, which necessitates access to a validation set and limits their applicability in practice. As an alternative, we propose Modality Aware Tangent-space Retrieval (MATeR), a training-free, interpretable adapter which outperforms all recent methods when per-task hyperparameter tuning is prohibited. MATeR considers the manifold formed by CLIP embeddings when incorporating out-of-domain few-shot class information, and its predictions are invariant to the modality gap; it is the first approach that considers the geometric structure of the CLIP latent space to inform downstream task adaptation. Additionally, we demonstrate that a variant of MATeR can significantly increase zero-shot accuracy with only a handful of unlabelled images, far fewer than the number of classes.
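Note: the abstract does not spell out the MATeR procedure itself. The sketch below is only a hypothetical illustration of the ingredients it mentions (the unit-hypersphere geometry of CLIP embeddings, a tangent-space view, and training-free retrieval that blends zero-shot text scores with few-shot image prototypes). The function names, the choice of base point, and the blending weight `alpha` are assumptions made for illustration, not the authors' method; CLIP embeddings are assumed to be precomputed.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project vectors onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_map(mu, x):
    """Riemannian log map on the unit sphere: map points x into the tangent
    space at base point mu, preserving geodesic distance from mu."""
    cos_t = np.clip(x @ mu, -1.0, 1.0)
    theta = np.arccos(cos_t)                      # geodesic distance to mu
    residual = x - cos_t[:, None] * mu            # component orthogonal to mu
    norm = np.linalg.norm(residual, axis=1, keepdims=True) + 1e-8
    return theta[:, None] * residual / norm

def adapt_and_classify(query_embs, text_embs, support_embs, support_labels, alpha=0.5):
    """Training-free few-shot classification from precomputed CLIP embeddings.

    query_embs:     (Q, d) image embeddings to classify
    text_embs:      (C, d) class-prompt text embeddings
    support_embs:   (K, d) labelled few-shot image embeddings
    support_labels: (K,)  integer class labels in [0, C)
    alpha: hand-picked blend between zero-shot and few-shot scores (illustrative).
    """
    support_labels = np.asarray(support_labels)
    query_embs, text_embs, support_embs = map(l2_normalize, (query_embs, text_embs, support_embs))

    # Base point for the tangent space: renormalised mean of the support images.
    mu = l2_normalize(support_embs.mean(axis=0))

    # Map all embeddings into the tangent space at mu and renormalise there,
    # so scores depend on directions within that local chart rather than on
    # raw positions on the sphere.
    t_query = l2_normalize(log_map(mu, query_embs))
    t_text = l2_normalize(log_map(mu, text_embs))
    t_support = log_map(mu, support_embs)

    # One prototype per class from the few labelled shots (assumes >= 1 shot per class).
    n_classes = text_embs.shape[0]
    protos = l2_normalize(np.stack(
        [t_support[support_labels == c].mean(axis=0) for c in range(n_classes)]))

    zero_shot = t_query @ t_text.T    # retrieval against class prompts
    few_shot = t_query @ protos.T     # retrieval against few-shot image prototypes
    return (alpha * zero_shot + (1.0 - alpha) * few_shot).argmax(axis=1)

if __name__ == "__main__":
    # Smoke test with random stand-ins for CLIP embeddings (d = 512, 2 shots per class).
    rng = np.random.default_rng(0)
    C, Q, d = 10, 5, 512
    support_labels = np.repeat(np.arange(C), 2)
    preds = adapt_and_classify(rng.normal(size=(Q, d)),
                               rng.normal(size=(C, d)),
                               rng.normal(size=(len(support_labels), d)),
                               support_labels)
    print(preds.shape)  # (5,)
```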