Keywords: zero-shot classification, multi-modal representation learning, knowledge distillation, CLIP
TL;DR: We improve CLIP zero-shot classification by reformulating it as a von Mises-Fisher mixture model learned via self-supervision.
Abstract: Contrastive language-image pre-training (CLIP) has revolutionized computer vision by integrating natural language understanding with image analysis, enabling zero-shot classification without prior training on specific classes. However, recent efforts to improve the performance of frozen CLIP models through prompt tuning and adapter mechanisms have introduced additional system complexity and training requirements, undermining CLIP's inherent efficiency in zero-shot knowledge transfer. In this paper, we address two common challenges in zero-shot classification with CLIP: (1) the misalignment between textual and image embeddings, and (2) the long-tailed distribution of CLIP's training dataset. Our approach, CLIP-Enhance, is motivated by a re-interpretation of CLIP zero-shot classification as a clustering problem on a hypersphere under a von Mises-Fisher mixture model. Inspired by the DINO self-supervised learning framework, we optimize this mixture model to simultaneously improve the alignment of textual and image embeddings and account for distribution disparities between the training and evaluation datasets. Empirically, we show that jointly optimizing embedding alignment and concentration via self-supervised learning significantly improves CLIP zero-shot classification across multiple benchmark datasets. We also show empirically how CLIP-Enhance mitigates challenges (1) and (2), and demonstrate its robustness to limited data through a series of additional experiments.
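For intuition on the von Mises-Fisher re-interpretation mentioned in the abstract: if the class means are fixed to the L2-normalized text embeddings, the class priors are uniform, and a single concentration parameter κ is shared across classes, then the vMF normalizing constants cancel and the mixture posterior reduces to a softmax over scaled cosine similarities, which is exactly CLIP's standard zero-shot rule with κ playing the role of the inverse temperature (logit scale). The following is a minimal PyTorch sketch of that equivalence; the function name, tensor shapes, and κ value are illustrative assumptions, not details taken from the submission.

```python
import torch
import torch.nn.functional as F

def vmf_zero_shot_posterior(image_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            kappa: float = 100.0) -> torch.Tensor:
    """Class posteriors under a vMF mixture with uniform priors and shared kappa.

    With mean directions mu_c set to the normalized text embeddings, the
    vMF log-density kappa * mu_c^T x is the class logit up to a constant,
    so the posterior is a softmax over scaled cosine similarities --
    identical to CLIP zero-shot classification with logit scale kappa.
    """
    x = F.normalize(image_emb, dim=-1)   # (N, d) points on the unit hypersphere
    mu = F.normalize(text_emb, dim=-1)   # (C, d) vMF mean directions, one per class
    logits = kappa * x @ mu.t()          # (N, C) scaled cosine similarities
    return logits.softmax(dim=-1)        # posterior p(c | x)
```

Under this reading, vanilla CLIP corresponds to fixed means and a single global κ; the abstract's proposal of jointly optimizing alignment and concentration amounts to relaxing both of these constraints.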
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11423