Self-Supervised Open-Ended Classification with Small Visual Language Models

ICLR 2024 Workshop ME-FoMo

Published: 04 Mar 2024, Last Modified: 06 May 2024, ME-FoMo 2024 Poster, CC BY 4.0
Keywords: In-context Learning, Few-shot Learning, Self-supervised Learning, Visual Language Model
TL;DR: We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
Abstract: We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models. Our approach imitates image captions in a self-supervised way by clustering a large pool of images and then assigning semantically unrelated names to the clusters. From these we construct a training signal consisting of interleaved sequences of image and pseudo-caption pairs followed by a query image, which we denote as the "self-context" sequence. Based on this signal, the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets spanning various granularities. Using models with approximately 1B parameters, we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research and applications in open-ended few-shot learning that would otherwise require access to large or proprietary models.
Submission Number: 28