Keywords: Vision-Language Models, Self-supervised learning
TL;DR: We present a simple tuning framework that aligns frozen pretrained single-modality models for vision-language tasks by training only a lightweight alignment layer.
Abstract: Foundation models like CLIP are pivotal for advancing research in vision-language learning, as they simultaneously learn modality-specific representations and cross-modal alignment. However, training these models is resource-intensive, requiring hundreds of millions of image-text pairs and hundreds of GPUs, creating a barrier to research on multimodal alignment. In this paper, we introduce the \textbf{S}wift \textbf{A}lignment of \textbf{I}mage and \textbf{L}anguage (SAIL) framework, which focuses on vision-language alignment by tuning a lightweight alignment layer added on top of frozen pretrained single-modality models. SAIL drastically reduces computational demands, requiring only a single GPU to align the pretrained feature spaces.
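To make the alignment-layer idea concrete, below is a minimal, hypothetical sketch of training small projection heads on top of frozen unimodal features with a CLIP-style contrastive loss. The head architecture, feature dimensions, loss, and hyperparameters are illustrative assumptions; the abstract does not specify SAIL's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Lightweight trainable layer mapping frozen unimodal features
    into a shared embedding space (placeholder architecture)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image-text embeddings."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# The pretrained vision and language encoders are kept frozen elsewhere;
# only the two alignment heads are optimized.
img_head = AlignmentHead(in_dim=768, out_dim=512)   # assumed dimensions
txt_head = AlignmentHead(in_dim=1024, out_dim=512)  # assumed dimensions
optimizer = torch.optim.AdamW(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4)

# One illustrative training step on precomputed frozen features.
img_feats = torch.randn(32, 768)   # stand-in for frozen vision features
txt_feats = torch.randn(32, 1024)  # stand-in for frozen language features
loss = contrastive_loss(img_head(img_feats), txt_head(txt_feats))
loss.backward()
optimizer.step()
```

Because the encoders stay frozen and only the small heads receive gradients, a step like this fits comfortably on a single GPU, which is the efficiency argument the abstract makes.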
Submission Number: 106