Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

Published: 01 Jan 2023, Last Modified: 07 Nov 2024 · LGM3A@MM 2023 · CC BY-SA 4.0
Abstract: In this paper, we introduce Subsampling of frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training Vision-Language Models (VLMs). SW-CLIP takes the frequency-based subsampling of words previously proposed for training skip-gram models in natural language processing and applies it to the textual training data of VLMs. We report on experiments demonstrating that frequency-based subsampling speeds up training and also delivers a substantial improvement in accuracy on a number of downstream zero-shot (i.e., transfer) classification tasks. We observe that the classification test sets on which SW-CLIP is particularly effective are those whose class labels occur infrequently as words in the training data, and thus have a high probability of being retained during frequency-based subsampling of the model training data. Overall, the advantages of SW-CLIP demonstrated in this paper motivate further work on text subsampling for the training of VLMs. Our code and pre-trained weights are available at https://github.com/Anastasiais-ml/sw_clip.git
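For concreteness, the subsampling step that SW-CLIP borrows from skip-gram training can be sketched as follows. This is a minimal illustration assuming the standard word2vec discard rule of Mikolov et al. (2013), where a word w with relative frequency f(w) is kept with probability min(1, sqrt(t / f(w))) for a small threshold t; the function name, threshold value, and whitespace tokenization are illustrative assumptions, not the authors' exact implementation.

```python
import math
import random
from collections import Counter

def subsample_captions(captions, t=1e-5, seed=0):
    """Drop frequent words from a corpus of captions.

    Each token w is kept with probability min(1, sqrt(t / f(w))),
    the word2vec-style subsampling rule: frequent words are often
    discarded, while rare words (f(w) <= t) are always retained.
    """
    rng = random.Random(seed)
    tokens = [w for c in captions for w in c.split()]
    counts = Counter(tokens)
    total = len(tokens)

    def keep(word):
        f = counts[word] / total          # relative frequency of the word
        return rng.random() < math.sqrt(t / f)  # probability > 1 means always keep

    return [" ".join(w for w in c.split() if keep(w)) for c in captions]

# Example: frequent function words such as "a" and "of" are likely to be
# dropped, while rare, class-label-like words such as "zebra" survive.
caps = ["a photo of a zebra", "a photo of a cat", "a photo of a dog"]
print(subsample_captions(caps, t=0.05))
```

Under this rule, infrequent words, which the abstract notes often coincide with downstream class labels, are retained with probability close to one, which is consistent with the reported gains on those test sets.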