Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Data De-Duplication, Semantic Enhancement, Large Language Model, Visual Large Language Model, Contrastive Language-Image Pre-training
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a simple yet novel training strategy for vision-language representation learning that decreases training cost without reducing data diversity and enhances cross-modal alignment.
Abstract: Benefiting from the countless image-text pairs in web data, vision-language pre-training models (e.g., CLIP) have emerged as an efficient alternative for learning representations that transfer across a wide range of downstream tasks.
However, we reveal that web data are noisy, with significant scene redundancy and misalignment in the image-text pairs, which inflates training expense and computational requirements.
To alleviate these problems, this paper proposes a novel training strategy that comprises two dedicated components, namely Data De-Duplication ($\text{D}^3$) and Semantic Enhancement (SE).
$\text{D}^3$ leverages pre-clustered data prototypes to uniformly sample a portion of image-text pairs at each training epoch, decreasing the training cost without reducing data diversity.
SE utilizes a large language model (LLM) and a visual large language model (VLLM) to refine and augment the text captions, which helps form a one-to-multiple mapping between each image and its texts.
Furthermore, we employ a Diverse Captions Training Mechanism (DCTM) and a Modality Self-enhancement Training Mechanism (MSTM) for effective training.
Experimental results indicate that the proposed method achieves state-of-the-art performance on various tasks, including image classification, image-text retrieval, object detection, and segmentation (performance improvements ranging from 0.2\% to 23.9\% across all datasets), with only half the training time of the original CLIP.
Our code and generated data will be publicly available.
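As a rough, hypothetical illustration of the cluster-uniform per-epoch sampling that $\text{D}^3$ describes (the function name `d3_epoch_sample`, the `keep_ratio` parameter, and the precomputed cluster labels are assumptions for exposition, not the paper's implementation), a minimal sketch could look like:

```python
# Minimal sketch of cluster-uniform epoch sampling: given image-text pairs
# pre-assigned to prototype clusters, draw an equal share from each cluster
# so the per-epoch subset stays diverse while being much smaller.
import random
from collections import defaultdict

def d3_epoch_sample(pair_ids, cluster_labels, keep_ratio=0.5, seed=0):
    """Return a subset of pair ids, sampled uniformly across clusters."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, label in zip(pair_ids, cluster_labels):
        clusters[label].append(idx)
    subset = []
    for members in clusters.values():
        k = max(1, int(len(members) * keep_ratio))  # keep at least one pair per cluster
        subset.extend(rng.sample(members, k))
    rng.shuffle(subset)
    return subset

# Example: 10 pairs grouped into 3 prototype clusters.
ids = list(range(10))
labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
print(d3_epoch_sample(ids, labels, keep_ratio=0.5))
```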
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2280