Data De-Duplication and Semantic Enhancement for Contrastive Language-Image Pre-training

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Data De-Duplication, Semantic Enhancement, Large Language Model, Visual Large Language Model, Contrastive Language-Image Pre-training
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a simple and novel training strategy for vision-language representation learning that decreases training cost without reducing data diversity and enhances cross-modal alignment.
Abstract: Benefiting from countless image-text pairs in web data, vision-language pre-training models (e.g., CLIP) have emerged as an efficient way to learn representations that transfer across a wide range of downstream tasks. However, we reveal that web data are noisy, with significant scene redundancy and misalignment in the image-text pairs, which inflates training cost and computing requirements. To alleviate these problems, this paper proposes a novel training strategy comprising two dedicated components, namely Data De-Duplication ($\text{D}^3$) and Semantic Enhancement (SE). $\text{D}^3$ leverages pre-clustered data prototypes to decrease training cost without reducing data diversity by uniformly sampling a portion of image-text pairs at each training epoch. SE utilizes a large language model (LLM) and a visual large language model (VLLM) to refine and augment the text captions, forming a one-to-multiple mapping between each image and its texts. Furthermore, we employ a Diverse Captions Training Mechanism (DCTM) and a Modality Self-enhancement Training Mechanism (MSTM) for effective training. Experimental results indicate that the proposed method achieves state-of-the-art performance on various tasks including image classification, image-text retrieval, object detection, and segmentation (performance improvements ranging from 0.2\% to 23.9\% across all datasets) with only half the training time of the original CLIP. Our code and generated data will be publicly available.
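The abstract describes $\text{D}^3$ as per-epoch uniform sampling over pre-clustered data prototypes. Below is a minimal sketch of that idea, assuming the prototypes come from k-means over image embeddings and that a fixed keep ratio is drawn uniformly from every cluster each epoch; the clustering method, ratio, and function names are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch of D^3-style sampling: cluster once before training, then each epoch
# uniformly sample a fraction of pairs from every cluster so that training cost
# drops while cluster-level (scene) diversity is preserved.
import numpy as np
from sklearn.cluster import KMeans


def cluster_dataset(image_embeddings: np.ndarray, num_prototypes: int) -> np.ndarray:
    """Assign each image-text pair to a prototype (cluster) once, before training."""
    kmeans = KMeans(n_clusters=num_prototypes, n_init="auto", random_state=0)
    return kmeans.fit_predict(image_embeddings)  # shape: (num_pairs,)


def sample_epoch_indices(cluster_ids: np.ndarray, keep_ratio: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Uniformly sample `keep_ratio` of the pairs from each cluster for one epoch."""
    selected = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        k = max(1, int(round(keep_ratio * members.size)))
        selected.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(selected)


# Hypothetical usage: re-draw the subset every epoch so different pairs are seen
# over time while each epoch only trains on (roughly) keep_ratio of the data.
# cluster_ids = cluster_dataset(image_embeddings, num_prototypes=1000)
# rng = np.random.default_rng(0)
# for epoch in range(num_epochs):
#     indices = sample_epoch_indices(cluster_ids, keep_ratio=0.5, rng=rng)
#     train_one_epoch(dataset_subset(indices))
```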
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2280