Relative Margin for Contrastive Learning

18 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Contrastive Learning, Multimodal Foundation Models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose Relative Margin for multimodal contrastive learning, which brings significant improvements on zero-shot image-text retrieval and image classification.
Abstract: Contrastive image-text pretraining has played an integral role in recent breakthroughs in multimodal understanding and generation. Conceptually, the contrastive loss encourages the alignment of true image-text pairs to stand out against that of wrong pairs, essentially creating a separation between them. During our exploration of contrastive learning, however, we identified a practical issue: the gradients of image-text pairs drop off quickly once this separation is created, so the large volume of higher-separation pairs contributes little to the optimization. To address this, we propose applying margins to the higher-separation training pairs to re-balance the gradient strength. We define the Relative Alignment Score as the separation indicator and incorporate a margin function that is linear in the Relative Alignment Score to adaptively increase a pair's contribution to the optimization. We name this method Relative Margin and observe significant performance improvements after applying it to zero-shot image-text retrieval and image classification benchmarks. Specifically, we train CoCa models with and without Relative Margin on the open LAION-2B dataset, observing +2.4 for ViT-B and +2.6 for ViT-L on MSCOCO image-to-text retrieval recall, and +2.0 for both ViT-B and ViT-L on zero-shot ImageNet top-1 accuracy. Notably, ViT-L with Relative Margin achieves 82.4% zero-shot ImageNet top-1 accuracy when trained on the open DataComp-1B dataset, outperforming previous state-of-the-art results obtained with larger models. Consistent improvements are also observed in few-shot linear probes of the ViT encoder in CoCa trained with Relative Margin.
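Illustrative sketch: the abstract states only that the margin is linear in the Relative Alignment Score and is applied to higher-separation pairs; the exact definitions are not given here. The code below is therefore a minimal, hypothetical sketch of the general idea, assuming a symmetric InfoNCE loss over L2-normalized embeddings and interpreting the "Relative Alignment Score" as the gap between a positive pair's logit and its hardest in-batch negative. The function name `relative_margin_contrastive_loss` and the `margin_scale` parameter are placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a margin-adjusted symmetric contrastive (InfoNCE) loss.
# Assumption: the "relative alignment score" of a pair is how far its positive
# logit stands out from the hardest in-batch negative; a margin linear in that
# score is subtracted from the positive logit so well-separated pairs keep
# contributing gradient instead of saturating early.
import torch
import torch.nn.functional as F


def relative_margin_contrastive_loss(image_emb, text_emb, temperature=0.07, margin_scale=0.2):
    """image_emb, text_emb: (N, D) L2-normalized embeddings of matched pairs."""
    logits = image_emb @ text_emb.t() / temperature              # (N, N) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)

    pos = logits.diag()                                          # positive-pair logits
    # Assumed relative alignment score: gap between the positive logit and the
    # hardest negative in the same row, clamped to be non-negative.
    diag_mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    hardest_neg = logits.masked_fill(diag_mask, float("-inf")).max(dim=1).values
    rel_score = (pos - hardest_neg).clamp(min=0)

    # Linear margin: larger separation -> larger margin subtracted from the
    # positive logit, re-balancing the gradient toward high-separation pairs.
    margin = margin_scale * rel_score.detach()

    logits_i2t = logits.clone()
    logits_i2t[labels, labels] = pos - margin
    logits_t2i = logits.t().clone()
    logits_t2i[labels, labels] = pos - margin

    loss_i2t = F.cross_entropy(logits_i2t, labels)
    loss_t2i = F.cross_entropy(logits_t2i, labels)
    return 0.5 * (loss_i2t + loss_t2i)
```

Detaching the margin treats it as a shift of the optimization target rather than a learnable quantity; whether the paper makes the same choice, or defines the score per row, per column, or against all negatives, is not specified in the abstract.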
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1514