When Noises Help: Improve Text-Image Multimodal Contrastive Learning with Stochastic Label Augmentations

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Contrastive learning (CL) has been widely used for self-supervised text-image multimodal representation learning. However, the state-of-the-art contrastive learning framework has two setbacks. The first lies in the design of contrastive learning itself, where the model pulls positive pairs together and pushes negative pairs apart: for each image, CL treats only one unique text as its positive sample and all remaining texts as negative samples. This design inevitably introduces a learning bias toward overfitting to specific data pairs. The second setback comes from the web-crawled datasets commonly used in CL, such as Conceptual Captions, YFCC, and LAION. These datasets are beneficial because of their large size, yet they contain a significant number of noisy or vague labels. In this paper, we examine how augmenting the ground-truth labels with randomness can bring significant improvements in text-image multimodal contrastive learning. Through the simple addition of noise to ground-truth labels, we observe substantial improvements in model performance and robustness with no additional computational overhead. We introduce three distinct stochastic label augmentation strategies and evaluate their effectiveness across various benchmarks, including zero-shot transfer, distribution shift, and linear probing tasks. Furthermore, we conduct comprehensive experiments with different model architectures and noise rates, demonstrating the generalizability and substantial benefits of stochastic label augmentation across diverse tasks and models.
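
To make the core idea concrete, here is a minimal sketch of how stochastic label augmentation could be injected into a CLIP-style symmetric contrastive loss. The abstract does not specify the paper's three augmentation strategies, so this illustrates just one plausible variant: perturbing the one-hot ground-truth matrix with uniform random noise controlled by a hypothetical `noise_rate` parameter. All function and variable names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: CLIP-style contrastive loss with stochastically
# softened ground-truth labels. Not the paper's exact method.
import torch
import torch.nn.functional as F

def noisy_targets(batch_size: int, noise_rate: float, device) -> torch.Tensor:
    """Return a one-hot pairing matrix perturbed by uniform random noise,
    renormalized so each row is still a valid probability distribution."""
    targets = torch.eye(batch_size, device=device)           # standard one-hot pairs
    noise = torch.rand_like(targets) * noise_rate            # stochastic augmentation
    soft = targets + noise
    return soft / soft.sum(dim=-1, keepdim=True)

def contrastive_loss_with_label_noise(image_emb, text_emb,
                                      temperature=0.07, noise_rate=0.1):
    """Symmetric image-text contrastive loss with noisy soft targets."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarities

    b = logits.size(0)
    # Sample fresh noisy targets per direction; soft targets are supported
    # by F.cross_entropy in PyTorch >= 1.10.
    loss_i2t = F.cross_entropy(logits, noisy_targets(b, noise_rate, logits.device))
    loss_t2i = F.cross_entropy(logits.t(), noisy_targets(b, noise_rate, logits.device))
    return (loss_i2t + loss_t2i) / 2
```

Relative to plain label smoothing, the noise here is resampled at every step, which is one way the "stochastic" aspect described in the abstract could avoid overfitting to specific image-text pairs without any extra computational overhead.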
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English