When Noise Helps: Improving Text-Image Multimodal Contrastive Learning with Stochastic Label Augmentation
Abstract: Contrastive learning (CL) has been widely used for self-supervised text-image multimodal representation learning. However, state-of-the-art contrastive learning frameworks suffer from two drawbacks. The first lies in the design of contrastive learning itself, where the model pulls positive pairs together and pushes negative pairs apart: for each image, CL considers only one unique text as its positive sample and treats all remaining texts as negatives. This design inevitably introduces a learning bias toward overfitting to specific data pairs. The second comes from the web-crawled datasets commonly used in CL, such as Conceptual Captions, YFCC, and LAION. These datasets are beneficial due to their large size, yet they contain a significant amount of noisy or vague labels. In this paper, we examine how augmenting the ground-truth labels with randomness can bring significant improvements to text-image multimodal contrastive learning. Through the simple addition of noise to ground-truth labels, we observe substantial improvements in model performance and robustness with no additional computational overhead. We introduce three distinct stochastic label augmentation strategies and evaluate their effectiveness across various benchmarks, including zero-shot transfer, distribution shift, and linear probing tasks. Furthermore, we conduct comprehensive experiments involving different model architectures and noise rates, demonstrating the generalizability and substantial benefits of stochastic label augmentation across diverse tasks and models.
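The abstract does not spell out the loss formulation, but a minimal sketch of the idea, mixing the one-hot InfoNCE targets of a CLIP-style contrastive loss with freshly sampled noise at each step, might look as follows. The function name `noisy_contrastive_loss`, the `noise_rate` parameter, and the uniform-random perturbation below are illustrative assumptions, not the paper's three specific strategies.

```python
# Minimal sketch (assumed, not the authors' exact method): a CLIP-style
# symmetric InfoNCE loss whose hard one-hot targets are stochastically
# perturbed with random noise redrawn at every training step.
import torch
import torch.nn.functional as F

def noisy_contrastive_loss(image_emb, text_emb, temperature=0.07, noise_rate=0.1):
    """InfoNCE loss with stochastically augmented targets.

    image_emb, text_emb: (batch, dim) L2-normalized embeddings.
    noise_rate: fraction of target mass redistributed at random (assumed).
    """
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    batch = logits.size(0)

    # Standard CL target: identity matrix (each image matches only its own caption).
    hard_targets = torch.eye(batch, device=logits.device)

    # Stochastic label augmentation: mix the one-hot targets with random noise
    # drawn fresh each step, renormalized so each row remains a distribution.
    noise = torch.rand(batch, batch, device=logits.device)
    noise = noise / noise.sum(dim=1, keepdim=True)
    targets = (1.0 - noise_rate) * hard_targets + noise_rate * noise

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the noisy targets are built from the same (B, B) similarity matrix the standard loss already computes, this kind of augmentation adds no extra forward passes, consistent with the abstract's claim of no additional computational overhead.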
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English