Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective

Changyou Chen; Jianyi Zhang; Yi Xu; Liqun Chen; Jiali Duan; Yiran Chen; Son Dinh Tran; Belinda Zeng; Trishul Chilimbi

Published: 31 Oct 2022, Last Modified: 15 Jan 2023, NeurIPS 2022 Accept, Readers: Everyone

Keywords: Bayesian data augmentation, contrastive learning, representation learning

TL;DR: A Bayesian data augmentation method to disentangle negative samples for gradient bias mitigation in contrastive learning

Abstract: Contrastive learning (CL) has been the de facto technique for self-supervised representation learning (SSL), with impressive empirical success in areas such as multi-modal representation learning. However, the traditional CL loss only considers negative samples from a minibatch, which could cause biased gradients due to the non-decomposability of the loss. For the first time, we consider optimizing a more generalized contrastive loss, where each data sample is associated with an infinite number of negative samples. We show that directly using minibatch stochastic optimization could lead to gradient bias. To remedy this, we propose an efficient Bayesian data augmentation technique to augment the contrastive loss into a decomposable one, where standard stochastic optimization can be directly applied without gradient bias. Specifically, our augmented loss defines a joint distribution over the model parameters and the augmented parameters, which can be conveniently optimized by a proposed stochastic expectation-maximization algorithm. Our framework is more general and is related to several popular SSL algorithms. We verify our framework on both small-scale models and several large foundation models, including SSL on ImageNet and SSL for vision-language representation learning. Experimental results indicate the existence of gradient bias in all cases, and demonstrate the effectiveness of the proposed method in improving on the previous state of the art. Remarkably, our method can outperform the strong MoCo-v3 under the same hyper-parameter setting with only around half of the minibatch size, and it also obtains strong results on the recent public benchmark ELEVATER for few-shot image classification.
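
The gradient bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's generalized loss or its Bayesian augmentation method. It only assumes a simplified negative-sample term, log(mean_j exp(anchor . negative_j)), over a small finite pool of negatives, and compares the full gradient with the exact expectation of the minibatch gradient taken over all subsets of a given size. All names (`neg_term_grad`, `expected_minibatch_grad`, `grad_bias`) and the random data are hypothetical.

```python
import itertools
import numpy as np

def neg_term_grad(anchor, negatives):
    """Gradient w.r.t. `anchor` of log(mean_j exp(anchor . negatives[j]))."""
    logits = negatives @ anchor
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ negatives                # softmax-weighted average of negatives

def expected_minibatch_grad(anchor, negatives, batch_size):
    """Exact expectation over all subsets of size `batch_size` (no sampling noise)."""
    n = len(negatives)
    grads = [neg_term_grad(anchor, negatives[list(idx)])
             for idx in itertools.combinations(range(n), batch_size)]
    return np.mean(grads, axis=0)

# Toy data (assumption): 8 negatives in 4 dimensions.
rng = np.random.default_rng(0)
anchor = rng.normal(size=4)
negatives = rng.normal(size=(8, 4))
full_grad = neg_term_grad(anchor, negatives)

def grad_bias(batch_size):
    """Norm of the gap between the expected minibatch gradient and the full gradient."""
    expected = expected_minibatch_grad(anchor, negatives, batch_size)
    return np.linalg.norm(expected - full_grad)

for b in (2, 4, 8):
    print(f"batch size {b}: gradient bias {grad_bias(b):.6f}")
```

Because the logarithm sits outside the sum over negatives, the term does not decompose into per-sample pieces, so averaging minibatch gradients does not recover the full gradient unless the minibatch covers the whole pool. In this toy setting the gap is nonzero for a batch size of 2 and vanishes when the batch contains all 8 negatives.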

Supplementary Material: pdf
