CR-MoE: Consistent Routed Mixture-of-Experts for Scaling Contrastive Learning

Published: 13 Feb 2024, Last Modified: 13 Feb 2024. Accepted by TMLR.
Abstract: While Contrastive Learning (CL) achieves great success in many downstream tasks, its good performance relies heavily on large model capacity. Because previous methods focus on scaling dense models, training and inference costs increase rapidly with model size, leading to large resource consumption. In this paper, we explore CL with an efficient scaling method, Mixture of Experts (MoE), to obtain a large but sparse model. We start by plugging a state-of-the-art CL method into MoE. However, this naive combination fails to visibly improve performance despite its much larger capacity. A closer look reveals that the naive MoE+CL model has a strong tendency to route two augmented views of the same image token to different subsets of experts: such "cross-view instability" breaks the weight-sharing nature of CL and misleads invariant feature learning. To address this issue, we introduce a new regularization mechanism that enforces expert-routing similarity between different views of the same image (or its overlapped patch tokens) while promoting expert-routing diversity among patches from different images. The resulting method, called CR-MoE, improves 1% semi-supervised learning accuracy on ImageNet by 1.7 points compared to the naive combination baseline. It further surpasses state-of-the-art CL methods on ImageNet pre-training of Vision Transformer (ViT) by 2.8 points at the same computational cost. Our findings validate CR-MoE as an effective and efficient image representation learner. Code is available at
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Incorporated the comments and updates from the rebuttal.
Supplementary Material: zip
Assigned Action Editor: ~Kui_Jia1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1523
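
Illustrative sketch: the abstract describes a regularizer that pulls together the expert routing of two views of the same token and pushes apart the routing of tokens from different images. One plausible way to express that idea is an InfoNCE-style loss over routing distributions, shown below with NumPy. This is an assumption-laden illustration, not the authors' loss: the function name `routing_consistency_loss`, the use of cosine similarity, the `temperature` value, and the toy gate matrix are all hypothetical choices made here.

```python
import numpy as np

def routing_consistency_loss(gates_a, gates_b, temperature=0.1):
    """InfoNCE-style loss over routing distributions (illustrative only).

    gates_a, gates_b: arrays of shape (N, E). Row i holds the routing
    probabilities over E experts for token i under view A and view B.
    Row i of each array is a positive pair; all other rows act as negatives.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    a = gates_a / np.linalg.norm(gates_a, axis=1, keepdims=True)
    b = gates_b / np.linalg.norm(gates_b, axis=1, keepdims=True)

    # Pairwise similarity between every view-A token and every view-B token.
    logits = a @ b.T / temperature                      # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy data (hypothetical): 4 tokens, 4 experts, each token prefers one expert.
scores = np.eye(4) * 5.0
gates = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Same routing in both views versus routing that disagrees across views.
consistent = routing_consistency_loss(gates, gates)
inconsistent = routing_consistency_loss(gates, gates[::-1])
```

In this toy setup the loss is lower when both views of a token select the same expert than when they disagree, which is the behavior the regularizer is meant to encourage. How the actual method defines positives (whole images versus overlapped patch tokens) and weights the term is not specified in the abstract.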