From Token Imbalance to Balanced Routing: An ELBO-Regularized Probabilistic Framework for Contrastive Multimodal Learning
TL;DR: CoPRIME is a probabilistic routing framework that tackles extreme token imbalance in multimodal learning, especially between spectrogram-tokenized audio and text, by combining contrastive pretraining with an ELBO-regularized mixture of experts.
Abstract: We introduce CoPRIME (Contrastive Probabilistic Routing for IMbalanced tokens with ELBO-regularized mixture of experts), a probabilistic routing framework that generalizes multimodal representation learning beyond vision-text by tackling the fundamental challenge of extreme token imbalance across modalities, an imbalance that is particularly pronounced between spectrogram-tokenized audio and text. CoPRIME augments contrastive pretraining with an ELBO-regularized routing objective that jointly promotes 1) expert specialization, requiring experts to explain the tokens they receive, and 2) diverse utilization via KL regularization to a uniform prior. To stabilize routing, we further replace standard coefficient-of-variation (CoV) based regularizers with entropy-based importance and load losses, yielding smoother gradients and flexible, modality-aware routing without rigid uniformity constraints. On the MOSEI and IEMOCAP datasets, CoPRIME achieves state-of-the-art zero- and few-shot emotion and sentiment recognition results, outperforming dense Transformers and prior multimodal MoE variants while retaining the efficiency of sparse conditional computation. Ablations isolate the role of each loss and show that the ELBO term is the primary driver of stable specialization under modality imbalance, with entropy-based regularizers further improving convergence and expert utilization.
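To make the routing objective concrete, the sketch below shows one plausible reading of the losses named in the abstract: an ELBO term combining expected expert log-likelihood (specialization) with a KL to a uniform prior (utilization), plus entropy-based importance and load regularizers. This is a minimal PyTorch sketch, not the authors' implementation; the function name coprime_routing_losses, the tensor shapes, and the hard top-1 load proxy are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def coprime_routing_losses(router_logits, expert_log_likelihoods, eps=1e-8):
    """Hypothetical sketch of the CoPRIME routing losses.

    router_logits: (num_tokens, num_experts) unnormalized gate scores.
    expert_log_likelihoods: (num_tokens, num_experts) log p(token | expert),
        i.e. how well each expert "explains" each token.
    """
    num_experts = router_logits.size(-1)
    gates = F.softmax(router_logits, dim=-1)  # routing posterior q(expert | token)

    # ELBO term 1: expected log-likelihood under the routing posterior
    # (expert specialization: experts must explain the tokens they receive).
    expected_ll = (gates * expert_log_likelihoods).sum(-1).mean()

    # ELBO term 2: KL(q(expert | token) || uniform prior) promotes
    # diverse expert utilization.
    log_uniform = -torch.log(torch.tensor(float(num_experts)))
    kl_to_uniform = (gates * (torch.log(gates + eps) - log_uniform)).sum(-1).mean()
    neg_elbo = -(expected_ll - kl_to_uniform)

    # Entropy-based importance loss: negative entropy of the mean soft gate
    # mass per expert; minimizing it spreads importance across experts
    # (in place of a CoV-of-importance regularizer).
    importance = gates.mean(0)
    importance_loss = (importance * torch.log(importance + eps)).sum()

    # Entropy-based load loss on hard top-1 assignment counts. Bincount is
    # non-differentiable, so this version is a monitoring proxy; a trainable
    # variant would need a smooth load estimator.
    top1 = gates.argmax(-1)
    load = torch.bincount(top1, minlength=num_experts).float()
    load = load / load.sum().clamp_min(eps)
    load_loss = (load * torch.log(load + eps)).sum()

    return neg_elbo, importance_loss, load_loss

# Usage with random inputs; the 0.1 loss weights are placeholders, not
# values reported in the paper.
logits = torch.randn(32, 4)
log_liks = torch.randn(32, 4)
neg_elbo, imp_loss, load_loss = coprime_routing_losses(logits, log_liks)
total_loss = neg_elbo + 0.1 * imp_loss + 0.1 * load_loss
```

Under this reading, the ELBO supplies the specialization pressure while the two entropy terms act only as auxiliary balancers, which is consistent with the ablation claim that the ELBO is the primary driver of stable specialization.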
Submission Number: 1062