Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment
Abstract: Multi-modal contrastive learning has gained significant attention in recent years due to the rapid growth of multi-modal data and the growing practical demand for applications such as multi-modal pre-training, retrieval, and classification. Most existing multi-modal representation learning methods require well-aligned multi-modal data (e.g., image-text pairs). This requirement, however, limits their applicability because real-world multi-modal data are often only partially aligned, consisting of a small set of well-aligned samples and a massive amount of unaligned ones. In this study, we propose a novel optimal transport-based method to enhance multi-modal contrastive learning on partially-aligned multi-modal data, providing an effective strategy for leveraging the information hidden in the unaligned data. The proposed method imposes an optimal transport (OT) regularizer within the multi-modal contrastive learning framework, aligning the latent representations of different modalities with consistency guarantees. We implement the OT regularizer in two ways, based on a consistency-regularized loop of pairwise Wasserstein distances and a Wasserstein barycenter problem, respectively. We analyze the rationale of our OT regularizer and compare its two implementations in depth. Experiments show that combining our OT regularizer with state-of-the-art contrastive learning methods leads to better performance on generalized zero-shot cross-modal retrieval and multi-modal classification tasks.
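To make the high-level idea concrete, below is a minimal NumPy sketch of one plausible reading of the first OT regularizer: an entropic Wasserstein distance computed with Sinkhorn iterations, summed around a loop of modality pairs. The function names, hyperparameters, and cost normalization here are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the OT regularizer idea from the abstract:
# entropic optimal transport (Sinkhorn iterations) between the latent
# representations of pairs of modalities, summed around a cycle.
# All names and hyperparameters below are illustrative assumptions.
import numpy as np

def sinkhorn_distance(X, Y, epsilon=0.1, n_iters=100):
    """Entropy-regularized Wasserstein distance between two embedding sets.

    X: (n, d) latent codes of one modality (e.g., image embeddings)
    Y: (m, d) latent codes of another modality (e.g., text embeddings)
    """
    n, m = X.shape[0], Y.shape[0]
    # Squared Euclidean cost matrix between the two embedding sets.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                        # normalize cost for numerical stability
    K = np.exp(-C / epsilon)               # Gibbs kernel
    a = np.full(n, 1.0 / n)                # uniform marginal over X
    b = np.full(m, 1.0 / m)                # uniform marginal over Y
    u = np.ones(n)
    for _ in range(n_iters):               # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]        # optimal transport plan
    return (P * C).sum()                   # transport cost <P, C>

def ot_loop_regularizer(latents, epsilon=0.1):
    """Sum of pairwise Wasserstein distances around the modality cycle
    A -> B -> C -> ... -> A, encouraging the latent distributions of
    all modalities to agree with each other."""
    k = len(latents)
    return sum(sinkhorn_distance(latents[i], latents[(i + 1) % k], epsilon)
               for i in range(k))

rng = np.random.default_rng(0)
Z = [rng.normal(size=(32, 16)) for _ in range(3)]  # toy latents for 3 modalities
print(ot_loop_regularizer(Z))  # would be added to the contrastive loss
```

In practice such a term would be computed on mini-batch embeddings with gradient support (e.g., in PyTorch) and added to the contrastive objective; the barycenter-based variant mentioned in the abstract would instead align each modality's latent distribution to a shared Wasserstein barycenter rather than to its neighbors in a loop.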