BaryBind: Binding All Modalities via Multimodal Wasserstein Barycenter Space

ICLR 2026 Conference Submission 433 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal representation learning; Inter-modality balance; Wasserstein barycenter; Video understanding.
TL;DR: We present BaryBind, which aligns all modalities to a Wasserstein barycenter with a volumetric loss for scalable and balanced multimodal learning.
Abstract: Multimodal joint representation, which aligns multiple modalities in a shared latent space, has emerged as the foundation of recent multimodal understanding models. To scale beyond two modalities, existing models typically treat a specific modality (e.g., text) as the anchor and bind the other modalities to it via pairwise contrastive losses. However, the learned joint representation space tends to be sub-optimal and imbalanced, as the modality-specific anchor may inherit modality bias and insufficiently capture the modality-agnostic semantics and holistic geometric structure of multimodal data. In this work, we are motivated by the intuition that multimodal representations arise as distinct shifts of an underlying modality-agnostic representation space. Based on this, we present **BaryBind**, a multimodal framework that aligns modalities in the multimodal Wasserstein barycenter (WB) space, which inherently models a modality-agnostic distribution by minimizing the average Wasserstein distance to all modalities. We further construct a barycenter polytope whose volume serves as a geometric metric for quantifying $n$-modality alignment. This metric is integrated into a barycenter-anchored volumetric contrastive loss that contrasts the volumes of the $n$-dimensional polytopes, encouraging global alignment of non-anchor modalities to the barycenter while reducing inter-modality gaps. Extensive experiments show that BaryBind delivers more balanced zero-shot generalization in downstream tasks, e.g., cross-modal text/video retrieval and classification.
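The abstract does not give the loss in closed form, so the sketch below is only a rough illustration of how a barycenter-anchored volumetric contrastive loss could look, assuming each modality is encoded into a shared $d$-dimensional space and per-sample barycenter embeddings are already available. The function names (`polytope_volume`, `volumetric_contrastive_loss`), the Gram-determinant volume, and the in-batch negative scheme are assumptions for illustration, not the paper's exact construction.

```python
# A minimal sketch (not the authors' released code) of a barycenter-anchored
# volumetric contrastive loss. The n-modality "polytope volume" around the
# barycenter is approximated here by the Gram determinant of the difference
# vectors between each modality embedding and the barycenter embedding.

import torch
import torch.nn.functional as F


def polytope_volume(embeddings: torch.Tensor, barycenter: torch.Tensor) -> torch.Tensor:
    """Squared volume of the parallelotope spanned by the n modality
    embeddings around the barycenter, via the Gram determinant.

    embeddings: (batch, n_modalities, dim); barycenter: (batch, dim).
    Returns: (batch,) non-negative volume scores.
    """
    diffs = embeddings - barycenter.unsqueeze(1)   # (B, n, d)
    gram = diffs @ diffs.transpose(1, 2)           # (B, n, n)
    # det(Gram) equals the squared parallelotope volume; clamp for stability.
    return torch.linalg.det(gram).clamp_min(0.0)


def volumetric_contrastive_loss(
    embeddings: torch.Tensor,   # (B, n_modalities, d)
    barycenter: torch.Tensor,   # (B, d)
    temperature: float = 0.07,
) -> torch.Tensor:
    """Contrast the matched n-tuple's volume against mismatched tuples built
    by pairing each sample's modalities with other samples' barycenters in
    the batch (one illustrative negative scheme). A smaller volume means
    tighter n-modality alignment, so negative volume serves as the logit."""
    B = embeddings.size(0)
    # logits[i, j] = -volume of sample i's modalities around barycenter j.
    # The O(B^2) loop is kept for clarity in this sketch.
    logits = torch.stack(
        [-polytope_volume(embeddings, barycenter[j].expand(B, -1)) for j in range(B)],
        dim=1,
    ) / temperature
    targets = torch.arange(B, device=embeddings.device)
    return F.cross_entropy(logits, targets)
```

In high-dimensional spaces with normalized embeddings, raw Gram determinants can underflow, so a practical variant might contrast log-volumes (`torch.linalg.slogdet` of the Gram matrix) instead; this is a numerical design choice, not something the abstract specifies.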
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 433