Keywords: multimodal learning, modality imbalance, supervised contrastive learning, prototypes
TL;DR: We propose a new prototype-guided multimodal representation learning framework that aligns unimodal and fused embeddings with a shared simplex geometry while adaptively balancing modality contributions, achieving SOTA results across five benchmarks.
Abstract: Multimodal learning often suffers from modality imbalance, where dominant modalities overshadow weaker ones and unimodal encoders lack a shared representational goal.
We propose a new end-to-end multimodal supervised contrastive learning framework, Prototype-guided Modality contribution Balancing (ProMoBal), that integrates prototype-centered multimodal representation learning with sample-adaptive fusion.
At its core, ProMoBal enforces a regular simplex geometry for multimodal representation learning,
in which class prototypes are symmetrically arranged on a shared hypersphere that spans both the unimodal and fused representation spaces.
This geometry provides a common reference for aligning unimodal and fused embeddings,
while the proposed adaptive fusion mechanism mitigates modality imbalance on a per-sample basis.
Extensive experiments with five benchmark datasets---spanning audio–video, image–text, and three-modality gesture recognition---show that ProMoBal consistently outperforms state-of-the-art multimodal supervised learning methods, achieving accuracy gains of up to 21% over unimodal baselines.
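The shared simplex geometry described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' implementation; it is a minimal, hypothetical construction of class prototypes arranged as a regular simplex (a simplex equiangular tight frame) on the unit hypersphere, the kind of symmetric arrangement the abstract refers to. The function name and embedding scheme are illustrative assumptions.

```python
import numpy as np

def simplex_etf_prototypes(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Illustrative sketch (not the paper's code): return unit-norm class
    prototypes forming a regular simplex on the hypersphere in `dim` dims.
    All pairwise cosine similarities equal -1/(num_classes - 1)."""
    C = num_classes
    assert dim >= C - 1, "need dim >= num_classes - 1 to embed the simplex"
    # Center the identity: C equidistant points in a (C-1)-dim subspace.
    M = np.eye(C) - np.ones((C, C)) / C
    M /= np.linalg.norm(M, axis=1, keepdims=True)  # project onto unit sphere
    # Rotate into the target dimension with an orthonormal map (preserves angles).
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, C)))  # Q: (dim, C), orthonormal cols
    return M @ Q.T  # shape (C, dim)

prototypes = simplex_etf_prototypes(num_classes=5, dim=16)
```

Both unimodal and fused embeddings could then be aligned to these fixed prototypes (e.g., via a supervised contrastive or cosine loss), giving all encoders the common representational target the abstract describes.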
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16251