Keywords: multimodal learning, modality imbalance, supervised contrastive learning, prototypes
TL;DR: We propose a new prototype-guided multimodal representation learning framework that aligns unimodal and fused embeddings with a shared simplex geometry while adaptively balancing modality contributions, achieving SOTA results across five benchmarks.
Abstract: Multimodal learning often suffers from modality imbalance, where dominant modalities overshadow weaker ones and unimodal encoders lack a shared representational goal.
We propose a new end-to-end multimodal supervised contrastive learning framework, Prototype-guided Modality contribution Balancing (ProMoBal), that integrates prototype-centered multimodal representation learning with sample-adaptive fusion.
At its core, ProMoBal enforces a regular simplex geometry for multimodal representation learning,
in which class prototypes are symmetrically arranged on a shared hypersphere that spans both the unimodal and fused representation spaces.
This geometry provides a common reference for aligning unimodal and fused embeddings,
while the proposed adaptive fusion mechanism mitigates modality imbalance on a per-sample basis.
Extensive experiments with five benchmark datasets---spanning audio–video, image–text, and three-modality gesture recognition---show that ProMoBal consistently outperforms state-of-the-art multimodal supervised learning methods, achieving accuracy gains of up to 21% over unimodal baselines.
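The shared simplex geometry described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' implementation; it is a minimal, hypothetical construction of class prototypes arranged as a regular simplex (a simplex equiangular tight frame) on the unit hypersphere, the kind of symmetric arrangement the abstract refers to. The function name and embedding scheme are illustrative assumptions.

```python
import numpy as np

def simplex_etf_prototypes(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Illustrative sketch (not the paper's code): return unit-norm class
    prototypes forming a regular simplex on the hypersphere in `dim` dims.
    All pairwise cosine similarities equal -1/(num_classes - 1)."""
    C = num_classes
    assert dim >= C - 1, "need dim >= num_classes - 1 to embed the simplex"
    # Center the identity: C equidistant points in a (C-1)-dim subspace.
    M = np.eye(C) - np.ones((C, C)) / C
    M /= np.linalg.norm(M, axis=1, keepdims=True)  # project onto unit sphere
    # Rotate into the target dimension with an orthonormal map (preserves angles).
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, C)))  # Q: (dim, C), orthonormal cols
    return M @ Q.T  # shape (C, dim)

prototypes = simplex_etf_prototypes(num_classes=5, dim=16)
```

Both unimodal and fused embeddings could then be aligned to these fixed prototypes (e.g., via a supervised contrastive or cosine loss), giving all encoders the common representational target the abstract describes.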
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16251