# ProMoBal: Prototype-guided Modality Balancing in multimodal contrastive learning


# Abstract

Multimodal learning often suffers from *modality imbalance*, where dominant modalities overshadow weaker ones and unimodal encoders lack a shared representational goal.
We propose a new end-to-end multimodal supervised contrastive learning framework, Prototype-guided Modality contribution Balancing (ProMoBal), that integrates prototype-centered multimodal representation learning with sample-adaptive fusion. 
At its core, ProMoBal promotes a new regular simplex geometry for multimodal representation learning, 
where class prototypes are symmetrically arranged on a shared hypersphere that consistently spans both unimodal and fused representation spaces.
This geometry provides a common reference for aligning unimodal and fused embeddings, 
while the proposed adaptive fusion mechanism mitigates modality imbalance on a per-sample basis.
Extensive experiments on five benchmark datasets, spanning audio–video, image–text, and three-modality gesture recognition, show that ProMoBal consistently outperforms state-of-the-art multimodal supervised learning methods, achieving accuracy gains of up to 21% over unimodal baselines.
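
To make the regular simplex geometry concrete, the sketch below builds one set of class prototypes that lie on the unit hypersphere and are symmetrically arranged, so every pair has the same cosine similarity of -1/(K-1) for K classes. It is a dependency-free illustration of the general construction, not the code used in this repository: the helper names `simplex_prototypes` and `dot`, and the choice of K = 6, are assumptions made for the example.

```python
import math


def simplex_prototypes(num_classes):
    """Return num_classes unit vectors that form a regular simplex.

    Hypothetical helper for illustration; not part of this repository.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    k = num_classes
    # Centre the one-hot basis vectors, then rescale them to unit length.
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


protos = simplex_prototypes(6)  # K = 6 is an arbitrary example value
print("squared norm:", round(dot(protos[0], protos[0]), 6))
print("pairwise cosine:", round(dot(protos[0], protos[1]), 6))
```

Because the prototypes are fixed and equally separated, the same set can serve as a common target for unimodal and fused embeddings. How ProMoBal actually constructs and uses its prototypes is defined in the training scripts.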


# Environment Setup

To set up the environment for this project, ensure you have the following dependencies:

- PyTorch
- torchvision
- and other required libraries
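
As a quick sanity check before training, the following sketch reports whether the two dependencies named above can be imported in the current environment. The helper `missing_packages` is a hypothetical name introduced for this example, and the check covers only `torch` and `torchvision`, since the remaining libraries are not listed here.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of names that cannot be found as importable packages.

    Hypothetical helper for illustration; not part of this repository.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# "torch" is the import name of PyTorch.
missing = missing_packages(["torch", "torchvision"])
print("missing:", ", ".join(missing) if missing else "none")
```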


# Datasets

- CREMA-D
- Kinetics-Sounds
- Twitter15
- Sarcasm
- NVGesture


# Usage

## CREMA-D and Kinetics-Sounds datasets

To train the model using this method, run:

```bash
python main.py
```

To perform inference with the trained model, run:

```bash
python main.py --train False
```


## NVGesture dataset

To train the model using this method, run:

```bash
python main_nv.py
```

To perform inference with the trained model, run:

```bash
python main_nv.py --train False
```
