M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: We propose M3-JEPA, which integrates a multi-gate MoE predictor into JEPA, and show its theoretical optimality and strong performance on multimodal tasks.
Abstract: Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to integrate with a pretrained language model backbone, but may result in modality collapse. To alleviate this issue, we apply the Joint-Embedding Predictive Architecture (JEPA) to multimodal tasks: a predictor maps the input embedding into the output embedding space, and cross-modal alignment is then performed in the latent space. We implement this predictor with a Multi-Gate Mixture of Experts (MMoE) and accordingly name the framework M3-JEPA. The gating function disentangles modality-specific and shared information, and we derive its information-theoretic optimality. The framework is trained with both contrastive and regularization losses, and optimized by alternating gradient descent (AGD) across different multimodal tasks. Through thoroughly designed experiments, we show that M3-JEPA obtains state-of-the-art performance on different modalities and tasks, generalizes to unseen datasets and domains, and is computationally efficient in both training and inference. Our observations suggest that M3-JEPA might become a new basis for self-supervised learning in the open world.
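To make the described design concrete, below is a minimal sketch (not the authors' code) of a multi-gate MoE predictor operating in latent space: a shared pool of expert MLPs with one gate per task direction maps an input-modality embedding into the target embedding space, where a contrastive loss aligns it with the target embedding. Names such as `MultiGateMoEPredictor`, `info_nce`, `num_experts`, and `task_id` are illustrative assumptions, and the regularization term used in the paper is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGateMoEPredictor(nn.Module):
    """Sketch of a multi-gate MoE predictor mapping between latent spaces."""

    def __init__(self, dim: int, num_experts: int = 4, num_tasks: int = 2, hidden: int = 512):
        super().__init__()
        # Shared pool of expert MLPs operating in the latent space.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        # One gate per task/direction (e.g., text-to-image vs. image-to-text),
        # so modality-specific and shared information can be routed differently.
        self.gates = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        weights = F.softmax(self.gates[task_id](x), dim=-1)            # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)         # (B, D)


def info_nce(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive alignment between predicted and target embeddings."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    predictor = MultiGateMoEPredictor(dim=256)
    x_emb = torch.randn(8, 256)   # embeddings of the input modality (e.g., from a frozen encoder)
    y_emb = torch.randn(8, 256)   # embeddings of the target modality
    pred = predictor(x_emb, task_id=0)
    loss = info_nce(pred, y_emb)  # alignment loss in the latent space
    loss.backward()
```

In this sketch, alternating between `task_id` values during training stands in for the alternating gradient descent over different multimodal tasks mentioned in the abstract.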
Lay Summary: Modern AI systems need to connect different types of information, such as images, text, and audio. Current methods generally predict one type of information directly from another, usually at the word or pixel level. However, they sometimes suffer from sampling noise, information bias, or content ambiguity, and therefore produce inaccurate results. This work introduces M3-JEPA, a new AI method that better connects different types of information by understanding their content in a deeper, hidden representation space rather than at the surface level. We also employ a special system called a "mixture of experts", which lets different networks specialize in different kinds of knowledge and automatically selects the most appropriate expert for each situation. We show that M3-JEPA performs well in experiments, adapts well to unseen conditions, and is computationally efficient. Overall, M3-JEPA offers a new path toward artificial general intelligence through better modeling of the natural world.
Link To Code: https://github.com/HongyangLL/M3-JEPA
Primary Area: Deep Learning->Self-Supervised Learning
Keywords: JEPA, MoE, multimodal, alignment
Submission Number: 15673