M3E: A Unified Framework for Large-Scale Multimodal Embedding via Multi-Task Mixture-of-Experts

14 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Embedding, Large Vision-Language Models, Contrastive Learning
Abstract: Universal multimodal embeddings are crucial for enabling downstream tasks such as cross-modal retrieval and retrieval-augmented generation. Given the powerful semantic understanding capabilities of Large Vision-Language Models (LVLMs), leveraging them for embedding learning has emerged as a new paradigm. Recent research has primarily focused on prompt engineering or synthesizing high-quality training samples to enhance embedding quality. Although significant progress has been made, these methods often overlook the task diversity inherent in general-purpose embedding learning. This leads to two major issues: (1) the presence of too many easy or false negative samples degrades the discriminative power of the learned representations; (2) diverse training tasks can cause task conflict and forgetting. In this paper, we propose a unified multimodal multi-task embedding framework $\mathrm{M^3E}$ that integrates innovations at both the data and model levels. On the data side, we employ a Hard Negative-Aware Sample Scheduler (HNASS) module to increase the proportion of hard negative samples. In addition, to reduce easy negative samples within a batch, we ensure that all samples in a batch come from the same task dataset. Since optimization for different tasks should be decoupled to avoid task conflicts, on the model side we design a Task-wise Low-Rank Mixture-of-Experts (Task-wise MoE) module that allocates task-specific experts to capture specialized representations, while shared experts learn generalizable cross-task knowledge. This effectively mitigates inter-task conflicts and improves the stability of multi-task learning. Extensive experiments demonstrate that our method significantly improves the embedding performance of LVLMs across 36 tasks. Our code will be released.
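To make the shared/task-specific split of the Task-wise Low-Rank MoE more concrete, below is a minimal PyTorch sketch. It assumes LoRA-style low-rank experts routed by the task id of a task-homogeneous batch; all class and parameter names (`LowRankExpert`, `TaskwiseLowRankMoE`, `rank`, `num_shared`) are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a task-wise low-rank MoE layer (illustrative, not the paper's code).
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """A low-rank (LoRA-style) expert: x -> up(down(x)) with rank r << d_model."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class TaskwiseLowRankMoE(nn.Module):
    """Shared experts are always active (cross-task knowledge); one task-specific
    expert is selected by the task id of the current batch (specialized knowledge)."""

    def __init__(self, d_model: int, num_tasks: int, num_shared: int = 2, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.shared = nn.ModuleList([LowRankExpert(d_model, rank) for _ in range(num_shared)])
        self.task_experts = nn.ModuleList([LowRankExpert(d_model, rank) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        out = self.base(x)
        for expert in self.shared:                      # generalizable cross-task knowledge
            out = out + expert(x)
        out = out + self.task_experts[task_id](x)       # task-specific representation
        return out


if __name__ == "__main__":
    layer = TaskwiseLowRankMoE(d_model=768, num_tasks=4)
    hidden = torch.randn(32, 768)                       # a task-homogeneous batch
    emb = layer(hidden, task_id=2)
    print(emb.shape)  # torch.Size([32, 768])
```

Because each batch comes from a single task dataset, routing can use a single task id per batch rather than per-token gating, which is what keeps the task-specific optimization paths decoupled in this sketch.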
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5027