Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework
Abstract: We present the Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches have demonstrated impressive capabilities in multimodal and multitask scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution lies in the development of a task-aware contrastive learning framework with an automated Bayesian weighting mechanism. This approach provides a principled way to balance multiple tasks during training, departing from conventional contrastive learning strategies. We further enhance the framework through a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on the M-BEIR and ICinW benchmarks demonstrate the effectiveness of M3T-UEM, showing competitive or superior performance compared with both traditional encoder-based methods and recent LLM-based approaches. Furthermore, owing to the incorporation of an LLM backbone, the method demonstrates particular strengths in handling compositional conceptual changes and multilingual scenarios, drastically outperforming CLIP in zero-shot settings, often by orders of magnitude.
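For concreteness, the following is a minimal sketch of what an automated Bayesian task-weighting mechanism for multi-task contrastive training could look like, assuming a homoscedastic-uncertainty formulation with learned per-task log-variances (in the style of Kendall et al.); the class name, the InfoNCE formulation, and the temperature value are illustrative assumptions, not necessarily the paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskWeightedContrastiveLoss(nn.Module):
    """Multi-task InfoNCE loss with learned per-task uncertainty weights.

    Each task t contributes exp(-s_t) * L_t + s_t, where s_t is a learned
    log-variance, so tasks the model finds noisier are automatically
    down-weighted. This is a homoscedastic-uncertainty sketch (an assumption),
    not the paper's exact formulation.
    """

    def __init__(self, num_tasks: int, temperature: float = 0.07):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_t per task
        self.temperature = temperature

    def info_nce(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (batch, dim) L2-normalized embeddings; positives lie on the diagonal.
        logits = q @ k.t() / self.temperature
        targets = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, targets)

    def forward(self, queries, keys, task_ids) -> torch.Tensor:
        # queries/keys: lists of per-task embedding batches; task_ids: their task indices.
        total = torch.zeros((), device=self.log_vars.device)
        for q, k, t in zip(queries, keys, task_ids):
            task_loss = self.info_nce(F.normalize(q, dim=-1), F.normalize(k, dim=-1))
            # Uncertainty-weighted sum: exp(-s_t) scales the loss, +s_t regularizes.
            total = total + torch.exp(-self.log_vars[t]) * task_loss + self.log_vars[t]
        return total
```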
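Similarly, one plausible reading of the multiple-token summarization strategy is sketched below, assuming K learnable summary tokens appended to the LLM input sequence whose final hidden states are mean-pooled into a single retrieval embedding; the token count, the pooling choice, and the module name are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class MultiTokenSummarizer(nn.Module):
    """Appends K learnable summary tokens to the LLM input and pools their
    final hidden states into one embedding. K and mean-pooling are
    illustrative assumptions; the paper's extraction strategy may differ."""

    def __init__(self, hidden_dim: int, num_summary_tokens: int = 4):
        super().__init__()
        self.summary_tokens = nn.Parameter(
            torch.randn(num_summary_tokens, hidden_dim) * 0.02
        )

    def append_tokens(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim) input embeddings for the LLM.
        batch = token_embeds.size(0)
        summary = self.summary_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeds, summary], dim=1)

    def pool(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len + K, hidden_dim) LLM outputs;
        # pool the trailing K summary positions into one embedding.
        k = self.summary_tokens.size(0)
        return hidden_states[:, -k:, :].mean(dim=1)
```

In this reading, using several summary tokens rather than a single end-of-sequence token gives the causal LLM multiple positions over which to aggregate the sequence before pooling.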