Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM): A Task-Adaptive Representation Learning Framework
Abstract: We present the Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches have demonstrated impressive capabilities in multimodal and multitask scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution lies in the development of a task-aware contrastive learning framework with an automated Bayesian weighting mechanism. This approach provides a principled way to balance multiple tasks during training, departing from conventional contrastive learning strategies. We further enhance the framework through a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on the M-BEIR and ICinW benchmarks demonstrate the effectiveness of M3T-UEM, showing competitive or superior performance compared with both traditional encoder-based methods and recent LLM-based approaches. Furthermore, owing to the incorporation of an LLM backbone, the method demonstrates particular strengths in handling compositional conceptual changes and multilingual scenarios, drastically outperforming CLIP in zero-shot settings, often by orders of magnitude.
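For concreteness, the following is a minimal sketch of what an automated Bayesian task-weighting mechanism for multi-task contrastive training could look like, assuming a homoscedastic-uncertainty formulation with learned per-task log-variances (in the style of Kendall et al.); the class name, the InfoNCE formulation, and the temperature value are illustrative assumptions, not necessarily the paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskWeightedContrastiveLoss(nn.Module):
    """Multi-task InfoNCE loss with learned per-task uncertainty weights.

    Each task t contributes exp(-s_t) * L_t + s_t, where s_t is a learned
    log-variance, so tasks the model finds noisier are automatically
    down-weighted. This is a homoscedastic-uncertainty sketch (an assumption),
    not the paper's exact formulation.
    """

    def __init__(self, num_tasks: int, temperature: float = 0.07):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_t per task
        self.temperature = temperature

    def info_nce(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (batch, dim) L2-normalized embeddings; positives lie on the diagonal.
        logits = q @ k.t() / self.temperature
        targets = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, targets)

    def forward(self, queries, keys, task_ids) -> torch.Tensor:
        # queries/keys: lists of per-task embedding batches; task_ids: their task indices.
        total = torch.zeros((), device=self.log_vars.device)
        for q, k, t in zip(queries, keys, task_ids):
            task_loss = self.info_nce(F.normalize(q, dim=-1), F.normalize(k, dim=-1))
            # Uncertainty-weighted sum: exp(-s_t) scales the loss, +s_t regularizes.
            total = total + torch.exp(-self.log_vars[t]) * task_loss + self.log_vars[t]
        return total
```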
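Similarly, one plausible reading of the multiple-token summarization strategy is sketched below, assuming K learnable summary tokens appended to the LLM input sequence whose final hidden states are mean-pooled into a single retrieval embedding; the token count, the pooling choice, and the module name are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class MultiTokenSummarizer(nn.Module):
    """Appends K learnable summary tokens to the LLM input and pools their
    final hidden states into one embedding. K and mean-pooling are
    illustrative assumptions; the paper's extraction strategy may differ."""

    def __init__(self, hidden_dim: int, num_summary_tokens: int = 4):
        super().__init__()
        self.summary_tokens = nn.Parameter(
            torch.randn(num_summary_tokens, hidden_dim) * 0.02
        )

    def append_tokens(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim) input embeddings for the LLM.
        batch = token_embeds.size(0)
        summary = self.summary_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeds, summary], dim=1)

    def pool(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len + K, hidden_dim) LLM outputs;
        # pool the trailing K summary positions into one embedding.
        k = self.summary_tokens.size(0)
        return hidden_states[:, -k:, :].mean(dim=1)
```

In this reading, using several summary tokens rather than a single end-of-sequence token gives the causal LLM multiple positions over which to aggregate the sequence before pooling.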